                  The OAI Protocol for Metadata Harvesting

                                      Andy Powell
                               a.powell@ukoln.ac.uk
                             UKOLN, University of Bath
                          IVOA Registry Meeting, London
                                   March 2003


a centre of expertise in digital information management
www.ukoln.ac.uk
    Contents
    • a brief history of OAI
    • 10 technical things you should know
      about the OAI-PMH




    OAI roots
    • the roots of OAI lie in the development
      of eprint archives…
      – arXiv, CogPrints, NACA (NASA), RePEc, NDLTD,
        NCSTRL
    • each offered Web interface for deposit
      of articles and for end-user searches
    • difficult for end-users to work across
      archives without having to learn multiple
      different interfaces
    • recognised need for single search
      interface to all archives
      – Universal Pre-print Service (UPS)
    Searching vs. harvesting
    • two possible approaches to building a
      single search interface to multiple eprint
      archives…
       – cross-searching multiple archives based on protocol
         like Z39.50
       – harvesting metadata into one or more ‘central’
         services – bulk move data to the user-interface
    • US digital library experience in this area
      indicated that cross-searching not
      preferred approach
       – distributed searching of N nodes viable, but only for
         small values of N

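    Searching vs. harvesting
    [Diagram: two alternative architectures, each built around a ‘search service’: cross-searching the archives …or… searching harvested metadata]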


    Harvesting requirements
    • in order that harvesting approach can work
      there need to be agreements about…
      – transport protocols – HTTP vs. FTP vs. …
      – metadata formats – DC vs. MARC vs. …
      – quality assurance – mandatory elements,
        mechanisms for naming of people, subjects,
        etc., handling duplicated records, best-practice
      – intellectual property and usage rights – who
        can do what with the records
    • work in this area resulted in the “Santa Fe
      Convention”
    Development of OAI-PMH
    • two-year metamorphosis through various names
      – Santa Fe Convention, OAI-PMH versions 1.0, 1.1…
      – OAI Protocol for Metadata Harvesting 2.0
    • development steered by international
      technical committee
    • inter-version stability helped developer
      confidence
    • move from focus on eprints to more
      generic protocol
      – move from OAI-specific metadata schema to mandatory
        support for DC

    Bluffer’s guide to OAI
                          http://www.openarchives.org/
    1. OAI-PMH is a low-cost mechanism for
       harvesting metadata records
       – from ‘data providers’ to ‘service providers’
    2. allows ‘service provider’ to say ‘give me
       some or all of your metadata records’
       – where ‘some’ is based on date-stamps, sets,
         metadata formats
    3. not limited to repositories of eprints
       – images, museum artefacts, learning objects, …
    4. based on HTTP and XML
       – simple, Web-friendly, autonomous
       – fast, flexible deployment
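
Points 2 and 4 can be illustrated with a minimal sketch of how a service provider might form a selective harvesting request as a plain HTTP GET URL. The base URL and the set name `physics` are hypothetical; the argument names (`verb`, `metadataPrefix`, `from`, `set`) are those of OAI-PMH 2.0 as I understand it.

```python
from urllib.parse import urlencode

# Hypothetical repository base URL, used only for illustration.
BASE_URL = "http://repository.example.org/oai"


def build_request(base_url, verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    params = {"verb": verb}
    for name, value in arguments.items():
        if value is not None:
            # 'from' is a Python keyword, so callers pass 'from_' instead.
            params[name.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


# "Give me some of your records": Dublin Core records in the (hypothetical)
# 'physics' set whose date-stamp is 2003-01-01 or later.
url = build_request(
    BASE_URL,
    "ListRecords",
    metadataPrefix="oai_dc",
    from_="2003-01-01",
    set="physics",
)
print(url)
```

The response to such a request is an XML document, which is why nothing beyond an HTTP client and an XML parser is needed on the harvesting side.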
    Bluffer’s guide to OAI
    5. OAI-PMH is not a search protocol
      – but its use can underpin search-based services
        based on Z39.50 or SRW or SOAP or…
    6. OAI-PMH carries only metadata
      – content (e.g. full-text or image) made available
        separately – typically at URL in metadata
    7. mandates simple DC as record format
      – but extensible to any XML format – IMS,
        ONIX, MARC, METS, etc.
    8. extensible framework for metadata about
      – repository, resources, ‘items’, sets
      – can include rights metadata
     Bluffer’s guide to OAI
     9. metadata and ‘content’ often made freely
        available – but not a requirement
       – OAI-PMH can be used between closed
         groups
       – or, can make metadata available but restrict
         access to content in some way
     10. underlying HTTP protocol provides
       – access control – e.g. HTTP BASIC
       – compression mechanisms (for improving
         performance of harvesters)
       – could, in theory, also provide encryption if
         required
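
As a rough illustration of point 10, the sketch below prepares an HTTP request that asks for a gzip-compressed response and carries HTTP BASIC credentials. It only builds the request and does not contact any server; the URL, user name and password are hypothetical.

```python
import base64
import gzip
import urllib.request


def prepare_request(url, username=None, password=None):
    """Build a request that accepts gzip and optionally sends BASIC credentials."""
    request = urllib.request.Request(url)
    # Ask the repository to compress its response if it is able to.
    request.add_header("Accept-Encoding", "gzip")
    if username is not None:
        credentials = f"{username}:{password}".encode("utf-8")
        token = base64.b64encode(credentials).decode("ascii")
        request.add_header("Authorization", "Basic " + token)
    return request


def decode_body(body, content_encoding):
    """Undo gzip compression when the response headers say it was applied."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body


# Hypothetical repository and credentials, for illustration only.
request = prepare_request(
    "http://repository.example.org/oai?verb=Identify", "harvester", "secret"
)
```

A harvester would pass the prepared request to `urllib.request.urlopen` and hand the body, together with the response's `Content-Encoding` header, to `decode_body`. Using an `https` URL is the usual way to obtain encryption from the transport.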
      Resources, items and records
     [Diagram: three levels, top to bottom]
     • resource
     • item (= identifier) – all available metadata about David
     • records – Dublin Core metadata, MARC metadata, SPECTRUM metadata
     Protocol requests
     • six different request types
        –   Identify
        –   ListMetadataFormats
        –   ListSets
        –   ListIdentifiers
        –   ListRecords
        –   GetRecord
     • harvester need not use all types
     • repository must implement all types
     • required and optional arguments
        – on request types
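
The required and optional arguments can be captured in a small lookup table. The sketch below reflects my reading of OAI-PMH 2.0 and leaves out the flow-control argument (`resumptionToken`) for brevity; the function name and message wording are illustrative only.

```python
# Request type -> (required arguments, optional arguments).
ARGUMENTS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
}


def check_request(verb, arguments):
    """Return a list of problems; an empty list means the request looks valid."""
    if verb not in ARGUMENTS:
        return [f"unknown request type: {verb}"]
    required, optional = ARGUMENTS[verb]
    supplied = set(arguments)
    problems = []
    for name in sorted(required - supplied):
        problems.append(f"missing required argument: {name}")
    for name in sorted(supplied - required - optional):
        problems.append(f"argument not allowed: {name}")
    return problems


print(check_request("GetRecord", {"identifier": "oai:repository.example.org:1234"}))
```

A repository could run a check of this kind before dispatching a request, and a harvester could use the same table to avoid sending requests that are bound to be rejected.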
     Record structure
     •   metadata about a resource in a
         particular XML format
          • header (mandatory)
              •   identifier (1)
              •   datestamp (1)
              •   setSpec elements (*)
              •   status attribute for deleted item (?)
          • metadata (mandatory)
              •   XML encoded metadata within root tag
                  which provides namespace and schema
              •   repositories must support Dublin Core
          • about (optional)
              •   rights statements
              •   provenance statements
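
To make the structure concrete, here is a sketch that parses a hand-written sample record with Python's standard XML library. The identifier, set names and title are invented for illustration; the namespaces are those of OAI-PMH 2.0 and oai_dc as I understand them.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hand-written sample record; all values are hypothetical.
SAMPLE_RECORD = """\
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:repository.example.org:1234</identifier>
    <datestamp>2003-03-01</datestamp>
    <setSpec>physics</setSpec>
    <setSpec>physics:astro</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example title</dc:title>
    </oai_dc:dc>
  </metadata>
</record>
"""


def parse_record(xml_text):
    """Summarise the header, metadata and about parts of one record."""
    record = ET.fromstring(xml_text)
    header = record.find(OAI + "header")
    metadata = record.find(OAI + "metadata")
    # The root tag inside <metadata> carries the namespace of the format used.
    has_root = metadata is not None and len(metadata) > 0
    return {
        "identifier": header.findtext(OAI + "identifier"),
        "datestamp": header.findtext(OAI + "datestamp"),
        "sets": [element.text for element in header.findall(OAI + "setSpec")],
        "deleted": header.get("status") == "deleted",
        "metadata_root": metadata[0].tag if has_root else None,
        "about_count": len(record.findall(OAI + "about")),
    }


summary = parse_record(SAMPLE_RECORD)
```

The sketch tolerates a missing metadata part because, in OAI-PMH 2.0, a record whose header is marked as deleted is returned without one.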
     Dublin Core
                                  http://dublincore.org/

     • OAI-PMH mandates use of simple DC
       as lowest common denominator
     • agreed XML schema – ‘oai_dc’
        – simple DC – 15 metadata properties
        – all DC properties optional and repeatable

     Title          Contributor     Source
     Creator        Date            Language
     Subject        Type            Relation
     Description    Format          Coverage
     Publisher      Identifier      Rights
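
Because every property is optional and repeatable, a simple DC description maps naturally onto a dictionary of lists. The sketch below serialises such a dictionary as an oai_dc fragment; the function name and sample values are hypothetical, and the schema location attribute is left out for brevity.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 simple DC properties.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def to_oai_dc(properties):
    """Serialise {property: [values]} as an oai_dc XML fragment."""
    unknown = set(properties) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not simple DC properties: {sorted(unknown)}")
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name in DC_ELEMENTS:
        # Optional: a missing property yields no element.
        # Repeatable: each value yields its own element.
        for value in properties.get(name, []):
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_oai_dc({
    "title": ["An example title"],
    "creator": ["First Author", "Second Author"],
})
```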
     OAI demonstration
     • repository explorer demo




     OAI and Google
     [Diagram: Web site(s), eprint archive(s), multimedia database(s); DP9 gateway]
     • OAI gateway makes harvested metadata available to Google…
     Implementing OAI
     • OAI protocol is relatively simple
     • implementation and deployment tends
       to be very fast
     • lots of available toolkits
       – Java, Perl, PHP, etc.
     • complete tools also available
       – e.g. tools that sit in front of
         existing databases
     • see ‘tools’ area on the
       OAI Web site…

     Creative Commons
                        http://www.creativecommons.org/
     • CC is “devoted to expanding the range
       of creative work available for others to
       build upon and share”
     • provides ‘standard’ licences for content
       –   attribution
       –   noncommercial
       –   no derivative works
       –   share alike
     • mechanisms for indicating licence on
       Web pages
     • need similar mechanism in OAI
     Questions…



