
OAI Overview


An Overview of OAI & OAI-PMH

by Filbert Minj & Francis Jayakanth NCSI, IISc

Agenda
• Part I – Overview of the OAI
• Part II – Overview of the OAI-PMH
• High-Tea Break
• Part III – Demo of Harvesting of metadata and Searching
• Discussion?

Part I – Overview of the OAI
• General Information
• The Journal System
• Growth of ePrint Archives
• The ePrint System
• The UPS Prototype
• The Dawn of OAI
• Important Resources

Most Relevant Resource
• Open Archives Forum
– http://www.oaforum.org/tutorial/index.php

[This presentation is based to a great extent on the tutorial available at the above-mentioned URL. Several slides from that site have been incorporated into this ppt file]

General Information
• ePrints
– ePrints are commonly defined as research articles in electronic form (with an underlying assumption that they are available online)
• Preprints (before peer review)
• Postprints (final, revised, refereed, and accepted draft)

General Information …
• Repository
– a repository is a network accessible server that holds ePrints

• Archive
– is generally accepted as a synonym for repository

General Information …
• ePrint Archive – An established medium to communicate non-peer reviewed scholarly literature (preprints)

General Information …
• Metadata
– structured information about resources [descriptive information about an object or a resource, whether it is in physical or electronic form]

General Information …
• DC (Dublin Core)
– is a metadata format defined on the basis of international consensus. The DC Metadata Element Set defines fifteen elements for simple resource description and discovery
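
As a small illustration of the fifteen-element set mentioned above, the following sketch lists the simple Dublin Core element names and checks a record against them. The sample record and the helper `unknown_elements` are assumptions made for illustration only.

```python
# The fifteen elements of the simple (unqualified) Dublin Core element set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# A hypothetical record: every element is optional and may repeat.
record = {
    "title": ["An Overview of OAI & OAI-PMH"],
    "creator": ["Filbert Minj", "Francis Jayakanth"],
    "type": ["Presentation"],
}


def unknown_elements(rec):
    """Return the keys of rec that are not simple Dublin Core elements."""
    return sorted(key for key in rec if key not in DC_ELEMENTS)


print(len(DC_ELEMENTS))
print(unknown_elements(record))
```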

General Information …
• OAI (Open Archives Initiative)
– OAI is an initiative to develop and promote interoperability standards that aim to facilitate the efficient dissemination of content.

General Information …
• Protocol
– a protocol is a set of rules defining communication between systems. FTP (File Transfer Protocol) and HTTP (Hypertext Transfer Protocol) are examples of protocols used for communication between systems across the Internet.

General Information …
• OAI-PMH (OAI Protocol for Metadata Harvesting)
– OAI-PMH is a lightweight harvesting protocol for sharing metadata between services.

General Information …
• Data Provider
– a Data Provider maintains one or more repositories (web servers) that support the OAI-PMH as a means of exposing metadata.

• Service Provider
– a Service Provider issues OAI-PMH requests to data providers and uses the metadata as a basis for building value-added services.

General Information …
• Harvesting
– in the OAI context, harvesting refers specifically to the gathering together of metadata from a number of distributed repositories into a combined data store

General Information …
• Interoperability
– is the ability of systems, services and organizations to work together seamlessly toward common or diverse goals. In the technical arena it is supported by open standards for communication between systems and for description of resources and collections, among others. Interoperability is considered here primarily in the context of resource discovery and access.

General Information …
• XML (Extensible Markup Language)
– it defines a means of describing data. XML can be validated against a DTD or schema setting out the elements of the language created

• DTD (Document Type Definition)
– a DTD is a formal specification of the structure of a document

The Journal System
• Significant challenges to the journal system
– Explosive growth of the Internet
– Publication delay
– Full transfer of rights by authors to publishers
– The implementation of peer review and
– Skyrocketing of subscription prices

• Challenges have resulted in exploring alternative models for scholarly communication

Growth of ePrint Archives
• The roots of the OAI lie in the growing number of ePrint archives. Several of these began as
– Informal vehicles for dissemination of:
• preliminary research results and
• ‘gray’ literature

• A number of them have evolved into an essential medium for sharing research results among colleagues in a field

Growth of ePrint Archives…
• arXiv (xxx) – 1991 – Physics – Los Alamos (Cornell?) – 2.5 lakh (250,000) preprints – OAI-PMH
• CogPrints – Cog Sci. – Univ. of Southampton – OAI-PMH
• RePEc (NetEc) – 1993 – Economics – Univ. of Surrey – Guildford Protocol
• NCSTRL – Comp. Sci. – Dienst to OAI – ODU, VT and others
• NDLTD – Thesis & Dissertation – Virginia Tech.

Growth of ePrint Archives…
• The growth of ePrint archives exemplifies a more equitable and efficient model for disseminating research results
• An important challenge is to increase the impact of the ePrint archives
• The growth of ePrint archives demonstrates a shift in the traditional scholarly communication model – the journal system

Growth of ePrint Archives…
• There are indications that a growing number of disciplines, organizations and even commercial publishers are inspired by this pioneering work and are investigating alternative models for scholarly communication

Open Access Journals
• BMC (BioMed Central) – open access publisher
• PLoS (Public Library of Science) – will launch peer reviewed open access journals
– PLoS Biology already launched and
– PLoS Medicine will follow

• DOAJ – Directory of Open Access Journals http://www.doaj.org/

ePrint Archives
• Basic aims of the ePrint archives initiative:
– create a more effective scholarly communication mechanism and
– thereby provide an alternative to the existing scholarly communication model

ePrint Archives…
• Approaches taken by individual archives differ in a number of ways:
– Centralized model
• arXiv

– Distributed departmental/institutional model
• RePEc

– Some deal with gray literature

ePrint Archives…
• Approaches taken by individual archives differ in a number of ways …
– Some incorporate metadata of peer-reviewed papers
– Some deal with metadata only, others metadata and full text
– Different protocols
• Dienst, Guildford…

ePrint Archives…
• The different approaches and protocols used meant:
– Discovery is not facilitated
– Different search interfaces
– No provision to share metadata (interoperability)

ePrint Archives…
• Key players recognised the need for a single search interface to all the archives through interoperability
• Two key interoperability problems impairing the impact of ePrint archives were identified
– Multiple search interfaces
– No machine-based way of sharing the metadata

ePrint Archives…
• Solutions explored included:
– Cross searching of archives
– Harvesting metadata from various archives and building a central index

[In July 1999, Ginsparg, Luce and Sompel issued a call for technical experts to attend a meeting in Santa Fe, NM in Oct ’99]

Creation of UPS
• Creation of UPS [Universal Preprint Service] for author self archived scholarly literature was proposed
– UPS would be the fundamental and free layer of scholarly information, above which both free and commercial services could flourish

Creation of UPS…
• The first step towards establishing UPS was identification/creation of interoperable technologies and frameworks for the dissemination of ePrints

Luce * Van de Sompel * Ginsparg

UPS Prototype
• Architectural framework for UPS?

– Cross searching (Z39.50)
– Harvesting of metadata

UPS Prototype…
• Searching vs. harvesting:
– US digital library experience in this area (e.g. NCSTRL) indicated that cross-searching is not the preferred approach: distributed searching of N nodes is viable, but only for small values of N
• NCSTRL: N > 100; not satisfactory

UPS Prototype…
• The UPS Prototype at Santa Fe [Oct’99]
– Services based on a collection of harvested metadata
– SFX/OpenURL linking

• Based on NCSTRL & the Dienst protocol [Insights regarding lack of interoperability. Recommendation: metadata harvesting]

UPS Prototype…
• UPS architecture identified two logical roles:
– Data Provider (deposit + publish + expose metadata)
– Service Provider (harvest + provide service)

The Dawn of OAI
• The name UPS was quickly changed:
– to avoid a clash with the already established commercial parcel service and
– because not all e-print archives contained preprints

• The framework within which this universal service would be developed was now designated the Open Archives initiative – OAi, and later OAI

Requirements for Metadata Harvesting
• For the harvesting method to work, there must be agreements on:
– Transport protocol (HTTP)
– Metadata formats (DC, MARC..)
– Quality assurance (mandatory fields)
– IP and usage rights (who can do what with the records)

The Dawn of a Protocol
• An initial agreement in key areas made it possible to develop a protocol for metadata harvesting, named the Santa Fe Convention in honour of the meeting where the agreement was reached.

Benefits of Interoperability
• Facilitates information discovery, linking and peer reviewing
• Increases visibility (impact)
• Single search interface

What’s in the Name
Open Archives Initiative

The protocol is openly documented, and is compliant with open Standards – HTTP, DC and XML

Archive/Repository contains collection of document-like objects

OAI is happening at break-neck speed

Questions?

Part II
An Overview of the OAI-PMH

OAI-PMH Version History
• Santa Fe Convention was the first incarnation of the OAI-PMH [02/2000]:
– Goal: optimise discovery of e-prints
– Inputs…
• UPS prototype
• RePEc/SODA data/service provider model
• Dienst protocol
• Deliberations at the Santa Fe Meeting [10/99]

OAI-PMH Version History…
• OAI-PMH V. 1.0 [01/2001]
– Goal: optimise discovery of document-like objects
– Inputs…
• Santa Fe Convention
• various DLF meetings on metadata harvesting
• deliberations at Cornell
• alpha-testers of OAI-PMH v 1.0
• recognition of DC as ‘best’ core metadata format for interoperability across multiple archives

OAI-PMH v 1.0 [01/2001]
• Low-barrier interoperability specification
• Metadata harvesting model: data provider / service provider
• Focus on document-like objects
• HTTP based
• XML responses

• Unqualified Dublin Core
• Experimental: 12-18 months

OAI-PMH Version History…
• OAI-PMH V. 2.0
– Goal: recurrent exchange of metadata about resources between systems
– Inputs ...
• OAI-PMH v.1.0
• feedback on OAI-implementers
• deliberations by OAI-tech [09/01 - 06/02]
• alpha test group of OAI-PMH v.2.0 [03/02 - 06/02]
• officially released June 14, 2002

OAI-PMH v.2.0 [06/2002]
• Low-barrier interoperability specification
• Metadata harvesting model: data provider / service provider
• Metadata about resources
• HTTP based
• XML responses

• Unqualified Dublin Core
• Stable; not backward compatible with v.1.0
• Future releases will be backward compatible

What OAI-PMH is not
• Not a search system on its own
• Not a database management system
• Not a single metadata schema
• Not an OAIS

Basic Functioning of OAI-PMH

OAI: General Assumption
• Two groups of ‘participants’
• Data Providers (Open Archives, Repositories)
– free access to metadata
– not necessarily free access to full texts / resources
– easy to implement, low barrier solution

OAI: General Assumption…
• Two groups of ‘participants’
• Service Providers
– use OAI interfaces of the Data Providers
– harvest and store metadata (no live requests!)
– may select certain subsets from Data Providers (set hierarchy, date stamp)
– offer (value-added) services on the basis of the metadata

Multiple data and service providers

[Figure: Harvesting based on OAI-PMH, showing data providers, aggregators, and service providers]

OAI-PMH: Structure Model
[Figure: a Service Provider (harvester) sends requests to several Data Providers (repositories of e-prints, images, OPAC, museum and archive content) and receives responses]

Requests:
• Identify
• ListMetadataFormats
• ListSets
• ListIdentifiers
• ListRecords
• GetRecord

Responses:
• General information
• Metadata formats
• Set structure
• Record identifier
• Metadata

OAI-PMH: Protocol Overview
• Protocol is based on HTTP
• Requests are issued using the HTTP GET or POST method
• Responses are encoded in XML syntax
• Supports any metadata format (at least: Dublin Core)
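
The request side of the protocol can be sketched as a base URL plus URL-encoded arguments. This is a minimal sketch using only the Python standard library; the helper name `build_request` is an assumption made for illustration, and the base URL is the sample repository used later in this presentation.

```python
from urllib.parse import urlencode


def build_request(base_url, verb, **arguments):
    """Return a GET URL for an OAI-PMH request (illustrative helper)."""
    params = {"verb": verb}
    # Skip arguments that were not supplied.
    params.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(params)


BASE_URL = "http://eprints.iisc.ernet.in/perl/oai2"
url = build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(url)
```

Note that `urlencode` percent-encodes reserved characters such as the colons in an OAI identifier, so the generated URL may look different from a hand-typed one while meaning the same thing.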

OAI-PMH: Protocol Overview…
• Data providers may support selective harvesting by service providers at different levels of granularity:
– Define a logical set hierarchy
– Date stamps (last change of metadata set)

• Error messages are HTTP based
• Supports flow control
• Supports six request types (known as ‘verbs’)
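
Selective harvesting by set and date stamp can be pictured as a filter that the data provider applies before answering. The sketch below is not a real repository implementation: the records, identifiers, and the helper names `in_set` and `select` are hypothetical. It assumes colon-separated set names for the hierarchy and treats both date bounds as inclusive.

```python
from datetime import date

# Hypothetical records held by a data provider: (identifier, datestamp, setSpec).
RECORDS = [
    ("oai:example.org:1", date(2001, 5, 1), "physics:hep"),
    ("oai:example.org:2", date(2002, 3, 15), "physics:optics"),
    ("oai:example.org:3", date(2002, 6, 14), "maths"),
]


def in_set(set_spec, requested):
    """A record in 'physics:hep' also belongs to the parent set 'physics'."""
    return set_spec == requested or set_spec.startswith(requested + ":")


def select(records, from_=None, until=None, set_=None):
    """Return identifiers matching the optional date range and set."""
    return [
        ident
        for ident, stamp, set_spec in records
        if (from_ is None or stamp >= from_)
        and (until is None or stamp <= until)
        and (set_ is None or in_set(set_spec, set_))
    ]


print(select(RECORDS, from_=date(2002, 1, 1)))
print(select(RECORDS, set_="physics"))
```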

OAI Verbs
• Identify
• ListSets
• ListMetadataFormats
• ListIdentifiers
• GetRecord
• ListRecords

OAI Verbs - Identify
• Purpose
– Return general information about the archive and its policies (e.g., date stamp granularity)

• Parameters
– None

• Sample URL
– http://eprints.iisc.ernet.in/perl/oai2?verb=Identify
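
To show what a harvester might do with an Identify response, here is a minimal sketch that parses a response with the standard library. The XML below is a hand-written, abbreviated sample (example.org is a placeholder, not the output of the repository above), and `parse_identify` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hand-written sample response; all values are placeholders.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2003-10-01T12:00:00Z</responseDate>
  <request verb="Identify">http://example.org/oai</request>
  <Identify>
    <repositoryName>Example Archive</repositoryName>
    <baseURL>http://example.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <earliestDatestamp>2001-01-01</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Return the children of <Identify> as a dict keyed by local tag name."""
    root = ET.fromstring(xml_text)
    identify = root.find(OAI + "Identify")
    return {child.tag[len(OAI):]: child.text for child in identify}


info = parse_identify(SAMPLE)
print(info["repositoryName"], info["granularity"])
```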

OAI Verbs - ListSets
• Purpose
– Provide a listing of sets in which records may be organized

• Parameters
– resumptionToken – flow control mechanism (X)

• Sample URL
– http://eprints.iisc.ernet.in/perl/oai2?verb=ListSets
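
A ListSets response pairs each set specification with a human-readable name. The sketch below parses a hand-written sample; the set names are invented for illustration and `parse_sets` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response with an invented two-level set hierarchy.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>physics</setSpec><setName>Physics</setName></set>
    <set><setSpec>physics:hep</setSpec><setName>High Energy Physics</setName></set>
  </ListSets>
</OAI-PMH>"""


def parse_sets(xml_text):
    """Return a dict mapping each setSpec to its setName."""
    root = ET.fromstring(xml_text)
    return {
        s.findtext("oai:setSpec", namespaces=NS): s.findtext("oai:setName", namespaces=NS)
        for s in root.iterfind("oai:ListSets/oai:set", NS)
    }


sets = parse_sets(SAMPLE)
print(sets)
```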

OAI Verbs - ListMetadataFormats
• Purpose
– List metadata formats supported by the archive as well as their schema locations and namespaces

• Parameters
– identifier – for a specific record (O)

• Sample URL
– http://eprints.iisc.ernet.in/perl/oai2?verb=ListMetadataFormats
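
A harvester would typically check the ListMetadataFormats response before asking for records in a given format. The following is a minimal sketch over a hand-written sample response; `supported_formats` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample listing only the Dublin Core format.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""


def supported_formats(xml_text):
    """Return a dict mapping each metadataPrefix to its schema location."""
    root = ET.fromstring(xml_text)
    return {
        fmt.findtext("oai:metadataPrefix", namespaces=NS): fmt.findtext("oai:schema", namespaces=NS)
        for fmt in root.iterfind("oai:ListMetadataFormats/oai:metadataFormat", NS)
    }


formats = supported_formats(SAMPLE)
print("oai_dc" in formats)
```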

OAI Verbs - ListIdentifiers
• Purpose
– List headers for all items corresponding to the specified parameters

• Parameters
– from – start date (O)
– until – end date (O)
– set – set to harvest from (O)
– metadataPrefix – metadata format to list identifiers for (R)
– resumptionToken – flow control mechanism (X)

• Sample URL
– http://eprints.iisc.ernet.in/perl/oai2?verb=ListIdentifiers&metadataPrefix=oai_dc
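
The headers returned by ListIdentifiers carry an identifier, a date stamp, and optional set memberships. A minimal parsing sketch follows, using a hand-written sample with placeholder identifiers; `parse_headers` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample; identifiers and dates are placeholders.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2002-05-01</datestamp>
      <setSpec>physics</setSpec>
    </header>
    <header>
      <identifier>oai:example.org:2</identifier>
      <datestamp>2002-06-14</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def parse_headers(xml_text):
    """Return one dict per <header> with identifier, datestamp and sets."""
    root = ET.fromstring(xml_text)
    headers = []
    for header in root.iterfind("oai:ListIdentifiers/oai:header", NS):
        headers.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "sets": [s.text for s in header.findall("oai:setSpec", NS)],
        })
    return headers


headers = parse_headers(SAMPLE)
print([h["identifier"] for h in headers])
```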

OAI Verbs - GetRecord
• Purpose
– Returns the metadata for a single item in the form of an OAI record

• Parameters
– identifier – unique id for item (R)
– metadataPrefix – metadata format for the record (R)

• Sample URL
– http://eprints.iisc.ernet.in/perl/oai2?verb=GetRecord&identifier=oai:iiscePrints.OAI2:10&metadataPrefix=oai_dc
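
An OAI record returned by GetRecord wraps a header and a metadata section. The sketch below extracts the Dublin Core fields from a hand-written sample record; the titles and names in it are invented, and `parse_record` is an illustrative helper.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample record; content is invented for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example.org:10</identifier>
        <datestamp>2002-06-14</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical e-print</dc:title>
          <dc:creator>Author, A.</dc:creator>
          <dc:creator>Author, B.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def parse_record(xml_text):
    """Return (identifier, fields); fields maps DC element names to value lists."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", NS)
    dc = record.find("oai:metadata/oai_dc:dc", NS)
    fields = {}
    for element in dc:
        name = element.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
        fields.setdefault(name, []).append(element.text)
    identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
    return identifier, fields


identifier, fields = parse_record(SAMPLE)
print(identifier, fields["title"])
```

Values are kept as lists because Dublin Core elements may repeat, as the two creators here do.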

OAI Verbs - ListRecords
• Purpose
– Retrieves metadata records for multiple items

• Parameters
– from – start date (O)
– until – end date (O)
– set – set to harvest from (O)
– resumptionToken – flow control mechanism (X)
– metadataPrefix – metadata format (R)

• Sample URL
– http://www.anarchive.org/cgi-bin/OAI?verb=ListRecords&metadataPrefix=oai_dc&from=2001-01-01
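
When processing a ListRecords response, a harvester usually walks the records one by one and has to cope with records that the repository marks as deleted (in OAI-PMH v.2.0 such a header carries status="deleted" and has no metadata). The sketch below runs over a hand-written sample; the identifiers and title are placeholders and `iter_records` is an illustrative helper.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample: one live record and one deleted record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2001-03-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>First hypothetical e-print</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:example.org:2</identifier>
        <datestamp>2001-04-01</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""


def iter_records(xml_text):
    """Yield (identifier, deleted, title); title is None for deleted records."""
    root = ET.fromstring(xml_text)
    for record in root.iterfind("oai:ListRecords/oai:record", NS):
        header = record.find("oai:header", NS)
        identifier = header.findtext("oai:identifier", namespaces=NS)
        deleted = header.get("status") == "deleted"
        title = record.findtext("oai:metadata/oai_dc:dc/dc:title", namespaces=NS)
        yield identifier, deleted, title


records = list(iter_records(SAMPLE))
print(records)
```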

Protocol Details: Flow Control
Exchange between the Harvester (Service Provider) and the Repository (Data Provider):

• Harvester: “want to have all your records”
– archive.org/oai?verb=ListRecords&metadataPrefix=oai_dc
• Repository: “have 267, but give you only 100”
– 100 records + resumptionToken “anyID1”
• Harvester: “want more of this”
– archive.org/oai?verb=ListRecords&resumptionToken=anyID1
• Repository: “have 267, give you another 100”
– 100 records + resumptionToken “anyID2”
• Harvester: “want more of this”
– archive.org/oai?verb=ListRecords&resumptionToken=anyID2
• Repository: “have 267, give you my last 67”
– 67 records + resumptionToken “”
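
The exchange above amounts to a loop on the harvester side: keep sending the last resumptionToken back until the repository returns an empty one. The sketch below simulates this without any network access. `FakeRepository` is an in-memory stand-in invented for illustration, and its tokens are simple offsets; a real harvester must treat tokens as opaque strings.

```python
class FakeRepository:
    """In-memory stand-in for a data provider; not a real OAI-PMH server."""

    def __init__(self, total=267, batch_size=100):
        self.records = [f"record-{i}" for i in range(1, total + 1)]
        self.batch_size = batch_size
        self.requests = 0

    def list_records(self, resumption_token=None):
        """Return (batch, next_token); an empty token marks the last batch."""
        self.requests += 1
        start = int(resumption_token) if resumption_token else 0
        end = start + self.batch_size
        next_token = str(end) if end < len(self.records) else ""
        return self.records[start:end], next_token


def harvest(repository):
    """Request batches until the repository returns an empty resumptionToken."""
    harvested = []
    batch, token = repository.list_records()
    harvested.extend(batch)
    while token:
        batch, token = repository.list_records(resumption_token=token)
        harvested.extend(batch)
    return harvested


repo = FakeRepository()
records = harvest(repo)
print(len(records), repo.requests)  # 267 records in 3 requests
```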

OAI Compliant Tools
• eprints.org (http://www.eprints.org)
• DSpace (http://dspace.org)
• CDSware (http://cdsware.cern.ch)
• Kepler (http://kepler.cs.odu.edu/)

OAI-PMH Based Services
• Repository Explorer:
– http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai/

• Search engines
– Arc: http://arc.cs.odu.edu/
– MyOAI: http://www.myoai.org/
– Physnet (subset of arXiv, IOP…): http://physnet.uni-oldenburg.de/oai/query.php
– OAIster: http://oaister.umdl.umich.edu/o/oaister/

Summary
• Low-cost mechanism for harvesting metadata records from one system to another
• Based on HTTP and XML – Web-friendly
• Development over the last 2-3 years has seen a move from the specific (discovery of e-prints) to the generic (sharing descriptions of any resource)

Summary…
• Recommends simple DC as record format but extensible to any format encoded in XML
• OAI-PMH is not a search protocol
• Metadata and full-text typically made freely available – but not a requirement
– OAI-PMH can be used between closed groups

Other Important Resources
• OAI Web site:
– http://www.openarchives.org/

• Open Archives Forum
– http://www.oaforum.org/tutorial/index.php

• The Santa Fe Convention of the Open Archives Initiative, by Herbert Van de Sompel and Carl Lagoze, D-Lib Magazine, Vol. 6 No. 2, Feb 2000

Questions?

Thank you for your Presence & Patience


								