Metadata for Discovery

					Metadata for Harvesting

      Stuart Lewis
                 Contents

•   Who uses our metadata?
•   How do they gather our metadata?
•   What are some issues and pitfalls?
•   What tools can assist me?
         Who uses our metadata?

• Discovery services
  – Services that help people find the things in
    our repository
     •   Google / traditional search engines
     •   Google Scholar / Academic Live
     •   OAIster
     •   Intute Repository Search
 How do they gather our metadata?

• Crawling and indexing
  – Google / traditional search engines
     • Visit the first page of your site and index it
         – Store all the text in their search index
     • Note down all links on the page
     • Come back and visit each link at a later time
         – Well-behaved crawlers fetch only one page every few
           seconds so as not to overload the server
  – Crawlers do not use the metadata in its raw form, so how
    you store your metadata is irrelevant
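
The crawl-and-index loop described above can be sketched offline. This is a minimal illustration, not how any particular search engine works: the page content, the PageIndexer class and the repository.example.com URL are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageIndexer(HTMLParser):
    """Collects the visible text and the outgoing links of one HTML page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.text_parts = []  # text that would go into the search index
        self.links = []       # links to come back to later

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.page_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


# Hypothetical page; a real crawler would fetch it over HTTP and
# pause a few seconds between requests.
SAMPLE_PAGE = """
<html><body>
<h1>Example Repository</h1>
<a href="/50407/">The sedimentary environment of the Darwin mounds</a>
<a href="browse.html">Browse by year</a>
</body></html>
"""

indexer = PageIndexer("http://repository.example.com/")
indexer.feed(SAMPLE_PAGE)
print(indexer.text_parts)
print(indexer.links)
```

Note that only rendered text and links are collected; nothing here depends on how the repository stores its metadata.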
 How do they gather our metadata?

• Crawling and indexing
  – Google Scholar
    • The same crawling method as traditional
      search engines
    • Use of metadata in HTML head
    • Example:
       – Next slide
  – Raw metadata is used
    • Supported by EPrints and DSpace (V 1.5+)
                                 Example
<link rel="schema.DC" href="http://purl.org/DC/elements/1.0/" />
<meta content="The sedimentary environment of the Darwin cold-water coral mounds, N. Rockall Trough" name="DC.title" />
<meta content="Huvenne, V." name="DC.creator" />
<meta content="Masson, D.G." name="DC.creator" />
<meta content="Wheeler, A." name="DC.creator" />
<meta content="QE Geology" name="DC.subject" />
<meta content="GC Oceanography" name="DC.subject" />
<meta content="2008-02-25" name="DC.date" />
<meta content="Article" name="DC.type" />
<meta content="PeerReviewed" name="DC.type" />
<meta content="http://eprints.soton.ac.uk/50407/" name="DC.identifier" />
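
To show how a service could read these tags, here is a minimal sketch using Python's standard html.parser module. The DublinCoreMetaParser class is a name invented for this example, and the sample head is a shortened copy of the tags above; it is not a description of how Google Scholar actually processes pages.

```python
from collections import defaultdict
from html.parser import HTMLParser


class DublinCoreMetaParser(HTMLParser):
    """Gathers <meta name="DC.*" content="..."> values, keeping repeats."""

    def __init__(self):
        super().__init__()
        self.fields = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags also arrive here.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.startswith("DC.") and content is not None:
            # Strip the "DC." prefix: "DC.creator" -> "creator".
            self.fields[name[3:]].append(content)


SAMPLE_HEAD = """
<meta content="Huvenne, V." name="DC.creator" />
<meta content="Masson, D.G." name="DC.creator" />
<meta content="Wheeler, A." name="DC.creator" />
<meta content="2008-02-25" name="DC.date" />
<meta content="Article" name="DC.type" />
"""

parser = DublinCoreMetaParser()
parser.feed(SAMPLE_HEAD)
for field, values in parser.fields.items():
    print(field, values)
```

Repeated tags such as DC.creator are kept in document order, one list entry per tag.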
 How do they gather our metadata?

• Metadata Harvesting
  – OAIster / Intute Repository Search
     • Web interface is not used
     • Open Archives Initiative Protocol for Metadata Harvesting
       (OAI-PMH)
     • XML interface to raw metadata
  – Raw metadata is used
     • Supported by most repository software
 How do they gather our metadata?

• OAI-PMH
  – Base URL
    • http://repository.example.com/oai/request
  – Examples:
    • http://dspace.example.com/dspace-oai/request
    • http://eprints.example.com/perl/oai2
 How do they gather our metadata?

• OAI-PMH
  – Base URL
     • http://example.repository.com/oai/request
  – Append verbs
     • http://example.repository.com/oai/request?verb=Identify
  – Six verbs
     •   Identify / ListMetadataFormats
     •   ListSets
     •   ListIdentifiers / ListRecords
     •   GetRecord
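
Each verb takes its own arguments. The sketch below lists the required ones as defined in the OAI-PMH 2.0 specification and checks a request against them; the missing_arguments helper is a hypothetical name used only for this illustration.

```python
# Required arguments for each of the six OAI-PMH verbs.
REQUIRED_ARGUMENTS = {
    "Identify": set(),
    "ListMetadataFormats": set(),
    "ListSets": set(),
    "ListIdentifiers": {"metadataPrefix"},
    "ListRecords": {"metadataPrefix"},
    "GetRecord": {"identifier", "metadataPrefix"},
}


def missing_arguments(verb, arguments):
    """Return the required arguments that a request has left out."""
    if verb not in REQUIRED_ARGUMENTS:
        raise ValueError(f"{verb!r} is not an OAI-PMH verb")
    # On list requests a resumptionToken replaces all other arguments.
    if "resumptionToken" in arguments and verb.startswith("List"):
        return set()
    return REQUIRED_ARGUMENTS[verb] - set(arguments)


print(missing_arguments("Identify", {}))
print(missing_arguments("GetRecord", {"identifier": "oai:example:1"}))
```

The identifier "oai:example:1" is a placeholder, not a real record.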
 How do they gather our metadata?

• OAI-PMH
  – Base URL
    • http://example.repository.com/oai/request
  – Append verbs
    • http://example.repository.com/oai/request?verb=Identify
  – Append parameters
    • http://example.repository.com/oai/request?verb=ListIdentifiers&metadataPrefix=oai_dc
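
Building these URLs by hand is error-prone, so a harvester would normally assemble them in code. A minimal sketch, assuming the base URL from this slide; oai_request_url is a hypothetical helper and the record identifier is a placeholder.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **parameters):
    """Append the verb and any further parameters to the base URL."""
    query = {"verb": verb, **parameters}
    # urlencode escapes reserved characters such as ":" in identifiers.
    return f"{base_url}?{urlencode(query)}"


BASE_URL = "http://example.repository.com/oai/request"

list_url = oai_request_url(BASE_URL, "ListIdentifiers", metadataPrefix="oai_dc")
record_url = oai_request_url(
    BASE_URL,
    "GetRecord",
    identifier="oai:example.repository.com:1234",  # placeholder identifier
    metadataPrefix="oai_dc",
)
print(list_url)
print(record_url)
```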
 How do they gather our metadata?

• OAI-PMH
  –   Initial harvesting of all items
  –   Periodic harvesting of updates (from=<date of last harvest>)
  –   Edited records have their datestamp updated, so the
      next periodic harvest picks them up
  –   Some services also perform a full harvest
      periodically
       • OAIster once a month
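
The initial and periodic harvests differ only in the from argument. The sketch below shows that, and reads record datestamps from a response; the sample XML is a hand-written, abbreviated ListIdentifiers response, and the function names and identifiers are assumptions for this example.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def harvest_parameters(last_harvest=None):
    """Full harvest when last_harvest is None, otherwise updates only."""
    parameters = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
    if last_harvest is not None:
        parameters["from"] = last_harvest
    return parameters


def parse_headers(response_xml):
    """Return (identifier, datestamp) pairs from a list response."""
    root = ET.fromstring(response_xml)
    headers = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        headers.append((identifier, datestamp))
    return headers


# Abbreviated, hypothetical response for illustration only.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:repository.example.com:1</identifier>
      <datestamp>2008-02-25</datestamp>
    </header>
    <header>
      <identifier>oai:repository.example.com:2</identifier>
      <datestamp>2008-03-01</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>
""".strip()

print(harvest_parameters())               # initial harvest of all items
print(harvest_parameters("2008-02-25"))   # later harvest of updates
print(parse_headers(SAMPLE_RESPONSE))
```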
 How do they gather our metadata?

• OAI-PMH
  – Works with different metadata formats. For
    example:
     •   oai_dc (unqualified Dublin Core)
     •   oai_qdc (qualified Dublin Core)
     •   mets (Metadata Encoding and Transmission Standard)
     •   uketd_dc (UK ETD expressed in DC)
     •   uketd_mets (UK ETD expressed in METS + DC)
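
A harvester can ask which formats a repository offers (verb=ListMetadataFormats) and pick the richest one it understands. A minimal sketch: the response is abbreviated (a real one also carries schema and namespace elements), and the preference order is only an assumption for the example.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Abbreviated, hypothetical ListMetadataFormats response.
SAMPLE_FORMATS_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>mets</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
""".strip()


def supported_prefixes(response_xml):
    """List the metadataPrefix values a repository advertises."""
    root = ET.fromstring(response_xml)
    return [e.text for e in root.iterfind(".//oai:metadataPrefix", OAI_NS)]


def choose_prefix(supported, preferred=("uketd_dc", "oai_qdc", "oai_dc")):
    """Pick the first preferred format the repository supports."""
    for prefix in preferred:
        if prefix in supported:
            return prefix
    return None


supported = supported_prefixes(SAMPLE_FORMATS_RESPONSE)
print(supported)
print(choose_prefix(supported))
```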
 What are some issues and pitfalls?

• The use of unqualified Dublin Core
  – Unqualified Dublin Core (oai_dc) is the minimum
    required, and most often used, metadata format
    for harvesting
  – When qualified fields are harvested as oai_dc, the
    qualifier is dropped, so distinct roles collapse:
    • contributor.author => contributor
    • contributor.advisor => contributor
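
The effect is easy to see in a small sketch. The record below is made up, and to_unqualified is a hypothetical helper that mimics the mapping rather than reproducing any repository's actual crosswalk.

```python
def to_unqualified(qualified_record):
    """Drop everything after the first dot in each field name."""
    unqualified = {}
    for field, values in qualified_record.items():
        element = field.split(".", 1)[0]
        # Values from different qualifiers end up in one shared list.
        unqualified.setdefault(element, []).extend(values)
    return unqualified


# Hypothetical thesis record with qualified fields.
record = {
    "contributor.author": ["Author, A."],
    "contributor.advisor": ["Advisor, B."],
    "title": ["An example thesis"],
}

flattened = to_unqualified(record)
print(flattened)
```

After the mapping, nothing in the harvested record says which contributor was the author and which the advisor.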
 What are some issues and pitfalls?

• Repeated fields
  – Some discovery services show all fields
  – Some (Intute) only show the first value of a
    repeated field
    • description = A scanned copy of the oldest
      thesis in the library
    • description.abstract = The full abstract of the
      thesis.
  – Both values are harvested as description in oai_dc, so
    the abstract (arguably of more value) is not shown
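
A sketch of the difference between the two display behaviours, using the values from this slide. The function names are invented for the example, and the first-value rule is a simplification of what a service such as Intute might do.

```python
def all_values_display(record):
    """Show every value of every field."""
    return {field: list(values) for field, values in record.items()}


def first_value_display(record):
    """Show only the first value of each repeated field."""
    return {field: values[0] for field, values in record.items() if values}


# Both description and description.abstract arrive as "description".
harvested = {
    "description": [
        "A scanned copy of the oldest thesis in the library",
        "The full abstract of the thesis.",
    ],
}

print(all_values_display(harvested))
print(first_value_display(harvested))
```

Which value a first-value service shows depends entirely on the order in which the repository exposes the repeated field.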
      What tools can assist me?

• OAI registration
  – Includes a validation stage
• Repository explorer
  – http://re.cs.uct.ac.za/
• OAI-PMH XML Stylesheet
  – Comes with EPrints
  – Configurable option with DSpace
                The End!

• Any questions?

  – stuart.lewis@aber.ac.uk
  – support@rsp.ac.uk

				