DTU_Hagedorn.ppt - University of Michigan


OAIster: Metadata Pointing to Digital Objects

                          Kat Hagedorn
    Metadata Harvesting/DLXS Librarian
        University of Michigan Libraries
                     February 18, 2004
• One-year Mellon grant project to test
  the feasibility of making OAI-enabled
  metadata for digital objects accessible
  to the public
• Digital Library Production Service at
  University of Michigan Libraries began
  work in December 2001
• Launched in June 2002
•   Any audience
•   Any subject matter
•   Any format
•   Freely accessible
•   No dead ends
•   One-stop shopping

…retrieving the “hidden web”
          tool we borrowed
• University of Illinois Urbana-Champaign
  open-source OAI protocol harvester
• Java edition for our Unix environment
• Worked collaboratively to iron out kinks
  – resumptionToken / retryAfter
  – inexplicable kill
  – bogus records in MySQL table
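
The resumptionToken / retryAfter handling mentioned above can be sketched roughly as follows. This is a minimal Python illustration, not the UIUC harvester's code: `fetch` is a hypothetical callable standing in for the HTTP layer, and the sketch assumes the repository sends Retry-After as a number of seconds.

```python
import time
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(fetch, metadata_prefix="oai_dc", max_retries=3, sleep=time.sleep):
    """Yield <record> elements from a ListRecords harvest.

    `fetch(params)` is a hypothetical callable returning
    (status_code, headers_dict, body_text) for one OAI-PMH request.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    retries = 0
    while True:
        status, headers, body = fetch(params)
        if status == 503:
            # Repository asks us to back off. Assumes Retry-After holds
            # a number of seconds rather than an HTTP date.
            if retries >= max_retries:
                raise RuntimeError("repository kept answering 503")
            retries += 1
            sleep(int(headers.get("Retry-After", "60")))
            continue
        retries = 0
        root = ET.fromstring(body)
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty one: the list is complete
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing `fetch` and `sleep` in as arguments keeps the loop testable without a live repository.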
    development environment
• Digital Library Extension Service (DLXS)
• Develop open-source middleware and
  license XPAT search engine for building
  and mounting digital libraries
• Middleware consists of document
  classes, i.e., Text, Image, Bib, FindAid
• Originally designed to make SGML
  encoded texts available online
            tool we developed
• Runs in DLXS environment using
• Current BibClass web templates modified
• Additional Java-based transformation tool
  –   DC metadata records concatenated
  –   No-digital-object records filtered out
  –   Records counted
  –   Conversion from UTF-8 to ISO-8859-1
  –   XSLT used to transform DC records into
      BibClass records
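
To make the filtering, counting, and re-encoding steps above concrete, here is a rough Python sketch. It is not the Java tool itself. The test for "has a digital object" (any dc:identifier that looks like a URL) is an assumed heuristic, the function names are hypothetical, and the XSLT step is left out.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

def has_object_link(record):
    """Assumed heuristic: a record points to a digital object if any
    dc:identifier looks like a URL."""
    return any(
        (el.text or "").strip().lower().startswith(("http://", "https://", "ftp://"))
        for el in record.iter(DC + "identifier")
    )

def prepare(records):
    """Filter, count, and re-encode a list of parsed DC record elements."""
    kept = [r for r in records if has_object_link(r)]
    wrapper = ET.Element("records")  # concatenate into one document
    wrapper.extend(kept)
    # Characters outside ISO-8859-1 are written as numeric character
    # references by ElementTree's serializer.
    data = ET.tostring(wrapper, encoding="iso-8859-1")
    return data, len(kept), len(records) - len(kept)
```

The function returns the concatenated ISO-8859-1 bytes together with the number of records kept and the number filtered out.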
               system design
[Diagram: OAI-enabled DC records -> record storage -> XSLT transformation
 tool (per source) -> BibClass indexes -> search interface (XPAT)]
• One place to look for digital objects
• Big
  – 3,016,251 metadata records
  – 267 institutions (as of last week…)
• Popular
  – Averages 3300 search sessions / month
  – Picked up in March ‘03: average 3500 now
  – 43,894 searches in one year (June 2002 –
    July 2003)
[Slide titles only; images not preserved: search limiters sort results repositories]
            repositories: e.g.,
• arXiv Eprint Archive: math and physics pre-
  and post-prints
• Online Archive of California: manuscripts,
  photographs, and works of art held in
  institutions across California
• Sammelpunkt, Elektronisch Archivierte
  Theorie: archive of philosophical publications
• British Women Romantic Poets Project:
  collection of poems written by British women
  between 1789 and 1832
            repositories: stats
• As of February ‘04, out of 267 repositories…
• International and U.S.
  – U.S.: 50.5% (135)
  – Intl: 49.5% (132)
• By subject
  – Humanities: 24% (65)
  – Science: 30% (81)
  – Mixed: 46% (121)
• E-prints and pre-prints
  – Using software: 39% (104)
  – Not using software: 61% (163)
     major issues encountered
• Metadata variation
• Records not leading to digital objects
• Access restrictions on digital objects
  described in records
• Duplicate records for a single digital object
     issue: metadata variation
• With more records, users need more ways to restrict searches
• Consistent metadata needed to
  facilitate these restrictions
• One option: normalization of data
     issue: metadata variation
• Type: the obvious quick win
  – 240 metadata values mapped to four
    generic values (text, image, audio, video)
  – e.g.,
    audio, sound = audio
    motion, animation, newsreels, etc. = video
    watercolour, watercolor, slides, etc. = image
    article, articles, booklet, diss, story, etc. = text
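
The Type mapping above amounts to a lookup table. A minimal Python sketch using only the example values listed on this slide (the full 240-value table is not reproduced here, and `normalize_type` is a hypothetical name):

```python
# Example values from the slide, mapped to the four generic types.
TYPE_MAP = {
    "audio": "audio", "sound": "audio",
    "motion": "video", "animation": "video", "newsreels": "video",
    "watercolour": "image", "watercolor": "image", "slides": "image",
    "article": "text", "articles": "text", "booklet": "text",
    "diss": "text", "story": "text",
}

def normalize_type(value):
    """Return one of the four generic types, or None if unmapped."""
    return TYPE_MAP.get(value.strip().lower())
```

Returning None for unmapped values leaves the decision about unknown types (skip, log, or map by hand) to the caller.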
     issue: metadata variation
• Date: where to begin?
  – Most records with at least one date
  – Some records include up to seven dates
  – No consistent style of date
• Subject: out of context, what meaning?
  – Many records with at least one subject element
  – But over 100 records with more than 50 subjects
  – And one record with 1000!
     issue: metadata variation
• Sample date values
  <date>between 1827 and 1833</date>
  <date>November 13, 1947</date>
  <date>SEP 1958</date>
  <date>235 bce</date>
  <date>Summer, 1948</date>
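
One possible first pass at date values like these is to pull out just the years. The Python sketch below is an illustration under stated assumptions, not something OAIster does: it treats any standalone four-digit number as a year and any number followed by "bce" or "bc" as a BCE year, which it stores as a negative number for sorting.

```python
import re

# A number followed by bce/bc, or a standalone four-digit number.
PATTERN = re.compile(r"\b(\d{1,4})\s*(bce|bc)\b|\b(\d{4})\b", re.IGNORECASE)

def extract_years(value):
    """Return the years found in a free-text date; BCE years are negative."""
    years = []
    for match in PATTERN.finditer(value):
        if match.group(1):
            years.append(-int(match.group(1)))
        else:
            years.append(int(match.group(3)))
    return years
```

Day numbers such as the "13" in "November 13, 1947" are ignored because they are not four digits long; month and season words are simply dropped.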
      issue: metadata variation
• Sample subject values
  <subject>1852, Apr. 22. E[veritt] Judson, letter to
    Philuta [Judson].</subject>
  <subject>Slavery--United States--Controversial
  <subject>view of interior with John Henry
  <subject>Particles (Nuclear physics) --
      issue: no digital objects
• Some records contain links to further
  description of digital object
• But not the digital object itself
• Culling difficult
• One option: add explanatory text to site
• Or, unfortunately, spot-check and
  remove repositories with this issue
     issue: access restrictions
• No records where metadata itself is
  restricted in use (as far as we know!)
• Definitely some records where objects
  are restricted to licensed users
• One option: add explanatory text to site
• Or sub-set OAIster into free and
  “partially” free repositories
      issue: duplicate records
• Two records harvested, different
  identifiers, same object described and
  pointed to
• Two records harvested inadvertently
  through aggregators and original
      issue: duplicate records
• Need algorithm to automate de-duplication
• Were duplicates to be identified, how to
  deal with the issue?
  – Suppress?
  – Group?
  – Flag?
• So far, not addressed in OAIster
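
As one illustration of what an automated approach might look like (the slides state that OAIster has not addressed this, so everything here is an assumption), records could be grouped by the URL they point to. In this Python sketch, `url_key` and `group_duplicates` are hypothetical names and the URL normalisation is deliberately crude.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def url_key(url):
    """Normalise a URL just enough to compare it (assumed, crude)."""
    parts = urlsplit(url.strip())
    return (parts.netloc.lower(), parts.path.rstrip("/"), parts.query)

def group_duplicates(records):
    """`records` is an iterable of (oai_identifier, object_url) pairs.

    Returns lists of OAI identifiers whose records point at the same URL.
    """
    groups = defaultdict(list)
    for oai_id, url in records:
        groups[url_key(url)].append(oai_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Each returned group could then be suppressed, grouped, or flagged, which are the options the slide raises. This catches only records that share a URL; two records pointing to the same object through different URLs would need a comparison of the metadata itself.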
          future of OAIster
• Advanced searching
• Grouping to aid browsing
• Further normalization of data
• Handling duplicate records
• Saving/emailing/downloading records
• Collaboration with other services:
  search, instructional…
• More user testing…
        current state of protocol
• Popular
• As Peter Suber says:
  – “…no other single idea or technology in the [open-
    source movement] has enjoyed this density of
    endorsement and adoption in a six month period.”
• Data providers over one year:
  –   June ‘02: 56 repositories / 274,062 records
  –   June ‘03: 187 repositories / 1,246,953 records
  –   Over three-fold increase for repositories
  –   Over four-fold increase for records
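
The growth factors quoted above follow directly from the two snapshots. A quick Python check, with variable names chosen for illustration:

```python
repos_2002, repos_2003 = 56, 187
records_2002, records_2003 = 274_062, 1_246_953

repo_growth = repos_2003 / repos_2002        # roughly 3.3
record_growth = records_2003 / records_2002  # roughly 4.5
```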
            future of protocol
• Branching out
  –   DC required vs. highly recommended
  –   Use of OAI in closed environments
  –   Static repository protocol
  –   OAI-rights committee
• OAI evangelism
              contact info
• Kat Hagedorn
• University of Michigan Libraries, Digital
  Library Production Service

