The Open Archives Initiative Protocol for Metadata Harvesting and the IMLS Digital Collections & Content Project at the University of Illinois
Timothy W. Cole (t-cole3@uiuc.edu) Mathematics Librarian & Professor of Library Administration University of Illinois at Urbana-Champaign
Friday 12 November 2004 MCN 2004, Minneapolis, MN
http://imlsdcc.grainger.uiuc.edu/Cole_MCN2004_OAI.ppt
The Digital Information Landscape
The information landscape can be seen as a contour map in which there are mountains, hillocks, valleys, plains and plateaus…. A specialized collection of particular importance is like a sharp peak. Upon a plateau there might be undulations representing strengths and weaknesses…. The landscape is, however, multidimensional. Where one scholar may see a peak another may see a trough. The task is to devise mapping conventions which enable scholars to read the map of the landscape fruitfully, at the appropriate level of generality or specificity.
Michael Heaney (2000), “An Analytical Model of Collections and their Catalogues.”
t-cole3@uiuc.edu University of Illinois at UC
OAI-PMH & The IMLS DCC Project MCN 2004, 12 November 2004
Users & Uses of Digital Libraries
From Bibusages study (French National Library):
Digital Libraries are used in conjunction with Web search engines, generalist portals, commercial sites
Mix of intensive & casual users
DL users skew somewhat older, higher degree level than average French Internet user population
DL users seeking answer for specific information need; most time spent discovering, viewing, & downloading documents
“Digital Libraries … are now attracting a new type of public, bringing about new, unique and original ways for reading and understanding texts.”
Houssem Assadi, et al. “Users & Uses of Online Digital Libraries in France,” ECDL 2003
Managing Digital Collections & Content
How do mandates translate & change in digital world?
Content & collections as virtual ‘information landscapes’
New users, uses, & metrics
Increased emphasis on interoperability & sharing
  Harvesting – e.g., OAI-PMH
  Federated searching – e.g., Z39.50 / ZNG, DiGIR, ...
Reconciling different descriptive metadata practices
New metrics for metadata quality (for interoperability)
New models for sharing & resource discovery
New Emphasis on ‘Shareable’ metadata
IMLS Digital Library Forum (2001)
Framework of Guidance for Building Good Digital Collections
http://www.niso.org/framework/forumframework.html
Stresses reusability, persistence, interoperability, verification, and documentation of digital collections & content
Accompanying report included recommendations encouraging:
  Creation of an IMLS Collection Registry
  Implementation of the Open Archives Initiative Protocol for Metadata Harvesting by IMLS projects creating digital content
  Development of infrastructure to facilitate interoperability between IMLS projects and initiatives like NSDL
IMLS DCC Project Overview
Collection description & prototype registry for IMLS National Leadership Grant projects with associated digital content
  Enhance discoverability of collections & content
  Provide alternative view of one output of IMLS NLG program
Prototype item level metadata repository via OAI-PMH
  Demonstrate potential of metadata for interoperability
  Serve as testbed for IMLS projects interested in OAI-PMH
  Facilitate reuse of information resources paid for by IMLS
Research question:
How can resource developers best represent collections and items to meet the needs of service providers and end users?
IMLS Grantees – A Diverse Community
Mix of library, museum, and archive traditions
Wide variation in technical skills, technology infrastructure & information management policy
Diverse perspectives on intellectual property; use and presentation of metadata & primary resources
Diverse embedded knowledge structures
Results in wide variability in:
  Metadata formats
  Content resource types
  Controlled vocabularies
  Descriptive metadata practices
Broad Categories of Institutions Represented in Collection Registry
[Pie chart] Institutions in IMLS Collection Registry by Category (349 institutions from 134 collections / 92 NLG projects)
Libraries 41%
Museums 36%
Other 17%
Specimen Holding 3%
Archives 3%
Detailed Institution Types Represented in Collection Registry
[Bar chart] Types of Institutions in IMLS Collection Registry (349 institutions from 134 collections from 92 NLG projects)
Vertical axis: Numbers (0 to 80). Horizontal axis: Institution Types.
Institution types, in source order: Acad. Lib.; Historical Soc.; Public Lib.; Other; History Mus.; General Mus.; Other Higher Ed; Spec. Mus.; State Lib.; Research Lib./Archives; Art Museum; K-12 School; Lib. Cons.; Nat.His. Mus.; Science Mus.; Bot. Garden / Herbarium; Spec. library; Historic Site; Arboretum; Museum Lib.; Private Lib.; School Lib.; State Mus.
Bar values, in source order: 69, 58, 48, 39, 24, 15, 13, 12, 12, 11, 10, 8, 6, 6, 4, 3, 3, 2, 1, 1, 1, 1, 1
Broad Categories of Institutions Represented in Metadata Repository
[Pie chart] Institutions Represented in Metadata Repository (136 institutions -- 27 harvested collections / 193,677 metadata records)
Libraries 42%
Museums 37%
Other 13%
Specimen Holding 4%
Archives 4%
Detailed Institution Types Represented in Metadata Repository
[Bar chart] Types of Institutions Represented in Item Level Metadata Repository (136 institutions -- 27 harvested collections / 193,677 metadata records)
Vertical axis: Number of Institutions (0 to 40). Horizontal axis: Institution Types.
Bar values, in source order: 34, 16, 16, 14, 8, 6, 6, 6, 5, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0
Metadata Formats
[Bar chart] Metadata Formats in Use (horizontal axis: Number of Respondents, 0 to 60)

Format                      Total      In combination   Only
Other Metadata Standard     14 (18%)   10 (13%)         4 (5%)
VRA Core                    2 (3%)     2 (3%)           0
TEI                         16 (21%)   16 (21%)         0
MARC                        28 (36%)   24 (31%)         4 (5%)
EAD                         12 (16%)   11 (14%)         1 (1%)
Dublin Core                 48 (62%)   38 (49%)         10 (13%)

Locally Developed Metadata: only two of the three values are given, 21 (27%) and 8 (10%).
Types of Resources
[Bar chart] Type of Material in Digital Collection (horizontal axis: Number of Respondents, 0 to 90)

Type of Material       Total      In combination   Only
Other                  12 (13%)   12 (13%)         0
Moving Image           14 (16%)   13 (15%)         1 (1%)
Interactive Resource   16 (18%)   15 (17%)         1 (1%)
Sound                  26 (29%)   24 (27%)         2 (2%)
Text                   72 (81%)   69 (78%)         3 (3%)
Images                 79 (89%)   71 (80%)         8 (9%)
Controlled Vocabularies
Element
Subject Format Type Personal names Geographic names
Top three used Controlled Vocabulary (% of respondents who identified a controlled vocabulary)
LCSH (50%); LC TGM I (19%); AAT (7%); AAT (13%) MIME types (4%) AACR2 (7%)
LC TGM II (7%);
DCMI Type (8%); LC TGM II (7%); LC Name Authority File (47%)
LC Name Authority File (18%); LCSH (15%); Getty Thesaurus of Geographic Names (10%)
Descriptive Practice
Different traditions regarding
Inclusion of interpretive information
Granularity of description
Presentation of information resources
Shared problems / issues
How to provide context & collection description
What exactly to describe
Which metadata scheme(s) to use
Illustration – Coverlets (1 of 2)
Description: Digital image of a single-sized cotton coverlet for a bed with embroidered butterfly design. Handmade by Anna F. Ginsberg Hayutin. Source: Materials: cotton and embroidery floss. Dimensions: 71 in. x 86 in. Markings: top right hand corner has 1 1/2 in. x 1/2 in. label cut outs at upper left and right hand
side for head board; fabric is woven in a variation of a rib weave; color each of yellow and gray; hand-embroidered cotton butterflies and flowers from two shades of each color of embroidery floss - blue, pink, green and purple and single top 20 in. bordered with blue and black cotton embroidery thread; stitches used for embroidery: running stitch, chain stitch, French knot and back stitches; selvage edges left unfinished; lower edges turned under and finished with large gray running stitches made with embroidery floss. 21-53K bytes. Available via the World Wide Web.
Format: Epson Expression 836 XL Scanner with Adobe Photoshop version 5.5; 300 dpi;
Coverage: — Date Created: 2001-09-19 09:45:18; Updated: 20011107162451; Created: 2001-04-05;
Created: 1912-1920?
Type: Image
Illustration – Coverlets (2 of 2)
Description: Materials: Textile--Multi, Pigment—Dye; Manufacturing Process: Weaving-Hand, Spinning, Dyeing, Hand-loomed blue wool and white linen coverlet, worked in
overshot weave in plain geometric variant of a checkerboard pattern. Coverlet is constructed from finely spun, indigo-dyed wool and undyed linen, woven with considerable skill. Although the pattern is simpler, the overall craftsmanship is higher than 1934.01.0094A. - D. Schrishuhn, 11/19/99 This coverlet is an example of early "overshot" weaving construction, probably dating to the 1820's and is not attributable to any particular weaver. -- Georgette Meredith, 10/9/1973
Source: —
Format: 228 x 169 x 1.2 cm (1,629 g)
Coverage: Euro-American; America, North; United States; Indiana? Illinois?
Date: Early 19th c. CE
Type: cultural; physical object; original
OAI Protocol for Metadata Harvesting
‘Harvesting’ approach to interoperability at metadata level
Divides world into Metadata Providers & Service Providers
Builds on HTTP, XML, & Community Metadata Standards
Metadata Harvesting Model
How OAI-PMH Works
OAI “VERBS”
Identify
ListMetadataFormats
ListSets
ListIdentifiers
ListRecords
GetRecord
[Diagram] Service Provider (HARVESTER) sends HTTP Request (OAI Verb) to Metadata Provider (REPOSITORY); the repository returns HTTP Response (Valid XML)
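The request/response exchange on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's harvester: the base URL, the record identifier, and the trimmed sample response are made up for the example. Only the six verb names and the OAI-PMH 2.0 and Dublin Core namespace URIs come from the protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository base URL, for illustration only.
BASE_URL = "http://repository.example.org/oai"

# The six OAI-PMH verbs listed on the slide.
OAI_VERBS = {"Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord"}

# Namespace URIs defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_request(verb, **args):
    """Return the HTTP GET URL a harvester would send for one OAI-PMH verb."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    return BASE_URL + "?" + urlencode({"verb": verb, **args})


# A trimmed, made-up ListRecords response (valid XML, one record).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-11-12T00:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://repository.example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:coverlet-1</identifier>
        <datestamp>2004-10-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Coverlet</dc:title>
          <dc:type>Image</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract identifier, datestamp, and Dublin Core titles from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("oai:ListRecords/oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "datestamp": rec.findtext("oai:header/oai:datestamp", namespaces=NS),
            "titles": [t.text for t in rec.findall(".//dc:title", NS)],
        })
    return records


request_url = build_request("ListRecords", metadataPrefix="oai_dc")
records = parse_records(SAMPLE_RESPONSE)
```

In a real harvester the sample string would be replaced by the body of the HTTP response fetched from `request_url`.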
Why OAI-PMH for IMLS DCC Project
Offers low technical barrier options; primary cost is metadata
e.g., OAI-PMH itself, OAI Static Repository, mod_oai
Is a cross-domain, non-proprietary approach to interoperability
Already used by NSDL, OAIster, etc.
Seen as a way to bring content to attention of wider audience
37% of visits to State Library of New South Wales image collection via PictureAustralia (an OAI-PMH-based portal)
Facilitates metadata & metadata services research
What makes for good ‘shareable’ metadata?
Contrast & compare metadata designs & workflows
Explore normalization, enhancement, aggregated searching issues
OAI-PMH Issues
Harvesting vs. federated
  Harvested metadata aggregation always out of date, but
  Federated real-time performance dependent on weakest link
  Sorting, ranking, & de-duping easier with harvesting model
Potential scale issues
  Largest OAI-PMH provider serves 4 million records
  Largest OAI-PMH service provider < 10 million records
Integration into existing metadata workflow requires some investment – cost-to-benefit ratio still unclear
Practical metadata sharing issues:
  Persistent identifiers, date stamps, proper application of protocol
  Metadata quality, consistency, context, cross-walking, ...
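Date stamps and correct use of the protocol matter because a harvester limits staleness by re-harvesting incrementally: it asks only for records changed since its last visit (the `from` argument) and follows `resumptionToken` values until the list is complete. The sketch below shows that loop under stated assumptions: `fake_fetch` is a hypothetical stand-in for an HTTP GET, and the identifiers, dates, and token value are invented so the example runs offline.

```python
import xml.etree.ElementTree as ET

# OAI-PMH 2.0 namespace in ElementTree "{uri}tag" form.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_headers(fetch, metadata_prefix="oai_dc", from_date=None):
    """Yield (identifier, datestamp) for each record header, page by page.

    `fetch` is any callable that takes a dict of OAI-PMH arguments and
    returns the response XML as a string.
    """
    args = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date:
        args["from"] = from_date  # selective harvesting by datestamp
    while True:
        root = ET.fromstring(fetch(args))
        for header in root.iter(OAI + "header"):
            yield (header.findtext(OAI + "identifier"),
                   header.findtext(OAI + "datestamp"))
        token = root.findtext(f".//{OAI}resumptionToken")
        if not token:  # missing or empty token means the list is complete
            break
        # A resumed request carries only the verb and the token.
        args = {"verb": "ListIdentifiers", "resumptionToken": token}


CALLS = []  # records every request the harvester makes


def fake_fetch(args):
    """Hypothetical stand-in for an HTTP GET: serves two pages of made-up headers."""
    CALLS.append(dict(args))

    def header(n, day):
        return (f"<header><identifier>oai:example:{n}</identifier>"
                f"<datestamp>2004-11-{day:02d}</datestamp></header>")

    if args.get("resumptionToken") == "page2":
        body = header(3, 10) + "<resumptionToken/>"
    else:
        body = header(1, 2) + header(2, 5) + "<resumptionToken>page2</resumptionToken>"
    return ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
            f"<ListIdentifiers>{body}</ListIdentifiers></OAI-PMH>")


headers = list(harvest_headers(fake_fetch, from_date="2004-11-01"))
```

Passing the fetch function in as a parameter keeps the paging logic separate from network access, so the same loop can be exercised against canned responses or a live repository.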
Federated Searching Model
Alternative Approaches for Interoperability
Federated search models
  Library: NISO Z39.50
  Specimen / Natural History: DiGIR
  More homogeneous metadata schemes, query rules
Collaborative, sometimes proprietary project portals
  RLG Cultural Materials
  ArtStor
  GBIF, MaNIS, ...
Generally higher technical threshold; rely on higher level of metadata homogeneity & compliance
OAI-PMH as Complement to Other Approaches
OAI-PMH provides a lowest-common-denominator approach to sharing & interoperability
Insufficient for some high-level, domain-specific applications,
But useful for sharing across more heterogeneous communities & allowing participation with less technology
Portals can exploit combination of approaches
OAI-PMH metadata harvesters can normalize & augment metadata before sharing it with domain-specific federated search portals
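As one example of what normalization by a harvester could look like, the sketch below maps free-text date values, such as those in the two coverlet records shown earlier, to a year range that a search portal could index. The rules are illustrative assumptions, not the project's actual normalization: in particular, treating "early", "mid", and "late" as thirds of a century is an arbitrary choice made for this example.

```python
import re

# Offsets within a century for each qualifier. Splitting the century
# into thirds is an assumption made for this illustration.
QUALIFIER_SPAN = {"early": (0, 33), "mid": (33, 66), "late": (66, 99)}


def normalize_date(value):
    """Map a free-text date to a (start_year, end_year) tuple, or None if unrecognized."""
    text = value.strip()

    # Year range, optionally uncertain, e.g. "1912-1920?"
    m = re.fullmatch(r"(\d{4})\s*-\s*(\d{4})\??", text)
    if m:
        return int(m.group(1)), int(m.group(2))

    # Bare year, ISO-like date, or compact 14-digit timestamp,
    # e.g. "2001-09-19 09:45:18" or "20011107162451"
    m = re.fullmatch(r"(\d{4})(?:-\d{2}-\d{2}.*|\d{10})?", text)
    if m:
        year = int(m.group(1))
        return year, year

    # Century with optional qualifier, e.g. "Early 19th c. CE"
    m = re.search(r"(?:(early|mid|late)\s+)?(\d{1,2})(?:st|nd|rd|th)\s+c",
                  text, re.IGNORECASE)
    if m:
        start = (int(m.group(2)) - 1) * 100
        lo, hi = QUALIFIER_SPAN.get((m.group(1) or "").lower(), (0, 99))
        return start + lo, start + hi

    return None  # leave unrecognized values for manual review


# Date strings taken from the two coverlet illustrations.
samples = ["1912-1920?", "2001-09-19 09:45:18", "20011107162451", "Early 19th c. CE"]
normalized = {s: normalize_date(s) for s in samples}
```

Returning None for anything unrecognized, rather than guessing, keeps the original value available for review instead of silently replacing it.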
IMLS DCC Collection Registry (alpha)
Features:
Searchable
Browseable
An entry point for item-level searching
IMLS DCC Metadata Repository (alpha)
Currently Harvesting:
  27 Collections
  193,677 Records
Ongoing analysis of metadata
  Documenting practices
  Potential for normalization
  Implications for interface & search engine design
More Information
This presentation:
http://imlsdcc.grainger.uiuc.edu/Cole_MCN2004_OAI.ppt
Project Website: http://imlsdcc.grainger.uiuc.edu/
Project PI: Tim Cole, t-cole3@uiuc.edu
Project Coordinator: Sarah Shreeves, sshreeve@uiuc.edu
OAI-PMH resources:
http://www.openarchives.org/
Online OAI-PMH tutorial: http://www.oaforum.org/tutorial/
DLF OAI-PMH & shareable metadata best practices (under development): http://oai-best.comm.nsdl.org/