Fedora
Selecting and Implementing an Open Source Software Digital Repository
Jon Dunn, Digital Library Program, Indiana University
RLG Members’ Forum, December 12, 2003
Outline
What is a repository and why do we need it?
Background on IU environment
Background on Fedora
Fedora Digital Object Model
The Fedora Architecture
Fedora use at IU: EVIADA
Future Fedora use
Why a repository?
Isn’t what we have good enough?
Web servers, delivery systems
File servers
Databases
Hierarchical storage systems
Why do libraries need repositories?
A digital object is more than just a file!
Example: Electronic Book
Metadata
Delivery page image files (JPEG)
Hi-res page image files (TIFF)
Text file (TEI/XML)
A digital object is more than just a file!
Example: Archival Collection
EAD Finding Aid
DL Objects
Digital library “objects” have many parts
Metadata
Descriptive, administrative, structural, preservation, …
Preservation/archival files (several)
Delivery files (several)
Now: Good practice in file naming, directory organization, project documentation (not scalable!)
Future: Digital object repository
How do we keep them connected and organized?
Repository Purposes
Access
Web access to digital files and metadata
Services/applications for searching, browsing, transformation, etc.
Preservation
Secure storage for digital files and metadata
Services for integrity checking, migration, conversion, etc.
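As an illustration of the integrity checking mentioned above, here is a minimal sketch of a checksum ("fixity") check. It is not part of Fedora or of any IU system; the function names `sha256_of` and `check_fixity` and the manifest layout (relative path mapped to expected digest) are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the relative paths that are missing or whose digest changed.

    `manifest` maps a relative path to the digest recorded at ingest time
    (a hypothetical layout chosen for this sketch).
    """
    failures = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative_path)
    return failures
```

A repository would run a check like this on a schedule and report any failures, so that a damaged file can be restored from another copy before the damage spreads to backups.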
Data Persistence
Key is migration
Keeping the bits alive
Physical media Logical media format
Keeping the bits understandable
File format Metadata
Small “pockets” of digital content pose a problem for migration
DL Object Repository
Users and Applications: Access and Management
Repository System
Preservation version in MDSS
Delivery version(s) on web server
Metadata records
Motivation for a Digital Repository at Indiana University
Many pockets of digital content and metadata
Difficult to sustain
Variable tech support, replacement funding
Harder to preserve, migrate data forward to new software and hardware
Harder to budget for
Difficult to build common services and applications
Cross-collection search
Standard interfaces for viewing and playing content
Interfaces to course management and other IT services
OAI data providers
Preservation services (integrity checks, etc.)
Not a New Model…
Digital Repository
Common system for storing, managing, and providing access to digital content and metadata
Integrated Library System
Common system for storing, managing, and providing access to MARC records
“Digital Repository” vs. “Institutional Repository”
Digital repository
Common storage for digital content and metadata
Basic infrastructure component: “plumbing”
Institutional repository
Often implies focus on one application: institutional content, research output
e.g. MIT DSpace: “capture, store, index, preserve, and redistribute the intellectual output of a university’s research faculty in digital formats”
Background: IU Digital Library Program
Mission:
“…dedicated to the production, maintenance, distribution, and preservation of a wide range of high quality networked information resources for scholars and students at Indiana University and elsewhere”
IU Digital Library Program
Established in 1997
Collaborative venture:
University Libraries (IUL)
University Information Technology Services (UITS)
School of Library and Information Science (SLIS)
School of Informatics
Funding provided by Libraries and UITS
University-wide responsibility: 8 campuses
Responsibility beyond just the Libraries
IU Digital Library Program: Areas of Responsibility
Digital conversion
Metadata
Usability / UI design
Infrastructure
Software development
DL research
Both direct involvement and consulting roles
IU Digital Library Program: Staff
12.5 full-time equivalent (FTE) permanent staff
3 librarians
9 professional staff: IT, digital conversion, UI/usability
1 support staff (.5 FTE)
10 grant-funded IT staff
Student staff, including graduate assistants and interns from the School of Library and Information Science and Computer Science
Object Types at IU
Books
Manuscripts
Photographs
Art images
Music audio
Video
Sheet music
Musical score images
Music notation files
…and more
Questions In Repository Planning at IU
Scope
Just library?
Museums and archives?
All campuses?
Other digital content
Instructional (e.g. faculty materials in OnCourse)
Business (PR, Athletics, etc.)
Funding model
Standards
Minimum requirements for content formats and metadata
Tools/services/applications
What else is needed to make a repository useful/usable for preservation and access?
Repository Evaluation Criteria
Flexibility
Not a rigid data model
Support for many media types, complex digital objects
Not locked into one technology platform (OS, database)
Extensibility
Use of modern technologies
Easy integration with other systems/tools
Means of extension/modification
Support for DL standards, particularly metadata
Sustainability
Supportability
Cost
Fedora
FEDORA
Flexible Extensible Digital Object and Repository Architecture
Fedora - Background
Began as CS research project at Cornell – 1997-98
Architecture
Reference implementation
UVa Libraries became interested – 2000
Trying to create a DL architecture
No commercial solutions found
Mellon-funded project – 2001-2003
Joint UVa/Cornell project
Update technologies
Make use of relational database
Make more production-ready
IU member of “deployment group” engaged in testing
Fedora - Technical Environment
Open Source software
Written in Java
OS Platforms:
Windows
Linux / Unix
Mac OS X (not yet officially supported)
Database support:
MySQL
McKoi
Oracle8i, Oracle9i
What does Fedora do?
Manages files or references to files that make up digital objects
Manages associations between objects and interfaces
Invokes behaviors of objects
Basic DL “plumbing”
What does Fedora not do?
Searching/browsing of metadata and content
End-user UI for display/navigation of metadata and content
Cataloging tools
Preservation services
…
Fedora is DL “plumbing”…
Not an out-of-the-box complete DL system
Fedora 1.2 Software Feature Set
Open Fedora APIs
Repository as web services
Flexible Digital Object Model
Content View: objects as bundle of items (content and metadata)
Service View: objects as a set of service methods (“behaviors”)
Extensible functionality by associating services with objects
Repository System
Core Services: Management, Access/Search, OAI-PMH
Storage: XML object store; relational db object cache; relational db object registry
Mediation – auto-dispatching to distributed web services for content transformation
Auto-Indexing – system metadata and DC record of each object
HTTP Basic Authentication and Access Control
Built-in disseminator services: XSLT transformation, image manipulation, XML-to-PDF
Content Versioning
Automatic version control (saves version of content/metadata when modified)
Enables date-time stamped API requests (see object as it looked at a point in time)
Clients
Fedora Administrator: GUI client to create/maintain objects
Default Web browser interface: search; access objects via default disseminator
Command line utilities (batch load, ingest, purge, others)
Migration Utility – mass export/ingest
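The content versioning feature listed above (save a version on every modification, then answer date-time stamped requests) can be illustrated with a small sketch. This is not Fedora's implementation or API; the class `VersionedDatastream` and its methods are hypothetical names chosen for the example, and it assumes versions are saved in chronological order.

```python
from bisect import bisect_right
from datetime import datetime


class VersionedDatastream:
    """Keeps every saved version and answers 'as of' requests."""

    def __init__(self) -> None:
        self._timestamps: list[datetime] = []
        self._contents: list[bytes] = []

    def save(self, content: bytes, when: datetime) -> None:
        # Versions must arrive in increasing time order so that the
        # timestamp list stays sorted for the binary search in as_of().
        if self._timestamps and when <= self._timestamps[-1]:
            raise ValueError("versions must be saved in increasing time order")
        self._timestamps.append(when)
        self._contents.append(content)

    def as_of(self, when: datetime) -> bytes:
        """Return the version that was current at `when`."""
        index = bisect_right(self._timestamps, when)
        if index == 0:
            raise LookupError("no version existed at that time")
        return self._contents[index - 1]
```

Nothing is overwritten: a request without a timestamp would simply ask for the latest version, while a request carrying a date sees the object as it looked then.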
The Fedora Object Model
[Diagram: a digital object with a Persistent ID (PID), Disseminators (Behavior Definition, Behavior Mechanism), System Metadata, and Datastreams]
PID – persistent unique identifier
Datastreams – represent content or metadata
System Metadata – manage and track the object in the system
Disseminator(s) – a service for transforming or presenting the object
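The parts of the object model named on this slide can be sketched as plain data structures. This is an illustrative model only, not Fedora's storage format or Java classes; every class and field name below (`Datastream`, `location`, `lastModified`, and so on) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Datastream:
    """A piece of content or metadata belonging to an object."""
    ds_id: str
    mime_type: str
    location: str  # file path or URL (hypothetical field)


@dataclass
class Disseminator:
    """Pairs a behavior definition (interface) with a behavior mechanism (implementation)."""
    behavior_definition_pid: str
    behavior_mechanism_pid: str


@dataclass
class DigitalObject:
    """A digital object: PID, datastreams, disseminators, system metadata."""
    pid: str
    datastreams: dict[str, Datastream] = field(default_factory=dict)
    disseminators: list[Disseminator] = field(default_factory=list)
    system_metadata: dict[str, str] = field(default_factory=dict)

    def add_datastream(self, datastream: Datastream) -> None:
        """Attach a datastream and record the change in system metadata."""
        self.datastreams[datastream.ds_id] = datastream
        self.system_metadata["lastModified"] = datetime.now(timezone.utc).isoformat()


# Usage: an electronic book with a TEI text and one page image (made-up values).
book = DigitalObject(pid="demo:1")
book.add_datastream(Datastream("TEI", "text/xml", "book.xml"))
book.add_datastream(Datastream("PAGE1", "image/jpeg", "page1.jpg"))
```

The point of the model is that the files, their metadata, and the services that present them travel together under one identifier instead of being connected only by file naming conventions.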
Object Model Example: Image Objects
Two File Image Object
Data
Hi Resolution Version: tif
Low Resolution Version: jpg
MrSID File Image Object
Data
MrSID File
Basic Image Interface: Behavior Definitions
getHighResolutionTIF
getLowResolutionJPG
Implementations: Behavior Mechanisms
Two File Image Object
getHighResolutionTIF: returns high resolution TIF
getLowResolutionJPG: returns low resolution JPG
MrSID Image Object
getHighResolutionTIF: processes the MrSID file to return a high resolution TIF file of the image
getLowResolutionJPG: processes the MrSID file to return a low resolution JPG of the image
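The image example above maps naturally onto an interface with two implementations. The sketch below is an analogy in Python, not Fedora code: the class names are invented, and `converter` is a hypothetical callable standing in for a real MrSID decoding service.

```python
from abc import ABC, abstractmethod


class BasicImageInterface(ABC):
    """Behavior definition: the methods every image object promises."""

    @abstractmethod
    def get_high_resolution_tif(self) -> bytes: ...

    @abstractmethod
    def get_low_resolution_jpg(self) -> bytes: ...


class TwoFileImage(BasicImageInterface):
    """Behavior mechanism for objects that store both versions as files."""

    def __init__(self, tif: bytes, jpg: bytes) -> None:
        self._tif, self._jpg = tif, jpg

    def get_high_resolution_tif(self) -> bytes:
        return self._tif

    def get_low_resolution_jpg(self) -> bytes:
        return self._jpg


class MrSidImage(BasicImageInterface):
    """Behavior mechanism that derives both versions from one MrSID file.

    `converter` is a hypothetical callable (source_bytes, target_format) -> bytes.
    """

    def __init__(self, sid: bytes, converter) -> None:
        self._sid, self._convert = sid, converter

    def get_high_resolution_tif(self) -> bytes:
        return self._convert(self._sid, "tif")

    def get_low_resolution_jpg(self) -> bytes:
        return self._convert(self._sid, "jpg")


def thumbnail_for(image: BasicImageInterface) -> bytes:
    """Client code depends only on the behavior definition."""
    return image.get_low_resolution_jpg()
```

A delivery application written against `BasicImageInterface` does not need to know how a given object stores its image, which is what lets a repository change storage formats without breaking the applications built on it.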
FEDORA’s Interface Implementation
Behavior Definition Object: Persistent ID (PID), System Metadata, Datastreams (Behavior Definition Metadata)
Data Object: Persistent ID (PID), Disseminators, System Metadata, Datastreams
Behavior Mechanism Object: Persistent ID (PID), System Metadata, Datastreams (Service Binding Metadata (WSDL))
Fedora Architecture
[Architecture diagram; recoverable labels]
Clients: Client Application, Web Browser, Batch Program, Server Application (HTTP, SOAP)
Web Service Exposure Layer: Manage, Access, Search, OAI Provider
Session Management, User Authentication
Management Subsystem: Object Mgmt, Component Mgmt, Object Validation
Security Subsystem: Policy Mgmt, Policy Enforcement
Access Subsystem: Object Reflection, Object Dissemination
Storage Subsystem: Digital Objects, Datastreams, XML Files, Relational DB
External Content Retriever: Content from External Content Sources (HTTP, FTP)
Other labels: Remote Service (HTTP, SOAP), Local Service, Users/Groups, PID Generation, Policies
Client and Web Service Interactions
[Diagram: users work through client applications, a web browser, or a server application; these call the Fedora Service APIs of the Fedora Repository System, whose External Service Dispatch calls the APIs of Content Transform Services]
Current Fedora Use at IU: EVIADA
EVIADA
Ethnomusicological Video for Instruction and Analysis Digital Archive (!)
Goals
Digital archive of ethnomusicology field video
Instructional tool
Partnership with University of Michigan
Funding from Andrew W. Mellon Foundation
Current Fedora Use at IU: EVIADA
Complex objects
Many versions of content
Original analog video
Digital Betacam tape
Digital file master – 50 Mbps MPEG-2
Derivative files: MPEG-1, QuickTime, Real, ???
Many types of metadata
Collection-level descriptive metadata
Annotations: event, scene, action
Technical, preservation, digital provenance
Using METS+MODS+MARC
Current Fedora Use at IU: EVIADA
Fedora used to manage content and metadata
Streaming video files will be “redirected content”
Web application built with Java, Struts framework, Oracle9i XDB
Web-based annotation tool
Creates METS structmap and MODS records
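To make the idea of a METS structmap for event, scene, and action annotations concrete, here is a minimal sketch that builds one with Python's standard library. It does not reproduce the EVIADA annotation tool or its METS profile; the function name, the `TYPE` values, and the sample labels are assumptions. Only the METS namespace and the `structMap` and `div` elements with their `TYPE` and `LABEL` attributes come from the METS schema itself.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def build_struct_map(event_label: str, scenes: dict[str, list[str]]) -> ET.Element:
    """Build a METS structMap with nested event > scene > action divisions."""
    struct_map = ET.Element(f"{{{METS_NS}}}structMap", {"TYPE": "logical"})
    event = ET.SubElement(
        struct_map, f"{{{METS_NS}}}div", {"TYPE": "event", "LABEL": event_label}
    )
    for scene_label, actions in scenes.items():
        scene = ET.SubElement(
            event, f"{{{METS_NS}}}div", {"TYPE": "scene", "LABEL": scene_label}
        )
        for action_label in actions:
            ET.SubElement(
                scene, f"{{{METS_NS}}}div", {"TYPE": "action", "LABEL": action_label}
            )
    return struct_map


# Usage with made-up annotation labels.
struct_map = build_struct_map("Festival", {"Procession": ["Drumming", "Dance"]})
xml_text = ET.tostring(struct_map, encoding="unicode")
```

In a full record each division would also point at a time range in the video file and at a MODS record describing it; those links are left out here to keep the sketch short.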
Future Fedora Software Releases
December 2003 – December 2004
Fedora Object XML (FOXML)
Internal storage format; direct expression of Fedora object model
Better support for relationships (“kinship” metadata)
Better support for audit trail (event history)
Format identifiers for dynamic service binding
Shibboleth authentication
Policy Enforcement
XACML expression language
Fedora policy enforcement module
Web interface for easy content submission
Batch object modification utility
Administrative Reporting
Object Event History (ABC/RDF disseminations)
Better support for “collections”
New ingest and export formats (METS 1.3, DIDL)
Future Fedora Development Proposals
Digital Library in a Box
Full-featured DL application with “Fedora inside”
Optimized for common set of content types
Fedora Power Server
Integrity Management Tools
Service and link liveness checker
Fault Tolerance
Mirroring and Replication
Peer-to-peer interoperability features
Repository clustering
Load balancing
Object Creation Tools
Workflow applications based on content models
Web interface for document/content submission
Implementing Fedora at IU beyond EVIADA: Next Steps
Define scope
Define content, metadata standards
Import existing content into Fedora
Initial focus on images?
Define and implement applications
Example: Common image search service
Ongoing process
Who should use Fedora?
Now
Willingness to do programming, development
Willingness to be on the “bleeding edge”
Sufficient IT / DL staff
Interested in cooperating with others to define best practices
Future: Lower barriers to entry
Thanks to:
Corey Keith, Library of Congress
Sandy Payette, Cornell University
More information on Fedora:
www.fedora.info
My contact information:
Jon Dunn, jwd@indiana.edu, 812-855-0953