digital archival storage
for the University of Michigan
Project partnership with Google publicly
announced in December 2004.
Bound print collection, about 7 million
volumes, to be scanned over an estimated
four to six years.
Direct scanning costs are borne by Google.
UM receives a copy of all digital files,
including OCR and metadata, which we
may use to build services.
UM may share files with other research
libraries under formal agreements.
UM may not redistribute content en masse
to other commercial services or the public.
All uses are subject to copyright.
At about 320 pages per volume, we’ll have 2.2 billion
pages; at 2.01 files per page, about 4.5 billion files.
At about 6000 pages per GB or 54.6 MB
per volume, we’ll have 380 TB of data.
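The sizing arithmetic can be checked with a short sketch. The inputs are the figures quoted above; the 1024 MB per GB conversion is an assumption, chosen because it reproduces the 54.6 MB per volume figure. Note that 7 million volumes at 320 pages gives about 2.2 billion pages, and 2.01 files per page roughly doubles that count.

```python
# Back-of-the-envelope sizing from the figures on this slide.
VOLUMES = 7_000_000
PAGES_PER_VOLUME = 320
FILES_PER_PAGE = 2.01
PAGES_PER_GB = 6000
MB_PER_GB = 1024  # assumption: binary units, which reproduce 54.6 MB/volume

pages = VOLUMES * PAGES_PER_VOLUME
files = pages * FILES_PER_PAGE
mb_per_volume = PAGES_PER_VOLUME / PAGES_PER_GB * MB_PER_GB
total_gb = pages / PAGES_PER_GB

print(f"pages:         {pages / 1e9:.2f} billion")
print(f"files:         {files / 1e9:.2f} billion")
print(f"MB per volume: {mb_per_volume:.1f}")
print(f"total data:    {total_gb:,.0f} GB")
```

Depending on whether a TB is taken as 1000 GB or 1024 GB, the total lands in the neighborhood of the 380 TB quoted above.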
Production at full volume can scan about
35K volumes (1867 GB) per week, which
averages to a sustained 3.16 MB per
second for four years.
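The sustained ingest rate follows from the same per-page figures. A minimal sketch, again assuming 1024 MB per GB:

```python
# Sustained ingest rate implied by 35K volumes per week.
VOLUMES_TOTAL = 7_000_000
VOLUMES_PER_WEEK = 35_000
PAGES_PER_VOLUME = 320
PAGES_PER_GB = 6000
MB_PER_GB = 1024  # assumption: binary units
SECONDS_PER_WEEK = 7 * 24 * 3600

gb_per_week = VOLUMES_PER_WEEK * PAGES_PER_VOLUME / PAGES_PER_GB
mb_per_second = gb_per_week * MB_PER_GB / SECONDS_PER_WEEK
weeks_to_finish = VOLUMES_TOTAL / VOLUMES_PER_WEEK

print(f"GB per week:   {gb_per_week:.0f}")
print(f"MB per second: {mb_per_second:.2f}")
print(f"duration:      {weeks_to_finish:.0f} weeks "
      f"({weeks_to_finish / 52:.1f} years)")
```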
Not too many libraries do this!
Characteristics of the Data
Extremely well-defined data conventions:
image files are TIFF or JPEG 2000, OCR
files and metadata are UTF-8 text.
A true archival system; indefinite retention
requires its own set of best practices.
Files are largely static.
Much material is in-copyright (security is
MBooks (web server farm/NAS)
Periodic fixity check (checksum validation)
Full-text search? (how?!)
Textual analysis or other research?
Anything beyond MBooks is likely to be
either compute- or IO-intensive, or both.
This is how you annoy storage vendors!
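The periodic fixity check listed above amounts to recomputing each file's checksum and comparing it with a stored value. The sketch below is illustrative only: the manifest format (a dict of path to expected digest), the function names, and the choice of SHA-256 are all assumptions, not a description of the production system.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large page images never sit fully in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest, algorithm="sha256"):
    """Return (path, reason) pairs for files that are missing or altered.

    `manifest` is a hypothetical mapping of file path -> expected hex digest.
    """
    failures = []
    for path, expected in manifest.items():
        if not Path(path).is_file():
            failures.append((path, "missing"))
        elif file_checksum(path, algorithm) != expected:
            failures.append((path, "checksum mismatch"))
    return failures
```

Because every byte of every file is read on each pass, a full check is IO-intensive at this scale, which is the point made above.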
Engagement with Office of the Provost
from the beginning; a University project
housed in the Library
Our Library IT environment has unusual
depth due to our mature digital library.
Consulting relationship with academic
computing and campus storage experts
RFI provided vendor landscape
RFP (very few Yes/No questions!)
Cost Model from RFI Responses
Model includes various ramp-up patterns,
hardware replacement periods, starting cost,
and rate of cost decrease.
Cost per GB from selected RFI responses:
average = median = $7
Ramp-up too fast means the initial investment is huge,
with no benefit from Moore’s Law.
Ramp-up too slow means simultaneous growth and
replacement; costs peak at the replacement interval.
Four years is plenty fast, thank you!
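The trade-off between ramp-up speed and replacement can be illustrated with a toy version of such a model. This is a sketch under stated assumptions, not the model actually used: only the $7 per GB starting cost and the roughly 380 TB total come from the slides, while the linear ramp-up, the four-year replacement period, and the 25% annual price decline are hypothetical.

```python
def yearly_costs(total_gb, ramp_years, horizon_years, replace_after,
                 start_cost_per_gb, annual_decline):
    """Hardware spend per year, assuming capacity is added linearly."""
    purchased = {}  # year -> GB bought that year (growth plus replacement)
    spend = []
    for year in range(horizon_years):
        gb = 0.0
        if year < ramp_years:
            gb += total_gb / ramp_years              # growth
        gb += purchased.get(year - replace_after, 0.0)  # replacement
        purchased[year] = gb
        price = start_cost_per_gb * (1 - annual_decline) ** year
        spend.append(gb * price)
    return spend


TOTAL_GB = 380_000  # about 380 TB, from the sizing estimate
plans = {"fast (1 yr)": 1, "medium (4 yr)": 4, "slow (8 yr)": 8}
for label, ramp in plans.items():
    costs = yearly_costs(TOTAL_GB, ramp, horizon_years=12, replace_after=4,
                         start_cost_per_gb=7.0,   # from the RFI responses
                         annual_decline=0.25)     # hypothetical rate
    print(f"{label:14s} first year ${costs[0]:>12,.0f}  "
          f"total ${sum(costs):>12,.0f}")
```

With these assumed inputs, the one-year ramp buys everything at the starting price, while the eight-year ramp is still adding capacity when the first replacements come due.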
Potential Funding Sources
Development of CIC shared digital
repository: multiple redundant sites and
some staff funded by pay-to-play model
Again, engagement with Office of the
Provost from the beginning
“Future-proof” higher-cost investment with
proven vendor and incremental upgrades?
“Throwaway” lower-cost solution with
cutting-edge vendor and forklift upgrade?
Temporary solution (Linux NAS server and
commodity SCSI/SATA arrays) has allowed
project to proceed and further inform us
on the decisions we’ll make.
Must have simultaneous access from potentially
many front-end servers (cluster), so almost
certainly a NAS component.
NAS? NAS gateway to SAN? NAS/SAN hybrid?
Probably most promising in the flexibility
department are the clustered NAS systems with
SAS or SATA back ends.
Keep our options open; the right vendor could
make all the difference.
Highlights of the RFP
Does not ask about compliance with exact
specifications, but asks for detailed explanations
of system architecture: all of the usual, and…
Recommended upgrade path given our
estimated growth pattern and project timeline
Description of how load balancing and service
are impacted as system is scaled and maintained
How virtualization is implemented
Contact me if you’d like to have a copy.
Proposal Evaluation Criteria
Scalability of capacity, performance, and
Proven models/methods for growth
Flexibility in application
RFP responses due (Monday!)
Space, support, backup
Work in CIC on governance and funding
model for shared digital repository
Continued development of MBooks
functionality and integration with existing
digital library resources