Dartmouth College
                HATHI TRUST
           A Shared Digital Repository

Digital Preservation, HathiTrust,
 and the Reimagination of the
        Library Landscape
            Jeremy York
           August 5, 2010
• Digital Preservation in U.S.
• HathiTrust
   –   About HathiTrust
   –   Content
   –   What we do (services)
   –   Governance
   –   Partnership & Resources
• Google Settlement
• Publishing
• Changing Library Landscape
Books and Journals
• Portico
   – Centralized
   – Journals
   – Source files, mainly focused on XML, highly controlled
• LOCKSS
   – Distributed
   – Journals
   – Web files, not source images or XML
• HathiTrust
   – Centralized
   – Books and Journals
   – Master image and OCR files

Archives
• Internet Archive
   – Centralized
   – Web files
• MetaArchive (NDIIPP)
   – Distributed
   – Private LOCKSS Network
   – Web files
• International Internet Preservation Consortium
   – Distributed
   – Harvesting tools, Access, Preservation strategies

Data
• ICPSR
   – Centralized
   – Social science data
• DATA-PASS (NDIIPP)
   – Distributed
   – Social science data
• GeoMAPP (NDIIPP)
   – Distributed
   – Geospatial data
   – State governments

OCLC – Digital Archive
• Centralized
• Master files, web archiving
• CONTENTdm, custom

LOCKSS, DuraCloud, DSpace, Fedora
National Digital Information Infrastructure and Preservation Program (NDIIPP)
Mission: Develop a national strategy to collect, preserve and make available
           significant digital content, especially information that is created in
           digital form only, for current and future generations.
• Since 2000
• Broad collaborations with institutions and organizations (e.g., OCLC, Portico)
• Funding (Establishing a network, Preserving Creative America, Preserving State Government Information)
• Standards/Best Practices
• Tools
      o JHOVE2 (validation)
      o Chronopolis (data grid framework)
      o Dataverse (management, dissemination, exchange, and citation of virtual collections (dataverses) of quantitative data)
      o BagIt (transfer utilities: creation, manipulation and validation of bags)
      o Hub and Spoke (repository interoperability)
      o FITS (bundle of identification, validation and metadata extraction tools)
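As an illustration of what "creation and validation of bags" means in practice, the following is a minimal sketch of a BagIt-style package written with the Python standard library only. It is not the Library of Congress transfer utility itself; the function names `make_bag` and `validate_bag` and the sample file name are hypothetical, and the sketch assumes a flat payload and a single SHA-256 manifest.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def make_bag(bag_dir: Path, payload: Dict[str, bytes]) -> None:
    """Write payload files under data/ plus the bag declaration and manifest.

    Hypothetical helper: a flat payload only, no tag manifests or bag-info.txt.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    # The bag declaration names the spec version and tag-file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def validate_bag(bag_dir: Path) -> List[str]:
    """Return payload paths that are missing or whose checksum does not match."""
    problems = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append(rel_path)
        elif hashlib.sha256(target.read_bytes()).hexdigest() != digest:
            problems.append(rel_path)
    return problems
```

A receiving institution would run something like `validate_bag` after transfer and accept the content only if the returned list is empty.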
         HathiTrust Digital Library
• Digital Repository
  – Initial focus on digitized book and journal content
  – “Light” archive
• Collections and Collaboration
  –   Comprehensive collection
  –   Shared strategies
  –   Local services
  –   Public Good
                 Current Partners
–   Columbia University
–   New York Public Library
–   University of California system
–   CIC (Committee on Institutional Cooperation)
    University of Chicago       University of Minnesota
    University of Illinois      Northwestern University
    Indiana University          Ohio State University
    University of Iowa          Pennsylvania State University
    University of Michigan      Purdue University
    Michigan State University   University of Wisconsin-Madison
– University of Virginia
– Yale University
Content Distribution

             6,383,209 – Total
             1,234,088 – Public Domain
                * As of August 5, 2010
Language Distribution (1)
                  [chart] * As of July 25, 2010
Language Distribution (2)
                  [chart] The next 40 languages make up ~13% of total
                  * As of July 25, 2010
                  [untitled chart] * As of July 25, 2010
Originating Institution
                  [chart] * As of July 25, 2010
Content over time
                  [chart] * As of July 25, 2010
Content Growth
                  [chart]
What we do
                 Services (1)
• Ingest
  – Google, Internet Archive
  – Working toward sustainable model for ingest of
    content from diverse sources
• Long-term preservation
  – Bit-level, migration
  – Standard and open formats (ITU G4 TIFF,
    JPEG2000, JPG, Unicode)
  – Validation, integrity, redundancy
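The combination of integrity checking and redundancy can be sketched as follows: a checksum recorded at ingest is compared against each stored copy, and a copy that still matches is chosen as the repair source. This is a minimal sketch, not a description of HathiTrust's actual storage code; the site names, the choice of SHA-256, and the function names are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Optional


def sha256_hex(data: bytes) -> str:
    """Checksum of one stored object."""
    return hashlib.sha256(data).hexdigest()


def audit_replicas(expected_digest: str, replicas: Dict[str, bytes]) -> Dict[str, bool]:
    """Map each (hypothetical) storage site to True if its copy is intact."""
    return {
        site: sha256_hex(content) == expected_digest
        for site, content in replicas.items()
    }


def pick_repair_source(audit: Dict[str, bool]) -> Optional[str]:
    """Return the first site holding an intact copy, or None if all copies failed."""
    for site, intact in audit.items():
        if intact:
            return site
    return None
```

Run periodically, an audit like this detects silent bit-level corruption; redundancy is what makes the damage repairable rather than merely detectable.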
                      Services (2)
• Preservation…with Access
• Brings concerns of research libraries to bear on the
  way the scholarly record is cared for and made available
   –   Scholarly Resource
   –   Bibliographic Search
   –   Full-text search
   –   Collections
   –   Full-PDF download of public domain
                  Services (4)
• Rights Management
  – Rights Database
  – Copyright review
     • US 1923-1963
     • 188k candidates, 85k reviewed
     • 60% in public domain
• Data Distribution
  – Metadata files, Bib API, Data API
• Print on Demand
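For the Bib API mentioned under Data Distribution, a request is an HTTP GET against a URL that encodes the identifier type and value. The sketch below only builds such a URL; the pattern shown (`/api/volumes/{brief|full}/{id type}/{id}.json` on catalog.hathitrust.org) reflects the publicly documented form as best understood here and may differ from what was offered in 2010. The helper name `bib_api_url` is hypothetical.

```python
from urllib.parse import quote

# Assumed base URL of the bibliographic API; verify against current documentation.
BIB_API_BASE = "https://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type: str, value: str, detail: str = "brief") -> str:
    """Build a Bib API request URL for one identifier.

    id_type: identifier scheme, e.g. "oclc", "lccn", "isbn" (assumed names).
    detail:  "brief" or "full" record detail.
    """
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    return f"{BIB_API_BASE}/{detail}/{id_type}/{quote(str(value), safe='')}.json"
```

A library catalog could call such a URL for each of its OCLC numbers to find out whether a digitized copy exists and what its rights status is.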
                Services (5)
•   Community Development Environment
•   Non-Google Ingest
•   Non-Book/Non-Journal Ingest
•   Computational Research
• Leverage partner resources and input to
  create and maintain the library of the future
• This is our library
• The more we use it, the better it will become
• Budget/Finances, Decision-making
• Strategic Advisory Board
• Guidance on Policy, Planning

Partnership & Resources
• Funded for an initial 5 years with
  base-funding from partners
• 3-year review of governance and sustainability
• Budget – separately held within
  UMich budget system
• Cost Models
   – Per GB cost of storage per year with a one-time fee on new
     content to build a capital fund
   – Volume overlap
                    Cost Model 1
Reasonable costs of sustaining the archive, includes cost of
  replacement, capital fund
               Cost Model 1
• Economies of scale keep costs low
  – $0.145/volume/year for Google-digitized
  – about $0.45/volume/year for IA-digitized
• Advantages not fully known until you jump in
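To make the per-volume rates concrete, here is a small worked calculation of a partner's annual cost under Cost Model 1. The two rates are taken from the slide (the Internet Archive rate is stated there as approximate); the volume counts in the usage note are hypothetical, and the one-time fee on new content is left out because the slides give no figure for it.

```python
from decimal import Decimal

# Rates from the slide, in US dollars per volume per year.
GOOGLE_RATE = Decimal("0.145")
IA_RATE = Decimal("0.45")  # stated as "about" $0.45 on the slide


def annual_cost(google_volumes: int, ia_volumes: int) -> Decimal:
    """Annual storage cost for a partner's deposited volumes (one-time fee excluded)."""
    return GOOGLE_RATE * google_volumes + IA_RATE * ia_volumes
```

For a hypothetical partner with 100,000 Google-digitized and 10,000 IA-digitized volumes, `annual_cost(100_000, 10_000)` comes to $14,500 + $4,500 = $19,000 per year.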
                Cost Model 2
• Shared space to deal with shared problems
  – Use HathiTrust as part of broader library strategies
• Beginning to see benefits of aggregating this
  body of materials together
  – Overlap, collection development
  – Coordinated print management
  – Begin to ask “What is missing?”
                     Cost Model 2
                For public domain volumes:

                For a given in-copyright volume:
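The formulas for this slide did not survive in the text. The sketch below is therefore an assumption, not the slide's content: it supposes that the cost of public domain volumes is split equally among all partners, and that the cost of each in-copyright volume is split among the partners that hold that volume in print (the "volume overlap" idea). All names and numbers are hypothetical.

```python
from fractions import Fraction
from typing import Iterable


def partner_share(
    cost_per_volume: Fraction,
    public_domain_volumes: int,
    num_partners: int,
    holders_per_ic_volume: Iterable[int],
) -> Fraction:
    """One partner's annual share under the ASSUMED overlap-based model.

    holders_per_ic_volume: for each in-copyright volume this partner holds
    in print, the number of partners (including this one) holding it.
    """
    # Assumption: public domain costs are divided equally among all partners.
    pd_share = cost_per_volume * public_domain_volumes / num_partners
    # Assumption: each in-copyright volume's cost is divided among its holders.
    ic_share = sum(
        (cost_per_volume / holders for holders in holders_per_ic_volume),
        Fraction(0),
    )
    return pd_share + ic_share
```

Under these assumptions, a widely held in-copyright volume costs each holder little, while a uniquely held one is carried by a single partner.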

•   Share in costs of curation
•   Share in uses of relevant materials
•   Voice in future directions
•   Free riders?
• Staff/Expertise – highly integrated
   – Project managers, IT and communications
     staff, copyright experts, administrators (UM,
     Indiana and UC taking the lead)
• Working groups
• Shared development space
             HathiTrust Functional Framework
• e-Commerce
   – Print on Demand
   – Financial … of partners
• Content Ingest
   – Transformation
   – Validation
• Content Access
   – PageTurner
   – Collection Builder
   – Large-scale Search
   – Research Center
   – Catalog
   – APIs
• Quality Review
• Content Certification
• User Services
   – Usability
   – User support
• Outreach
   – Project website
   – Monthly …
   – Papers and presentations
   – Communication with potential partners
   – Surveys, general inquiries
• Legal
   – Risk management (use of materials)
   – Partner agreements
   – Advocacy
• … evaluation and audit (e.g., …
             Working Groups
• Quality
• Discovery Interface (with OCLC)
• Collections
• Communication
• Usability
• Storage
• Research Center
            Google Settlement (1)
•   2005: Authors Guild and AAP sued Google
•   Google claimed fair use
•   Settlement – 2008
•   Amended – Nov 2009
•   Works covered
    – registered with U.S. Copyright Office, or published in Canada, UK, or Australia
• Works not covered
    – public domain, published after 5 Jan 2009
             Google Settlement (2)
• Google continues scanning
• In copyright, non-commercially available out-of-print work
   – Sell individual access, any book retailer - 63% of revenue to rights
     holders, distributed by BRR
   – display up to 20%
   – Copy & paste and printing
   – Rights holders can open access, distribute under CC, set printing limits
   – Institutional subscription (available to libraries, fee based on FTE)
• Includes unclaimed works
   – BRR required to search for rights holders and hold revenue on their behalf
• Public access terminals
• Cash payments to Rightsholders whose works were scanned
  before May 5, 2009
                Book Rights Registry
• Book Rights Registry
    – Represent the interests of the Rightsholders – equal
      representation of Author and Publisher sub-classes on board;
      one author and publisher representative from US, UK, Canada,
      Australia; court-appointed representative for rights holders of
      unclaimed works
    – Establish and maintain a database of contact information for
      authors and publishers;
    – Use commercially reasonable efforts to locate Rightsholders;
    – Distribute payments received from Google for the
      Rightsholders’ share of revenues; and
    – Assist in the resolution of disputes between Rightsholders.
    – Funded by Google (initial $34.5 million, ongoing percentage of revenue)

        Settlement for HathiTrust
• Complementary
   – Settlement provides access to covered works,
     HathiTrust is preservation, trust for the future
   – Research Center (75% of Google Book Search scanned
     from HathiTrust partner libraries)
• Specifically sanctions
   – Section 108 uses, access for users with print
     disabilities, computational research
• Does not allow
   – Fair use, sale of access, interlibrary loan, e-reserves,
     use in course management systems
              Publishing
• Libraries would like to buy more eBooks
• Cost is high
• Not good models for consortia (multiple users)
• Move to on-demand purchase, leasing of
• Do we need to own it?
     Changing Library Landscape
• Leverage collective resources, expertise
  – Drive costs down
  – Increase discoverability, use
  – Improve strength of archiving
  – Reduce redundancy of collections (digital and
    print), effort
  – Address collective challenges
• Focus on local resources and services
• Redefine who we are, what we provide
  – Collections, research
Thank you!
