Digitization and Preservation
Digitization
Selection of materials
Digitization methods
File formats (text, image, sound and video)
Preservation issues
Methods of preservation
Examples of preservation
Digital library materials:
Born-digital materials
Digitized materials
Many libraries have a history of digitizing materials such as newspapers, a process called "document management".
Rationale for digitization
To preserve analog collections, and to extend the reach of those collections.
Most individual projects and full-scale programs serve a mix of both purposes.
What is digitization?
IFLA definition: it is the conversion of any fixed or analogue media--such as books, journal articles, photos, paintings, microforms--into electronic form through scanning, sampling, or in fact even rekeying.
Core questions underlying digitization
Is the content original and of substantial intellectual quality?
Is it useful in the short and/or long term for research and instruction?
Does it match library collecting interests?
Is the cost in line with the anticipated value?
Does it advance the development of a meaningful organic collection?
Important factors in selecting materials for digitization
Copyright
The intellectual nature of the source materials
Current and potential users
Actual and anticipated nature of use
The format and nature of the digital product
Describing, delivering, and retaining the digital product
Costs and benefits
Keyboarding: very time-consuming and expensive
Scanners
Digital cameras
Scanning can be bitonal, greyscale, or color
The most familiar scanners are flatbed ones (8 x 14 inch paper)
For text, 300 dpi is the minimum quality resolution; some argue for 600 dpi
For images, 1000 x 1000 pixels is good enough
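To make the resolution figures concrete, the sketch below estimates the pixel dimensions and uncompressed size of a scanned page. The function name, the 8.5 x 11 inch page, and the bit depths (1 bit for bitonal, 8 for greyscale, 24 for color) are illustrative assumptions, not values taken from the slides.

```python
def scan_size(width_in, height_in, dpi, bits_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for a scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to whole bytes.
    return width_px, height_px, (total_bits + 7) // 8


# Assumed bit depths for the three scanning modes.
BIT_DEPTHS = {"bitonal": 1, "greyscale": 8, "color": 24}

for mode, bits in BIT_DEPTHS.items():
    for dpi in (300, 600):
        w, h, size = scan_size(8.5, 11, dpi, bits)
        print(f"{mode:9s} {dpi} dpi: {w} x {h} px, {size / 1_000_000:.1f} MB")
```

Doubling the resolution from 300 to 600 dpi quadruples the number of pixels, which is one reason the choice of resolution has a direct effect on storage costs.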
Pixels Per Inch (PPI)
Dots Per Inch (DPI)
(L) Minolta bookscanner; (R) digital camera on stand
Seitz 6x17 Digital shoots at 160 megapixels
Mekel M525 Roll Film Scanner
DRS Microfiche Scanner
Nikon Super Coolscan 9000 ED Film Scanner
Santa Monica Public Library Virtual Reality Tour
Video clips on digitization
Book scanner (semi-automatic)
Kirtas book scanner
Reducing the file size
Compression types
Lossy: higher compression ratio but lower quality (JPEG or JPG)
Lossless: TIFF (Tagged Image File Format) or GIF
Lossy and lossless compression options in JPEG2
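The difference between the two compression types can be illustrated with a toy sketch. It uses Python's standard `zlib` module for the lossless step and, as a stand-in for a lossy codec, simply discards the low four bits of each pixel before compressing. This is not how JPEG works internally; the synthetic image and the bit-dropping step are assumptions chosen only to show that lossless compression restores the original exactly while lossy compression does not.

```python
import zlib

# Synthetic 64 x 64 8-bit greyscale "image" (illustrative data only).
pixels = bytes((x + y + (x * y) % 7) % 256 for y in range(64) for x in range(64))

# Lossless: decompressing returns exactly the original bytes.
lossless = zlib.compress(pixels, 9)
assert zlib.decompress(lossless) == pixels

# Lossy (toy): drop the 4 least significant bits of every pixel, then compress.
quantized = bytes(p & 0xF0 for p in pixels)
lossy = zlib.compress(quantized, 9)
restored = zlib.decompress(lossy)

# The restored image differs from the original by at most 15 grey levels.
max_error = max(abs(a - b) for a, b in zip(pixels, restored))
print(len(pixels), len(lossless), len(lossy), max_error)
```

The discarded detail usually makes the lossy output smaller, but it cannot be recovered later, which is why lossless formats are generally favoured for preservation masters.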
Example from the American Memory Digital Library
Text and image file formats
Text files: PostScript, PDF, HTML
Image formats:
BMP: large file size; not suitable for storage
GIF: for images with a few distinct colours
JPEG: photographs; smaller file size than GIF
TIFF: large photographic images
The most common are: GIF and JPEG
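When files arrive without reliable extensions, the image formats listed above can often be told apart by their first few bytes. The following is a minimal sketch; the function name is hypothetical, and the two-byte BMP check in particular is weak, so a production workflow would rely on a dedicated format identification tool instead.

```python
# Leading byte signatures of the image formats discussed above.
SIGNATURES = [
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
    (b"BM", "BMP"),
]


def identify_image_format(header: bytes) -> str:
    """Guess the image format from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


print(identify_image_format(b"GIF89a\x01\x00\x01\x00"))
```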
Sound and music file formats
WAV audio file format: popular on PCs, but large data files
MIDI: transfers musical information between electronic instruments and computers; small file size
MP3 & MP4: high-quality audio with file compression; small files
Movie file formats
MPEG: ISO standard; extensions beginning with "mp" (e.g. .mpg, .mpeg)
QuickTime: standard for digital media; extensions .qt and .mov
AVI: video format for MS Windows, played by Windows Media Player
RealMedia: extensions .rm and .ram
Example from the American Memory Digital Library
Post-processing of digitized text files
Optical Character Recognition (OCR): character-by-character recognition of the text
It can be automatic but is not entirely reliable
Manual clean-up is necessary
You need an image resolution of at least 300 dpi
Human intervention is useful both before and after the actual recognition
The OCR process
Identification of text and image blocks in the image
Character recognition
Word identification/recognition
Correction
Formatting output
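The correction step can be partly automated by comparing each recognized word against a word list and substituting the closest match. The sketch below uses Python's standard `difflib` module. The vocabulary, the function names, and the 0.8 similarity cutoff are all illustrative assumptions, and real OCR packages may handle correction quite differently; manual review remains necessary.

```python
import difflib

# Hypothetical word list; a real project would use a full dictionary
# or a vocabulary built from the collection itself.
VOCABULARY = ["digital", "library", "preservation", "newspaper", "collection"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    if word.lower() in vocabulary:
        return word
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Apply word-level correction to whitespace-separated OCR output."""
    return " ".join(correct_word(w) for w in text.split())


print(correct_text("d1gital l1brary presewation of the newspaper col1ection"))
```

Words with no close match, such as "of" and "the" here, are left untouched so that the step never replaces text it cannot confidently improve.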
“On January 20, 2001, when George Bush took over the presidency, he also took possession of the White House web site, www.whitehouse.gov. All of the previous content of that site, and its companion searchable document archive, www.pub.whitehouse.gov, were completely wiped clean, replaced with a skeleton site for the new administration. The result was a massive example of 'link rot' in one of the most popular sites on the web. AltaVista reported 170,000 links to the site--many of them 'deep links' (i.e., deep within the hierarchy of a web site)--that were suddenly broken. It is impossible to know how many thousands or millions of personal bookmarks were similarly trashed”.
Digital Preservation: Paradox & Promise, Richard Wiggins (Library Journal), April 15, 2001
Why does preservation matter?
Digital information may be lost for many reasons:
in an organization
Content reorganization
Cessation of sponsorship
Technology obsolescence
Content format obsolescence
Hacking and sabotage
Disaster, whether natural or man-made
Preservation is the aspect of archival management that preserves the content as well as the look and feel of the digital object (Hodge, 2000).
Digital preservation is defined as the managed activities necessary:
1) For the long-term maintenance of a byte stream (including metadata) sufficient to reproduce a suitable facsimile of the original document, and
2) For the continued accessibility of the document contents through time and changing technology (Jantz & Giarlo, D-Lib 2005).
Technological issues
Digital media: fragile and subject to new ways of destruction
Changes in technology
Authenticity
Scale: scalable architectures and procedures to handle huge quantities of data
Platform preservation strategies
Organizational issues
Costs: consider both the quantity and level of access
Expertise: develop preservation competencies and skills
Organizational structures
Selection: challenge of selecting quality information to be preserved by the institution
Legal issues
Intellectual property rights: information as well as software
Complex nature of copyright for electronic materials
Access and security
Business models and licensing
Privacy and confidentiality
Methods of preservation
Migration: the periodic transfer of digital materials from one hardware/software configuration to another, or from one generation of computer technology to the next
Emulation: development of software that performs the functions of obsolete hardware and other software
Refreshing: copying digital files from one storage medium to another storage medium of the same type
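Refreshing is the simplest of the three methods to sketch in code: copy the file to the new medium, then confirm that the byte stream is unchanged by comparing checksums. The example below is a minimal illustration using only the Python standard library. The function names and file name are hypothetical, and temporary directories stand in for the old and new storage media.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy source to destination and verify the byte stream is unchanged."""
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"checksum mismatch after copying {source}")
    return after


# Temporary directories stand in for the old and new storage media.
with tempfile.TemporaryDirectory() as old_medium, tempfile.TemporaryDirectory() as new_medium:
    original = Path(old_medium) / "page_0001.tif"
    original.write_bytes(b"example image bytes")
    copy = Path(new_medium) / original.name
    checksum = refresh(original, copy)
    print(copy.name, checksum[:16])
```

Migration would add a format conversion step between the copy and the check, in which case the checksums are expected to differ and verification has to compare content rather than bytes.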
Practical examples of preservation
Some digital archives have decided to automatically convert all PDFs to page-by-page TIFF images
The OCLC research finding: TIFF would be a preferred choice
Virtual Remote Control at Cornell University
CyberCemetery
Internet Archive
JSTOR (Journal Storage), begun in 1995
An online system for archiving academic journals
It provides full-text searches of digitized (scanned) back issues of several hundred well-known journals, some going back more than 100 years
The image files are maintained in TIFF G4 format
The image files are backed up on CD-ROM and on tape
LOCKSS (Lots of Copies Keep Stuff Safe)
Stanford University (started in 1999)
80 libraries and 50 publishers from around the world use the software
Open source software that provides librarians with an easy and inexpensive way to collect, store, preserve, and provide access to their own local copy of authorized content they purchase
Uses the Open Archival Information System (OAIS) ISO standard
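The principle behind the name can be illustrated with a toy example: if several libraries each hold a copy of the same content, a damaged copy can be detected and repaired by comparing checksums and trusting the majority. The sketch below is only an illustration of that idea and not the actual LOCKSS protocol, which is considerably more elaborate; the function name and library names are hypothetical.

```python
import hashlib
from collections import Counter


def repair_by_majority(copies):
    """Return (repaired_copies, damaged_names) using a majority vote on SHA-256 digests."""
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) // 2:
        raise ValueError("no majority: cannot decide which copy is intact")
    # Take the content of any copy whose digest matches the majority.
    good = next(data for name, data in copies.items() if digests[name] == winner)
    damaged = sorted(name for name, d in digests.items() if d != winner)
    return {name: good for name in copies}, damaged


copies = {
    "library_a": b"article text",
    "library_b": b"article text",
    "library_c": b"article tex\x00",  # simulated damage to one copy
}
repaired, damaged = repair_by_majority(copies)
print(damaged)
```

With only two copies that disagree there is no majority, so the function refuses to guess; this is one informal argument for keeping more than two copies.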
The Canadian experience
The Canadian National Site Licensing Project
64 universities in Canada
Funded as a 3-year pilot project totaling $50M
Awarded by the Canada Foundation for Innovation
Content from more than 2200 scholarly publications and research databases has been made available online to over 650,000 researchers and students at Canadian universities
Digital preservation policies
Digitization: standards, preservation, management
Digital preservation for museums: Recommendations
http://www.chin.gc.ca/English/Digital_Content/Preservation_Recommendations/index.html
CyberCemetery: http://govinfo.library.unt.edu/
Internet Archive: http://www.archive.org/index.php