Digitization and Preservation

Outline
- Digitization
- Selection of materials
- Digitization methods
- File formats (text, image, sound and video)
- Preservation issues
- Methods of preservation
- Examples of preservation

Digitization
- Digital library materials:
  - Born-digital materials
  - Digitized materials
- Many libraries have a history of digitizing materials such as newspapers, a process called "document management"

Rationale for digitization
- To preserve analog collections
- To extend the reach of those collections

Most individual projects and full-scale programs serve a mix of both purposes.

What is digitization?


IFLA definition: it is the conversion of any fixed or analogue media--such as books, journal articles, photos, paintings, microforms--into electronic form through scanning, sampling, or in fact even rekeying.


Core questions underlying digitization
- Is the content original and of substantial intellectual quality?
- Is it useful in the short and/or long term for research and instruction?
- Does it match library collecting interests?
- Is the cost in line with the anticipated value?
- Does it advance the development of a meaningful organic collection?

Important factors in selecting materials for digitization
- Copyright
- The intellectual nature of the source materials
- Current and potential users
- Actual and anticipated nature of use
- The format and nature of the digital product
- Describing, delivering, and retaining the digital product
- Costs and benefits

Digitization methods
- Keyboarding: very time-consuming and expensive
- Scanners
- Digital cameras

Scanning
- Scanning can be bitonal, greyscale, or color
- The most familiar scanners are flatbed ones (8 x 14 inch paper)
- For text, 300 dpi is the minimum quality resolution; some argue for 600 dpi
- For images, 1000 x 1000 pixels is good enough

Pixels Per Inch (PPI)

Dots Per Inch (DPI)
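The resolution figures above translate directly into pixel dimensions and raw file sizes. The sketch below works this out for the 8 x 14 inch flatbed page at 300 and 600 dpi. The bit depths (1, 8 and 24 bits per pixel) are common conventions assumed for illustration, not values given in the slides.

```python
# Bits per pixel for the three scanning modes named above.
# ASSUMPTION: 1/8/24 bits are typical depths, not figures from the slides.
BITS_PER_PIXEL = {"bitonal": 1, "greyscale": 8, "color": 24}


def scan_pixels(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page scanned at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_in, height_in, dpi, mode):
    """Estimate the raw image size in bytes, before any compression."""
    width_px, height_px = scan_pixels(width_in, height_in, dpi)
    total_bits = width_px * height_px * BITS_PER_PIXEL[mode]
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    for dpi in (300, 600):
        width_px, height_px = scan_pixels(8, 14, dpi)
        print(f"{dpi} dpi: {width_px} x {height_px} pixels")
        for mode in BITS_PER_PIXEL:
            size_mb = uncompressed_bytes(8, 14, dpi, mode) / 1_000_000
            print(f"  {mode:<9} about {size_mb:.1f} MB uncompressed")
```

Doubling the resolution from 300 to 600 dpi quadruples the pixel count, which is one reason the choice of resolution matters for storage planning.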


[Image: (L) Minolta book scanner; (R) digital camera on stand]

[Image: Kirtas Bookscan]

Digital cameras
[Images: Canon 12.8 MP; Canon 10 MP; Seitz 6x17 Digital, which shoots at 160 megapixels]

[Images: Mekel M525 Roll Film Scanner; DRS Microfiche Scanner]

[Image: Nikon Super Coolscan 9000 ED Film Scanner]

3-D scanners
- Santa Monica Public Library Virtual Reality Tour

Video clips on digitization
- Book scanner (semi-automatic)
- Kirtas book scanner

Image compression
- Reducing the file size
- Compression types:
  - Lossy: higher compression ratio but lower quality (JPEG or JPG)
  - Lossless: TIFF (Tagged Image File Format) or GIF
  - Lossy and lossless compression options in JPEG 2000

Example from the American Memory Digital Library
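The difference between the two compression types can be seen with a small experiment. The sketch below is only an illustration under stated assumptions: it uses zlib (DEFLATE) as a stand-in lossless compressor and crude grey-level quantization as a stand-in for a lossy step. It is not how JPEG, TIFF or GIF work internally, and the test image is synthetic.

```python
import random
import zlib


def make_test_image(width=256, height=64, seed=0):
    """Hypothetical greyscale image: one noisy gradient row, repeated."""
    rng = random.Random(seed)
    row = bytes(min(255, max(0, x + rng.randint(-3, 3))) for x in range(width))
    return row * height


def quantize(pixels, step=16):
    """Lossy step: snap every grey value down to a multiple of `step`."""
    return bytes((p // step) * step for p in pixels)


def compare():
    """Compress the same image losslessly and after a lossy step."""
    original = make_test_image()
    lossless = zlib.compress(original, 9)
    restored = zlib.decompress(lossless)
    coarse = quantize(original)          # detail is discarded here for good
    lossy = zlib.compress(coarse, 9)
    return {
        "raw_bytes": len(original),
        "lossless_bytes": len(lossless),
        "lossy_bytes": len(lossy),
        "lossless_is_exact": restored == original,
        "lossy_is_exact": coarse == original,
    }


if __name__ == "__main__":
    for key, value in compare().items():
        print(f"{key}: {value}")
```

The lossless round trip returns the original bytes exactly, while the quantized version compresses further but can never be restored to the original.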

Text and image file formats
- Text files: PostScript, PDF, HTML
- Image formats:
  - BMP: large file size; not suitable for storage
  - GIF: for images with a few distinct colours
  - JPEG: photographs; smaller file size than GIF
  - TIFF: large photographic images

The most common are GIF and JPEG.
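When files arrive without reliable extensions, most of the formats listed above can be recognized from their first few bytes. The sketch below checks well-known file signatures; the function name `sniff_format` is illustrative, and HTML is left out because it has no fixed signature.

```python
# Leading byte signatures ("magic numbers") for the formats listed above.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"%!PS", "PostScript"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),  # little-endian byte order
    (b"MM\x00*", "TIFF"),  # big-endian byte order
    (b"BM", "BMP"),
]


def sniff_format(header):
    """Guess a file format from the first bytes of its content."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def sniff_file(path):
    """Read just enough of a file to apply sniff_format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(16))


if __name__ == "__main__":
    print(sniff_format(b"GIF89a\x01\x00\x01\x00"))
    print(sniff_format(b"%PDF-1.4\n"))
```

A signature check only suggests the container format; it does not confirm that the rest of the file is valid.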

Sound and music file formats
- WAV audio file format: popular on PCs, but large data files
- MIDI: transfers musical information between electronic instruments and computers; small file size
- MP3 & MP4: high-quality audio with file compression; small files
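A rough calculation shows why uncompressed WAV files are large compared with compressed audio. The sketch below assumes CD-quality PCM (44,100 Hz, 16-bit, stereo) and a 128 kbit/s compressed stream; both settings are illustrative assumptions rather than figures from the slides.

```python
def pcm_wav_bytes(seconds, sample_rate=44_100, bits_per_sample=16, channels=2):
    """Audio data size of an uncompressed PCM WAV file (header excluded).

    ASSUMPTION: the defaults describe CD-quality audio.
    """
    bytes_per_frame = (bits_per_sample // 8) * channels
    return int(seconds * sample_rate) * bytes_per_frame


def compressed_bytes(seconds, bitrate_kbps=128):
    """Approximate size of a constant-bitrate compressed stream.

    ASSUMPTION: 128 kbit/s is a hypothetical encoder setting.
    """
    return int(seconds * bitrate_kbps * 1000 / 8)


if __name__ == "__main__":
    wav = pcm_wav_bytes(60)
    mp3 = compressed_bytes(60)
    print(f"One minute of WAV: {wav:,} bytes")
    print(f"One minute at 128 kbit/s: {mp3:,} bytes")
    print(f"Ratio: about {wav / mp3:.0f} to 1")
```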


Movie file formats
- MPEG: ISO standard; extension "mp."
- QuickTime: standard for digital media; extensions: .qt and .mov
- AVI: video format for MS Windows, played by Windows Media Player
- RealMedia: extensions: .rm and .ram

Example from the American Memory Digital Library

Post-processing of digitized text files
- Optical Character Recognition (OCR): a character-by-character recognition of the text
- It can be automatic but is not entirely reliable
- Manual cleaning up is necessary
- You need an image resolution of at least 300 dpi
- Human intervention is useful both before and after the actual recognition

The OCR process
- Identification of text and image blocks in the image
- Character recognition
- Word identification/recognition
- Correction
- Formatting output
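The correction step in the list above can be partly automated before a person reviews the text. The sketch below is one simple approach, assumed here for illustration: each unrecognized word is replaced by its closest match in a small vocabulary, using Python's standard difflib module. The vocabulary and the 0.8 similarity cutoff are hypothetical choices, and punctuation handling is left out.

```python
import difflib

# ASSUMPTION: a tiny illustrative word list; a real project would load a
# full dictionary suited to the collection.
VOCABULARY = {"digitization", "preservation", "library", "of", "the", "materials"}


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    lowered = word.lower()
    if lowered in vocabulary:
        return word
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, vocabulary=VOCABULARY):
    """Apply correct_word to every whitespace-separated token."""
    return " ".join(correct_word(token, vocabulary) for token in text.split())


if __name__ == "__main__":
    print(correct_text("digitizatlon of llbrary materials"))
```

Words with no close match are left untouched, so they remain visible for the manual clean-up pass.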



Preservation
"On January 20, 2001, when George Bush took over the presidency, he also took possession of the White House web site, www.whitehouse.gov. All of the previous content of that site, and its companion searchable document archive, www.pub.whitehouse.gov, were completely wiped clean, replaced with a skeleton site for the new administration. The result was a massive example of 'link rot' in one of the most popular sites on the web. AltaVista reported 170,000 links to the site--many of them 'deep links' (i.e., deep within the hierarchy of a web site)--that were suddenly broken. It is impossible to know how many thousands or millions of personal bookmarks were similarly trashed."
Richard Wiggins, "Digital Preservation: Paradox & Promise," Library Journal, April 15, 2001

Why does preservation matter?
Digital information may be lost for many reasons:
- Changes in an organization
- Content reorganization
- Cessation of sponsorship
- Technology obsolescence
- Content format obsolescence
- Hacking and sabotage
- Disaster, whether natural or man-made

Definition
- Preservation is the aspect of archival management that preserves the content as well as the look and feel of the digital object (Hodge, 2000).
- Digital preservation is defined as the managed activities necessary:
  1) For the long-term maintenance of a byte stream (including metadata) sufficient to reproduce a suitable facsimile of the original document, and
  2) For the continued accessibility of the document contents through time and changing technology (Jantz & Giarlo, D-Lib 2005).

Preservation issues
- Technological Issues
- Organizational Issues
- Legal Issues

Technological issues
- Digital media: fragile and subject to new forms of destruction
- Changes in technology
- Authenticity
- Scale: scalable architectures and procedures to handle huge quantities of data
- Platform preservation strategies
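One common way to address the fragility and authenticity concerns above is fixity checking: record a checksum for every file when it enters the archive, then recompute the checksums later to detect silent changes. The sketch below is a minimal version using SHA-256 from Python's standard library; the function names and the in-memory manifest are illustrative assumptions, not a specific repository's workflow.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's relative path to its checksum at ingest time."""
    folder = Path(folder)
    return {
        item.relative_to(folder).as_posix(): sha256_of(item)
        for item in sorted(folder.rglob("*"))
        if item.is_file()
    }


def changed_files(folder, manifest):
    """List manifest entries that are now missing or whose content differs."""
    folder = Path(folder)
    problems = []
    for name, expected in manifest.items():
        target = folder / name
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(name)
    return problems
```

A matching checksum shows only that the bytes are unchanged since the manifest was built; it says nothing about whether the file was authentic at that point.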


Organizational Issues
- Costs: consider both the quantity and level of access
- Expertise: develop preservation competencies and skills
- Organizational structures
- Selection: challenge of selecting quality information to be preserved by the institution

Legal Issues
- Intellectual property rights: information as well as software
- Complex nature of copyright in electronic materials
- Access and security
- Business models and licensing
- Privacy and confidentiality

Methods of preservation
- Migration: the periodic transfer of digital materials from one hardware/software configuration to another, or from one generation of computer technology to the next
- Emulation: development of software that performs the functions of obsolete hardware and other software
- Refreshing: copying digital files from one storage medium to another storage medium of the same type
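Of the three methods, refreshing is the simplest to illustrate. The sketch below copies a file to a new location and then confirms that the copy is byte-for-byte identical before the old medium could be retired. The function name `refresh` and the paths passed to it are hypothetical; migration and emulation involve format conversion or dedicated software and are not covered here.

```python
import filecmp
import shutil
from pathlib import Path


def refresh(source, destination):
    """Copy `source` to `destination` and verify the copy byte for byte.

    Returns True only if the new copy matches the original exactly.
    """
    source = Path(source)
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)  # copy2 also keeps timestamps where possible
    # shallow=False forces a comparison of file contents, not just metadata
    return filecmp.cmp(source, destination, shallow=False)
```

Verifying the copy before discarding the old medium matters because a failed refresh would otherwise go unnoticed.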

Practical examples of preservation
- Some digital archives have decided to automatically convert all PDFs to page-by-page TIFF images
- The OCLC research finding: TIFF would be a preferred choice
- Virtual Remote Control at Cornell University
- CyberCemetery
- Internet Archive

JSTOR
- JSTOR (Journal Storage) began in 1995
  - An online system for archiving academic journals
  - It provides full-text searches of digitized (scanned) back issues of several hundred well-known journals, some going back more than 100 years
- The image files are maintained in TIFF G4 format
- The image files are backed up on CD-ROM and on tape

LOCKSS (Lots of Copies Keep Stuff Safe)
- Stanford University (started in 1999)
- 80 libraries and 50 publishers from around the world use the software
- Open source software that provides librarians with an easy and inexpensive way to collect, store, preserve, and provide access to their own local copy of authorized content they purchase
- Uses the Open Archival Information System (OAIS) ISO standard
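The idea behind the name, that many independent copies protect against damage to any one of them, can be shown with a toy example. The sketch below compares checksums across copies and repairs the odd one out from the majority. It is a deliberately simplified illustration and an assumption on our part, not the actual LOCKSS software or its protocol.

```python
import hashlib
from collections import Counter


def majority_copy(copies):
    """Return the content held by a strict majority of copies, else None."""
    if not copies:
        return None
    digests = [hashlib.sha256(content).hexdigest() for content in copies]
    winner, votes = Counter(digests).most_common(1)[0]
    if votes * 2 <= len(copies):
        return None  # no strict majority, so nothing can be trusted
    return copies[digests.index(winner)]


def repair(copies):
    """Replace every copy with the majority content, if a majority exists."""
    good = majority_copy(copies)
    if good is None:
        return list(copies)
    return [good for _ in copies]


if __name__ == "__main__":
    intact = b"Volume 12, issue 3, full text"
    damaged = b"Volume 12, issue 3, full t\x00xt"
    print(repair([intact, damaged, intact]) == [intact] * 3)
```

With only two copies that disagree there is no majority, which is why the approach depends on having lots of copies.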

The Canadian experience
- The Canadian National Site Licensing Project
  - 64 universities in Canada
  - Funded as a 3-year pilot project totaling $50M
  - Awarded by the Canada Foundation for Innovation
- Content from more than 2200 scholarly publications and research databases has been made available online to over 650,000 researchers and students at Canadian universities

Resources
- Digital preservation policies
  http://www.nla.gov.au/padi/topics/172.html
- Digitization: standards, preservation, management
  http://www.collectionscanada.ca/cidl/040021-801-e.html
- Digital preservation for museums: Recommendations
  http://www.chin.gc.ca/English/Digital_Content/Preservation_Recommendations/index.html
- CyberCemetery: http://govinfo.library.unt.edu/
- Internet Archive: http://www.archive.org/index.php