Digital Library Projects at Bibliotheca Alexandrina




        Noha Adly
        16 January 2006
     Infrastructure and Connectivity
 Network
     – Fiber-optic backbone connecting
         • The 11 floors of the library
         • The BA Conference Center (BACC)
         • The Science Museum and Planetarium
     – FTP (foiled twisted pair) used for horizontal cabling (2200+ outlets)
     – Gigabit Ethernet deployed
     – Leased lines connect remote branches
         • CULTNAT
         • Shallalat
         • Swedish Institute (Anna Lindh Foundation)

 Internet Connectivity
     –   Bandwidth from 10 Mbps to 155 Mbps (STM-1)
     –   Plans for wireless Internet access using Wi-Fi hotspots
     –   Full Internet access through the Internet Cafe
     –   Internet access at the BA Conference Center for journalists and press agents

Noha Adly 06                       Bibliotheca Alexandrina         2
               System Architecture
 [Diagram: External network, Firewall, DMZ, Servers, Network Backbone, PCs]
               Infrastructure Overview
 Public PCs
   Reading Tables              145
   Study Rooms                  85
   OPAC                         26
   Print Servers                12
   Young People                 13
   Children Library              7
   Taha Hussein Lib             10
   Internet Cafe                 7
   Information Literacy Lab      9
   Museums                      16
   Total                       330

 Staff PCs
   FAP                         158
   LIS                         155
   ICT                         230
   CUL                         233
   EXT                          51
   Others                       71
   Total                       898

 Other infrastructure
   74 servers
   2 firewalls
   Corporate antivirus
               Server Room




                                 Services
 Security, VTLS, ERP, MMV, Web, Streaming, Video Conf., Office, Anti-Virus, Backup, Webcasting, Intranet, Email, etc.

               Video Conferencing




Access Control System – Staff




               Ticketing Control System




   ILS – Integrated Library System




         Library Information System
   Web based
   Supports Arabization
   Trilingual interface (Arabic, French, English)
   Integrated with the multimedia system
   Available 24x7
   Tools developed in-house
     – Payment Card System
     – Automated Circulation Overdue Notices
     – Membership system
     – Cataloging Performance Tracking
     – Circulation Reports and Statistics
     – Customized Reports
     – etc.


               BA Website




         Statistics – www.bibalex.org




               Statistics




               ISIS - Mission & Goals
 Mission
      – Initiate, carry out and promote research and development
        activities and projects related to building a universal
        knowledge center
      – Act as an incubator for digital and technological
        projects, promoting and nurturing innovations in
        accordance with BA goals
 Goals
      – Preserve the heritage for future generations
      – Provide universal access to human knowledge


                 Overview
 Internet Archive
 Million Book Project
 UDBE: Universal Digital Book Encoder
 DAR: Digital Asset Repository
 The Digital Modern History of Egypt
   – Gamal Abdel Nasser
   – Description de l’Egypte
 OACIS


               Internet Archive



                     Overview
    Web: 10 billion pages from 1996-2001
    Television: 2000 hours of Egyptian and US TV
    Movies: 1000 archival films
    100 Terabytes of data
    Storage on 200 computers

    The second copy worldwide, after the original in San Francisco


               Access Statistics




 Second Generation Machines: Petabox
 Designed to store and process one
  petabyte (one million gigabytes).
 Features:
   – Low power: 6 kW per rack, and 60 kW for
     the whole system
   – High density: 64 TB per 40U rack
   – Local computing to process the data: 800
     low-end PCs
   – Multiple operating systems
   – Software to automate mirroring
   – Easy maintenance: one system administrator
     per petabyte
   – Inexpensive design
   – Inexpensive storage

               Single Rack Configuration (43U)

 Component          Qty   Specification
 Data Node           80   1.2 TB, 1 GHz, 100 Mbps
 Admin Node           2   1.2 TB, 1 GHz, 2 x 100 Mbps
 Switch               2   48 x 100 Mbps, 2 x 1 Gbps
 Router/Firewall      1   2 x 3 GHz, 2 GB, 4 x 1 Gbps

 All boxes are 1U, except the Router/Firewall, which is 2U




                      Progress
 An agreement with the Internet Archive (IA) for building the
  Petabox has been signed
 Hard disks for 2 petabytes have been purchased
 3700 hard disks are to reach IA by February 1st, 2006
 IA will build the machines and load them with the data of
  the web collections of 2002, 2003, 2004 and 2005
 1300 hard disks will be delivered to BA to be assembled
  locally
 New machines for the 2006 collection will be designed and
  manufactured locally.
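
The disk counts above imply a rough per-disk capacity. The following is a back-of-the-envelope sketch only: it assumes that the 3700 and 1300 disks together make up the whole 2-petabyte purchase, that all disks are the same size, and that decimal units are used. None of these assumptions is stated on the slide.

```python
# Figures quoted on the slide.
TOTAL_CAPACITY_PB = 2
DISKS_SENT_TO_IA = 3700
DISKS_ASSEMBLED_AT_BA = 1300

# Assumption: decimal units (1 PB = 1,000,000 GB) and equal-sized disks.
GB_PER_PB = 1_000_000

total_disks = DISKS_SENT_TO_IA + DISKS_ASSEMBLED_AT_BA
gb_per_disk = TOTAL_CAPACITY_PB * GB_PER_PB / total_disks

print(f"{total_disks} disks, about {gb_per_disk:.0f} GB each")
```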


               Million Book Project



                               Goals
 Long-term: Capture all books in digital format;
 Short-term: Digitize 1 million books by 2007;
 Provide a test bed to support research areas, such as
      –   Scanning techniques;
      –   Optical character recognition;
      –   Intelligent indexing;
      –   Machine translation;
      –   Information retrieval.



                            Partners
 USA
      – Carnegie Mellon University
      – Internet Archive
 China
      –   Beijing University
      –   Chinese Academy of Sciences
      –   Fudan University
      –   Chinese Ministry of Education
      –   Nanjing University
      –   State Planning Commission of China
      –   Tsinghua University
      –   Zhejiang University
 India
  – Indian Institute of Science
  – International Institute of Information Technology, Hyderabad
  – Arulmigu Kalasalingam College of Engineering
  – Goa University
  – Indian Institute of Information Technology, Allahabad
  – Shanmugha Arts, Science, Technology & Research Academy
  – Tirumala Tirupati Devasthanams
  – Maharashtra Industrial Development Corporation
  – University of Pune
  – Anna University
 …….     Now increased to 22 centers
               Digital Lab Workflow




                   Image Processing
 Enhances the quality of the scanned images
      – Removes noise
      – Reduces file size
 Functions performed
      –   Despeckle – removes isolated black pixels
      –   Deskew – detects and removes skew
      –   Crop – removes extra white space
      –   Curvature correction
      –   Removal of margins
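
As an illustration of two of the functions listed above, here is a minimal sketch of despeckling and cropping on a bitonal page. It is not how ScanFix or Photoshop implement these steps. The page representation (a list of rows, 1 for black and 0 for white) and the function names are assumptions made for the example.

```python
from typing import List

Image = List[List[int]]  # assumed representation: 1 = black pixel, 0 = white pixel


def despeckle(img: Image) -> Image:
    """Remove black pixels that have no black pixel among their 8 neighbours."""
    height = len(img)
    width = len(img[0]) if height else 0
    out = [row[:] for row in img]
    for y in range(height):
        for x in range(width):
            if img[y][x] != 1:
                continue
            has_black_neighbour = any(
                img[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                out[y][x] = 0
    return out


def crop(img: Image) -> Image:
    """Trim white rows and columns around the remaining black pixels."""
    rows = [y for y, row in enumerate(img) if any(row)]
    if not rows:
        return []
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # isolated speck
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],  # connected "ink"
    [0, 0, 0, 1, 0],
]
cleaned = crop(despeckle(page))
print(cleaned)
```

Production tools typically work on connected components with a size threshold rather than single pixels; the single-pixel rule here only mirrors the wording "isolated black pixels".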



         Image Processing Procedure
 [Diagram: OTIFF → ScanFix (noise) → PTIFF.X → ScanFix (skew) → PTIFF, with Photoshop (black edge, centering, resize) and ACDSee (compress) steps and recovery paths after each ScanFix stage]




                        OCR - Arabic
     Poses unique challenges
           – Written cursively, with blocks of connected characters
           – A block of characters can have more than one baseline
           – Uses external marks such as dots, Hamza and Madda
           – Diacritization
           – Characters can have more than one shape according to
             their position
           – Overlapping makes it difficult to determine the spacing
     Sakhr Automatic Reader is used
     Tricky with old books
     Requires learning
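
The point about position-dependent shapes can be seen in Unicode itself, which encodes separate presentation forms for each Arabic letter. The sketch below uses Python's standard unicodedata module and the letter BEH as an example; it illustrates the challenge only and says nothing about how the Sakhr engine works. The helper name positional_form is an assumption.

```python
import unicodedata

# The four presentation forms of ARABIC LETTER BEH (U+0628).
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def positional_form(ch: str):
    """Return (position, base letters) from the compatibility decomposition."""
    tag, _, base = unicodedata.decomposition(ch).partition(" ")
    letters = "".join(chr(int(cp, 16)) for cp in base.split())
    return tag.strip("<>"), letters


for form in BEH_FORMS:
    position, base = positional_form(form)
    print(f"U+{ord(form):04X} {unicodedata.name(form)}: {position} form of U+{ord(base):04X}")
```

An OCR engine has to map all four glyph shapes back to the same underlying letter, which is one reason per-font learning matters.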
         Pre-OCR Text Enhancement
 Condition of Arabic printings varies
   – Old/new
   – Light/heavy
   – Solid/dot-matrix
 ScanFix’s smoothing and completion features improve
  recognition accuracy
 Separate from actual processing phase
   – Must be tested under OCR right away
   – OCR specialists have a better feel for “good text”




                  Font Libraries
 Improvement of Arabic OCR results through
   – Tweaking of OCR engine settings
   – Learning
 Libraries for different fonts have been built to achieve
  higher recognition rates
 Databases of character glyphs that describe a particular type
  of script and improve OCR accuracy
 Built on a carefully selected and classified high-variety set
  of scanned images belonging to a batch of about 1000 books
  that boiled down to 15 font groups


                 Font Classification
 Classification criteria:
   – Script type
        • TA: Traditional Arabic
        • AR: Arabic Transparent
        • DT: DecoType Naskh and DecoType Naskh Extensions
   – Printing quality: High (H), Medium (M), and Low (L)
   – Font size: 1 (largest) to 5 (smallest)
 “Group X” – virtual font to tag unclassifiable printings and handwriting
 Minimum accuracy number assigned to each group based on testing
  results
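
Group codes such as TA-M2 combine the three criteria above. A minimal sketch of decoding such a code follows; the function and class names are hypothetical, and the code format (script, hyphen, quality letter, size digit) is inferred from the group names used in this presentation.

```python
from dataclasses import dataclass
from typing import Optional

SCRIPTS = {"TA": "Traditional Arabic", "AR": "Arabic Transparent", "DT": "DecoType Naskh"}
QUALITIES = {"H": "High", "M": "Medium", "L": "Low"}


@dataclass(frozen=True)
class FontGroup:
    script: str
    quality: str
    size: int  # 1 (largest) to 5 (smallest)


def parse_font_group(code: str) -> Optional[FontGroup]:
    """Decode a code like 'TA-M2'; return None for the catch-all group 'X'."""
    if code == "X":
        return None
    script, rest = code.split("-")
    quality, size = rest[0], int(rest[1:])
    if script not in SCRIPTS or quality not in QUALITIES or not 1 <= size <= 5:
        raise ValueError(f"unrecognised font group code: {code}")
    return FontGroup(SCRIPTS[script], QUALITIES[quality], size)


group = parse_font_group("TA-M2")
print(group)
```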




                       Font Groups
 Font     Low Bound   High Point   % Books
 AR-H1     97.70%      99.50%       0.43%
 AR-H2     97.60%      99.50%       3.42%
 AR-H3     97.04%      99.10%       8.53%
 AR-H4     Under construction
 AR-L4     92.70%      96.70%       5.63%
 DT-M1     Under construction
 DT-L2     88.40%      96.80%       7.73%
 TA-H1     97.30%      99.10%       2.03%
 TA-H2     97.60%      99.20%      14.15%
 TA-H3     Under construction
 TA-H4     96.50%      97.74%       2.75%
 TA-L1     94.00%      97.70%       1.81%
 TA-L4     94.00%      97.90%       8.08%
 TA-M2     95.80%      98.80%      28.46%
 TA-M4     94.50%      97.50%      12.58%
 X                                  4.39%
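
One way to read the table is to weight each group's low bound by its share of books. The sketch below does that for the groups that have bounds; the weighting itself is an illustrative assumption and is not a figure reported in the presentation.

```python
# (group, low bound %, share of books %) copied from the table above.
# Groups marked "Under construction" and group X have no bounds and are omitted.
GROUPS = [
    ("AR-H1", 97.70, 0.43), ("AR-H2", 97.60, 3.42), ("AR-H3", 97.04, 8.53),
    ("AR-L4", 92.70, 5.63), ("DT-L2", 88.40, 7.73), ("TA-H1", 97.30, 2.03),
    ("TA-H2", 97.60, 14.15), ("TA-H4", 96.50, 2.75), ("TA-L1", 94.00, 1.81),
    ("TA-L4", 94.00, 8.08), ("TA-M2", 95.80, 28.46), ("TA-M4", 94.50, 12.58),
]

classified_share = sum(share for _, _, share in GROUPS)
weighted_low_bound = sum(low * share for _, low, share in GROUPS) / classified_share

print(f"Share of books in groups with bounds: {classified_share:.2f}%")
print(f"Share-weighted low bound: {weighted_low_bound:.2f}%")
```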
                            Progress
 Five scanning stations since October 2003
 As of January 1st 2006:
   – 22,214 books digitized & processed (6.7 million pages)
   – 15,550 books OCRed (4.6 million pages)
       • 11,101 Arabic books (3.3 million pages)
       • 4,449 Latin books (1.3 million pages)
 Daily Rates
      –   Scanning: ≈ 2000 pages/person
      –   Processing: ≈ 1800 pages/person
      –   Latin OCR: ≈ 4000 pages/person
      –   Arabic OCR: ≈ 1500 pages/person
 The target is to scan and process 5000 pages/day/scanner,
  leading to ≈ 25,000 books/year
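
The yearly target can be cross-checked against the figures on this slide. This is a rough sketch: the average book length is derived from the totals above, while the number of working days per year is an assumption that the slide does not state.

```python
# Figures quoted on the slide.
SCANNERS = 5
TARGET_PAGES_PER_DAY_PER_SCANNER = 5000
BOOKS_DIGITIZED = 22_214
PAGES_DIGITIZED = 6_700_000

pages_per_book = PAGES_DIGITIZED / BOOKS_DIGITIZED
pages_per_day = SCANNERS * TARGET_PAGES_PER_DAY_PER_SCANNER


def books_per_year(working_days: int) -> float:
    """Books per year at the target rate, for an assumed number of working days."""
    return pages_per_day * working_days / pages_per_book


print(f"Average book length: {pages_per_book:.0f} pages")
print(f"Books/year at 300 working days: {books_per_year(300):.0f}")
```

Under these assumptions, roughly 300 working days per year are needed to approach 25,000 books.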

                     Publishing
 Challenges
   – Preservation of layout
   – Searchability of content and metadata
   – Efficient image compression
   – Accommodating low-bandwidth users
   – Easy browsing of books
   – Multipaging
   – Multilingual text support




                     Image-on-Text
 Multilayered:
      – Visible page image
      – Hidden OCR text

 View the exact original layout while
  searching and highlighting

 Supported by only some OCR suites

 Supported formats: DjVu and PDF
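
The idea of searching a hidden text layer and highlighting on the visible image can be sketched as follows. This is a simplified illustration, not the DjVu or PDF mechanism: the Word structure, its coordinate fields and the function name are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    """One OCR word in the hidden layer, with its box on the page image (pixels)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def find_highlights(words: List[Word], query: str) -> List[Tuple[int, int, int, int]]:
    """Return the boxes of all words matching the query, ignoring case."""
    needle = query.casefold()
    return [
        (w.left, w.top, w.width, w.height)
        for w in words
        if w.text.casefold() == needle
    ]


hidden_layer = [
    Word("Bibliotheca", 40, 100, 180, 30),
    Word("Alexandrina", 230, 100, 190, 30),
    Word("bibliotheca", 40, 400, 180, 30),
]
boxes = find_highlights(hidden_layer, "Bibliotheca")
print(boxes)
```

A viewer would draw these rectangles over the page image, so the reader sees the original layout while the search runs on the OCR text.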


           UDBE – Universal Digital Book Encoder
 Conversion: OCR Engine (A, B, C) → OCR Converter (A, B, C) → COF
 Encoding: COF → Format Handler (X, Y) → Target Format (X, Y)

 Built around a Common OCR Format (COF)
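
The two-stage design (engine-specific converters into a common format, then format-specific handlers out of it) can be sketched as below. This is not the UDBE code: every class, function and registry name here is hypothetical, and the toy converter and handler stand in for real OCR engines and DjVu/PDF encoders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CofWord:
    text: str
    x: int
    y: int
    width: int
    height: int


@dataclass
class CofDocument:
    words: List[CofWord]


# Registries: one converter per OCR engine, one handler per target format.
converters: Dict[str, Callable[[str], CofDocument]] = {}
handlers: Dict[str, Callable[[CofDocument], bytes]] = {}


def convert_engine_a(raw: str) -> CofDocument:
    """Toy converter: each input line is 'text,x,y,width,height'."""
    words = []
    for line in raw.splitlines():
        text, x, y, w, h = line.split(",")
        words.append(CofWord(text, int(x), int(y), int(w), int(h)))
    return CofDocument(words)


def handle_plain_text(doc: CofDocument) -> bytes:
    """Toy handler: drops layout and emits the words as UTF-8 text."""
    return " ".join(word.text for word in doc.words).encode("utf-8")


converters["engine_a"] = convert_engine_a
handlers["plain_text"] = handle_plain_text


def encode(engine: str, target: str, raw: str) -> bytes:
    """Conversion stage (engine output to COF), then encoding stage (COF to target)."""
    cof = converters[engine](raw)
    return handlers[target](cof)


result = encode("engine_a", "plain_text", "hello,0,0,10,5\nworld,12,0,10,5")
print(result)
```

With a common intermediate format, supporting N engines and M output formats takes N converters plus M handlers instead of N times M direct translators.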

        Common OCR Format (COF)
 Captures necessary image-on-text document information
 Inspired by DjVuXML and DAFS (Document Attribute Format Specification)
 XML-compliant – simple integration
 [Diagram: document hierarchy with the elements Document, Page, Map, Preference, Metadata, Image, Text, Area, Page, Column, Region, Paragraph, Line, Word, Character]




                         Implementation
 OCR Converter for Automatic Reader:
      – Supports 18 Latin-script languages, Arabic, and Persian
      – Features font learning capabilities
 Format Handlers:
      – DjVu:
               • MRC (mixed raster content) imaging model: high-quality,
                 low-file-size image compression from AT&T Labs
               • Implemented around DjVuLibre and LizardTech’s Document
                 Express
      – PDF:
               • Widely used PostScript-like Portable Document Format from
                 Adobe
               • Implemented in Java based on iText

               UDBE Performance




                   Progress
 A database for the books, metadata and status has
  been designed and implemented.
 The complete cycle of the workflow for producing
  digital books has been automated, and integrated
  with the ILS.
 This work has been extended to accommodate other
  types of materials including slides, maps, images,
  audio and video.




                    DAR
           Digital Assets Repository



                           Goals
 Automating the digitization process

 Integrating the content and metadata of a variety of
  object types into one homogeneous repository

 Preserving and archiving digital media produced by the
  Digital Lab or acquired by the Library in digital format

 Enhancing interoperability and seamless access to the
  Library's digital assets


                        Standards
 Digital objects descriptive metadata
      – VRA Core Categories
      – MARC 21
 Metadata presentation
      – XML
      – MARC format
      – Dublin Core
 Content dissemination
      – OAI-PMH
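
For readers unfamiliar with OAI-PMH, a harvester requests records over HTTP with a verb and a metadata prefix. The sketch below only builds such request URLs using the protocol's standard verbs and the oai_dc (Dublin Core) prefix. The base URL is a placeholder, not the address of the BA repository, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_request(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    return f"{base_url}?{urlencode({'verb': verb, **arguments})}"


# Placeholder endpoint, assumed for illustration only.
BASE_URL = "https://example.org/oai"

url = build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(url)
```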




                  System Architecture
 User Interface: Archiving Tool, Administration Tool, Digitization Client, Encoding Tool, Cataloging Tool, Publishing Interface, OAI Gateway
 DAF/DAK APIs
 Authentication and Authorization Subsystem (users/groups/permissions database)
 Digital Assets Factory (DAF): Digitization Database
 Digital Assets Keeper (DAK): Repository Database
 Integrated Library System: Catalog Database
 Storage Subsystem: Offline Storage, Online Storage




                       Progress
 DAF has been fully deployed for books since March 2004

 In January 2005, support for images and other materials was
  introduced.

 The first version of DAK was deployed in July 2005, with
  some parts still in beta.

 A publishing tool has been implemented with a special
  viewer for digitized assets, and a viewer for books using
  image-on-text technology.



               The Digital Modern
                History Of Egypt


               Gamal Abdel Nasser
                   Collection


               Nasser – Objectives
 Digitize and publish the collection of the eminent Arab and
  Egyptian president Gamal Abdel Nasser

 Provide online access to his collection through a web based
  system mainly intended for research purposes and
  documentation




               Nasser – Collection
 Documents published by the Public Record Office,
  London, UK (53,000+ pages)
 Documents published by the United States Department of
  State (30,000+ pages)
 Over 1,300 speeches, audio and printed
 Over 51,000 photos and 1,000 portraits
 More than 1,000 videos (50+ hours)
 A complete archive of the articles published in the
  newspapers
 The decrees issued by the Revolutionary Command Council
  (RCC)
 The daily news of the President

 Minutes of the Central Committee for Arab Socialist Union
  (ASU)
 140+ handwritten documents with 593 papers
 A complete archive of the "Bisaraha" articles by Mohammed
  Hassanein Haikal
 Caricatures, stamps, coins and plastic arts illustrations
 Books written by and about Nasser
 More than 1,200 national songs
 Over 130 poems



                                Nasser
 The entire collection has been digitized
 Database designed and populated with the
  digital objects and their metadata
 Backend applications
      –   Managing the contents
      –   Categorization
      –   Adding and refining descriptions
      –   Adding keywords
 Integration of all the different information
  sources and media under a single interface
 Front end
      – A web based interface
      – Full text Arabic and English search engine



               Nasser – Website




        Description de l’Egypte



                   Description de l’Egypte
 The work includes
   – 11 plate volumes (950+ pages)
   – 9 text volumes (7500+ pages)
   – Index book
 The volumes recorded
   – Antiquities
   – Modern state
   – Natural history

 They described cities,
  buildings, temples, monuments, arts,
  animals, plants, minerals, society, etc.




                         Digitization
 The complete volumes of plates and text have been fully digitized.




               Processing




                Virtual Browser
 The whole collection has been integrated on a virtual browser
   and made accessible to the public.




                       First Release
 Provide the collection on DVD, in both English and
  French, for the public and for researchers

 Text and images are linked in a
  searchable form
 Published with two versions of the pictures
   – Low resolution for quick browsing
   – High resolution for zooming with dynamic
     loading




  Digitizing of the Botroseyya
               Collection


               Botroseyya – Overview
 This project aims at digitizing the documents pertaining to
  the Botros Ghaly family
 The family has saved a large number of documents related
  to its political role since the late 1800s.
 The project will attempt to
   – digitize the entire multilingual (Arabic, English, French,
      German, Italian and Turkish) collection, and
   – provide it in searchable form for historians, politicians
      and researchers.



               Digitization of
  Mohamed Mahmoud Pasha
                 Collection


           Mohamed Mahmoud Pasha
            Collection – Overview
 This project aims at digitizing the documents pertaining to
  Mohamed Mahmoud Pasha, one of the most famous
  Egyptian Prime Ministers

 The project will attempt to
   – digitize the entire collection of rich and rare historical
     documents and materials that have never been published before
   – provide it in searchable form for historians, politicians
     and researchers.


               Al Hilal Digital
                 Collection


               Al Hilal – Overview
 This project aims to publish an exhaustive digital copy of
  the issues of Al-Hilal since its first publication in 1892
 Al-Hilal is considered the oldest continuously published
  cultural journal in the Arab world, and the only regular
  journal that has been issued for more than a hundred years
 It had a marked effect on the history of the Arab world in
  general and the history of Egypt in particular
 It played a leading role in modernizing Arab intellectual
  thinking, and opened new collaborations towards the
  cultural evolution


               Al Hilal – Progress
 The volumes of years 1 to 50 were completely scanned,
  processed and indexed (about 51,000 pages).
 An application has been implemented for browsing through
  the digital copies with searching facilities. The hierarchy for
  titles and subtitles helps users select the desired issues
 The issues of each decade are to be compiled on a CD
  including necessary browsing and searching tools.




                     OACIS
 Online Access to Consolidated Information on Serials

               OACIS – Mission
 Create a publicly and freely accessible, continuously updated
  listing of Middle East journals and serials, including those
  available in print, microform, and online
 Improve access to Middle Eastern serials in libraries in the
   – United States
   – Europe
   – Middle East
 Make scholarly literature from, and about, the Middle East
  widely and easily available to scholars around the world



                    OACIS – Statistics
Holds: 23,000+ unique title records




               OACIS – BA contribution
 Over 400 records have been uploaded to the database
 23 volumes have been digitized
 Digitized documents have been integrated into the OACIS
  system through a digital viewer
 A content retrieval web interface for the digitized serials has
  been developed
 The OACIS catalog is updated regularly on a
  quarterly basis
 A mirror site of the system has been set up at BA and was released
  on 25 January 2005 (http://oacis.bibalex.org)


               OACIS – Website




               OACIS – Digital Viewer




               OACIS – Search Contents




   Arabic and Middle Eastern
               Electronic Library
                   (AMEEL)


               AMEEL – Overview
 Develop an Arabic and Middle Eastern Electronic Library
  (AMEEL) containing a large collection of significant
  Middle Eastern resources
 Bring together qualified partners to create a Middle East
  electronic library including:
      – Digital representations of traditional materials
      – "Born-digital" contemporary materials
      – A service structure for interlibrary loan
 Build an access portal



               Thank You



				