Metadata Architecture for Digital Libraries Conceptual Framework

Metadata Architecture for Digital Libraries:
Conceptual Framework for Indian Digital Libraries
Madhusudana Rao CR
C-DAC, Bangalore.

              Metadata for DL
• Introduction
• Metadata
• Digital Library Architecture
  – SODA
• Indian Digital Library
  – Background
  – Proposed Architecture
• Conclusion

                    Metadata for DL
• Search Engines - General
• Digital Library - General

                   Metadata for DL
• Information Processing & Retrieval
  –   Typical Library Environment
  –   Library Automation
  –   Networking of Libraries
  –   Digital Library
  –   Digital Library initiatives

                     Metadata for DL
• Digital Library Scene
  – Search Engines
     •   Heterogeneous
     •   Vertical Information Retrieval
     •   Unique User Interface
     •   Search engines are different
     •   Protocols are different
     •   Querying & Ranking
     •   Incompatible across the sources
 – Possible solutions
    •   Identifying the User Group
    •   Identifying the Information Sources
    •   Negotiating with different Information Sources
    •   Resource Description Format
    •   Choose best Information Source to evaluate Query
    •   Evaluate the query at these sources
    •   Merge the Query Results from these sources

                       Metadata for DL
New Protocol
•   User
•   User Query
•   Information Source
•   Networked Environment
•   RDF Metadata
•   User Interface
•   Search & Retrieval
                  Metadata for DL
• Metadata
• Network Protocols
• Possible Solutions for typical environment

                   Metadata for DL

Structured data about data

            Metadata for DL
• Metadata is data that helps to design, create,
  describe, preserve and use information systems
  and resources.
• Metadata can play a role in the development of
  effective, authoritative, interoperable,
  scalable and preservable information and
  record-keeping systems.

                    Metadata for DL
• Information Resource
• Library Catalogue
  – Index, Abstracts, Catalog Records, etc >
    MARC, AACR, LCSH etc.
• Human Generated Textual description
• Machine generated data

                     Metadata for DL
• Content
  – Intrinsic
     • What does it contain?
     • What is it about?
• Context
  – Extrinsic
     • Who, What, Why, Where, How etc.
• Structure
  – Formal Set
• Intrinsic
  – Subject, Title, Author, Publisher, Publication
    place, Other agent, Date, Object type, Form -
    Identifier, Relation, Source, Language,
    Coverage, Abstract, Version, Notes, Signature,
    Classification, keyword

                    Metadata for DL
• Extrinsic
  – System Requirement, Mode of access,
    Availability, Cost, Control, Extent, Encoding
    description, Revision description

                     Metadata for DL
Metadata…for two communities
• Information Generators
• Librarians / Cataloguers

                   Metadata for DL
Metadata… can be
• Information Objects
  – Physical
  – Intellectual Form

                        Metadata for DL
• Typical Physical Library:
  – Catalogue
  – Book Racks
  – Books

                   Metadata for DL
• Electronic Information Environment
  – Users search Metadata
  – Pointers
  – Primary Information available on computer
• Distinction
  – Electronic Environment

                    Metadata for DL
[Figure: Two Communities. Generators of information | Metadata | Libraries & Cataloguers]


                   Metadata for DL
Metadata…can be
•   Need not be digital
•   Is more than the description of an object
•   Can come from a variety of sources
•   Continues to accrue
•   One object's metadata can also be another
    information object's metadata

                    Metadata for DL
Metadata…can be
• Intermediate steps to retrieve content
• Surrogates of objects

                    Metadata for DL
Metadata… need
• The Internet & WWW have witnessed exponential growth
• The need of the hour on the Internet is catalogs
  of some kind
• The Internet/WWW was not designed to catalog
  its contents

                    Metadata for DL
• Resource Description is a Challenge
• Tools are available
• So far, only directory listings of network resources
  and search engines
• Metadata is one of the solutions
• Again, standards are yet to make their impact

                    Metadata for DL
• Increased accessibility
  – Searching > existence of rich and consistent metadata
  – Search across multiple collections
  – Distributed across several repositories

                     Metadata for DL
• Retention of Text
  – Collection of objects
  – Complex interrelationships with people, places,
    movements & events
  – Documenting and maintaining those
  – authenticity, structural and procedural integrity

                     Metadata for DL
• Expanding use
  –   Disseminating digital versions
  –   Geography
  –   Economics
  –   Infinite ways to search information
  –   Retrieve to wider community

                       Metadata for DL
• Multi-versioning
  – variant versions
  – High resolution copy for preservation
  – Low resolution copy for thumbnail image for
    quick reference and network transfers

                     Metadata for DL
• Legal Issues
  – Track many layers of rights and reproduction
  – Privacy
  – Proprietary interests

                    Metadata for DL
• Preservation
  – Generations - H/W & S/W
  – Technical, Descriptive and Preservation data
  – Information objects to remain accessible and
    intelligible over time

                    Metadata for DL
• System improvement and economics
  – Benchmarking
  – Planning new systems

[Figure: metadata cycle. Surviving labels: "Creation & Multi", "Preservation &", "Searching & Retrieval"]


                    Metadata for DL
• For metadata to be useful and cost-effective,
  it is essential that
  – Structure, Semantics and Syntax conform to
  – It captures the essence of sources
  – It uses a distributed metadata model

                    Metadata for DL
• There is no single international standard for
• Different levels: from complex, rich formats to
  simple formats
• Several metadata schemes have been
  proposed for different levels of
                    Metadata for DL
• IAFA templates
• WWW semantic header
• URS (Uniform Resource Citation)
• OCLC InterCat project
• TEI (Text Encoding Initiative)
• Search engine meta tags
• Resource Description Framework (RDF)
• EAD (Encoded Archival Description)
• GILS (Government Information Locator Service)
• Federal Geographic Data Committee
• Museum Educational Site Licensing Project
• Dublin Core
Dublin Core

Because it is simple… yet effective.

              Metadata for DL
Dublin Core..means
• Dublin, Ohio
• International consensus meetings,
  workshops, etc
• Emerging Infrastructure for Internet
• Support Resource Discovery
• Elements represent a broad interdisciplinary
• Core set of elements for DL
Dublin Core..standard
• Comprises 15 core elements
• Consensus by an International, Cross-
  disciplinary group representing
  –   Library & Information
  –   Computer Science
  –   Text Encoding
  –   Museum
  –   Related fields of scholarship
                       Metadata for DL
Dublin Core..standard
• Each of the 15 elements is optional and repeatable
• Each element has a limited set of qualifiers
  and attributes
• Simple DC
• Qualified DC

                   Metadata for DL
Dublin Core..goals
• Simplicity of creation & maintenance
  – Non-specialists can create descriptive records for
    effective retrieval in a networked environment
• Commonly understood semantics
  – "Digital tourist": the non-specialist searcher
  – Convergence of common, more generic
  – Increasing visibility and accessibility
                      Metadata for DL
Dublin Core..goals
• International scope
  – 20 languages
  – Coordinating efforts
  – RDF - WWW
• Technical challenges of Internationalization
  – Multilingual & Multicultural nature of
    electronic information universe

                     Metadata for DL
Dublin Core..goals
• Extensibility
  – Additional resource discovery needs

                    Metadata for DL
Dublin Core..elements
• Content
  – Coverage, Description, type, relation, source,
    subject and title
• Intellectual property
  – Contributor, Creator, Publisher & Rights
• Instantiation
  – Date, Format, Identifier & Language
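
The grouping above can be sketched in code. This is a minimal illustration in Python, assuming a record is stored as a mapping from element name to a list of values (every element optional and repeatable, as stated on the earlier slide). The names `validate_record` and `to_meta_tags` and the sample values are hypothetical; the `DC.` prefix in `<meta>` names is a common convention for embedding Dublin Core in HTML, not something this presentation prescribes.

```python
from html import escape

# The 15 Dublin Core elements, grouped as on the slide above.
DC_GROUPS = {
    "content": ["coverage", "description", "type", "relation",
                "source", "subject", "title"],
    "intellectual_property": ["contributor", "creator", "publisher", "rights"],
    "instantiation": ["date", "format", "identifier", "language"],
}
DC_ELEMENTS = {name for group in DC_GROUPS.values() for name in group}


def validate_record(record):
    """Return the names of invalid entries in a record (empty list if valid).

    Every element is optional and repeatable, so the only checks are that
    each key is one of the 15 element names and maps to a list of strings.
    """
    problems = []
    for name, values in record.items():
        if name not in DC_ELEMENTS:
            problems.append(name)
        elif not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            problems.append(name)
    return problems


def to_meta_tags(record):
    """Render a record as HTML <meta> tags, one tag per value."""
    lines = []
    for name in sorted(record):
        for value in record[name]:
            lines.append('<meta name="DC.%s" content="%s">'
                         % (name, escape(value, quote=True)))
    return "\n".join(lines)


# Hypothetical sample record; the values are illustrative only.
sample = {
    "title": ["Palm leaf manuscript (sample)"],
    "language": ["sa", "kn"],   # a repeated element
    "format": ["image/tiff"],
}
```

Keeping each value in a list makes repetition the default case, so a multilingual title or several creators need no special handling.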

                     Metadata for DL
Dublin Core..implementation
• The Dublin Core web site lists implementations: 15 in North
  America and Mexico, others in Europe, and 12 in Asia
  and Australia

                 Metadata for DL
Digital Library Architecture
• SODA (Smart Objects Dumb Archives)
• STARTS (Stanford Protocol proposal for
  Internet Retrieval and Search)

                  Metadata for DL
Digital Library
• Digital Library Services
  – User
     • Functionality & Interface
  – Searching
  – Browsing
• Archive
  – Managed sets of objects

                       Metadata for DL
Digital Library
• Digital Object
  – Stored and trafficked digital content
     • Simple files,
     • Sophisticated objects

                       Metadata for DL
Digital Library

[Figure: Library Users at the top; Digital Library Services in the middle; Archive 1, Archive 2, … Archive N at the bottom. Surviving labels also include "Objects in" and "out of Archives".]

                                    Metadata for DL
Digital Library.. builds
• Identifying a user group
• Identifying archives holding information of
• Negotiating terms and conditions with
• Creating Indices
• Services such as Search & Browse
                   Metadata for DL
Digital Library.. builds
• Creating User interaction services
  –   Terms & Conditions
  –   Authentication
  –   Billing
  –   Display

                     Metadata for DL
Digital Library.. hindered
• Interoperability
• Object mobility
• Complex archives

                 Metadata for DL
Digital Library..cons
• Digital Libraries are partitioned
  – Discipline - Computer Science, Aeronautics,
    Physics, etc.
  – Format - Technical reports, video, software,
• Interdisciplinary search difficult
• Resource Description includes manuscripts,
  software, data sets etc.
                     Metadata for DL
Digital Library..cons
• Manuscripts Vs Other objects -
• All digital storage and transmission, tight

                    Metadata for DL
• Information generated in several forms
• Differentiated by semantic types (report,
  software, video, data sets etc.)
• Given semantic representation differentiated
  by syntactic representation (PS, PDF,
• Media boundaries exist

                   Metadata for DL
• Archive-independent container construct
• All semantic and syntactic data types
• Objects that are logically grouped together
• Archived & manipulated as a single object
• Several objects can communicate with each other
• Arbitrary network services
                  Metadata for DL
• Traditional functionality associated with
  archives has been pushed down into objects
• Making objects smarter / increasing their responsibility
• Making archives dumber / decreasing their responsibility

                  Metadata for DL
• Archives exist to help the user locate
  the objects
• Once an object is found, the user interacts
  directly with the object

                    Metadata for DL
Smart Objects.. illustration

                 Smart Archives                         Dumb Archives

 Smart objects   SOSA: Smart Objects, Smart Archives    SODA: Smart Objects, Dumb Archives
                 Ex: none                               Ex: NCSTRL+

 Dumb objects    DOSA: Dumb Objects, Smart Archives     DODA: Dumb Objects, Dumb Archives
                 Ex: NCSTRL                             Ex: FTP server

                      Metadata for DL
SODA Model…implementation

           Metadata for DL
• Object oriented containers
• Logically grouped items are
  – Collected
  – Stored
  – Transported as a single unit
• Many forms of same data
• Related & non traditional data (Supportive
                     Metadata for DL
Buckets.. containers
• Multiple packages
• Packages can correspond to semantic types
  –   manuscript, software etc.
  –   metadata
  –   terms and conditions
  –   pointers
• A single package can have several items

                       Metadata for DL
[Figure: a bucket (unique ID) with its access methods. Packages inside the bucket: Terms and Conditions; Metadata (RFC 1807, Dublin Core); files such as .pdf, .tex, .doc; Software (.tar, .c, .java, .asp); Images (.gif, .jpg); Data sets (.xls, .tar). Elements sit inside the packages.]

                                             Metadata for DL
• Unique ID - handle
• Either standalone or multiple repositories
• Standalone - WWW through TCP/IP
• Moderation of number of buckets through
  intelligence and functionality
• Individual buckets may have custom terms
  and conditions
                   Metadata for DL
• A bucket is of arbitrary size
• It has a globally unique ID
• It holds 0 or more components, called packages
• A package contains 1 or more components, called elements
• An element can be a file or a pointer
• Packages and elements can be other buckets
• A package can be a pointer to a remote
  bucket, another package or an element
• Buckets can keep internal logs of actions
• Interactions or communication between
  buckets are made only through defined methods
• Buckets can initiate actions; they do not
  have to wait to be acted on
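
The bucket properties listed above can be sketched as a small data structure. This is only an illustration under stated assumptions, not the SODA implementation: the class and method names (`Bucket`, `Package`, `Element`, `add_package`, `get_element`) are hypothetical, and a random UUID stands in for a real handle.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Element:
    """An element is either a stored file (bytes) or a pointer (e.g. a URL or bucket ID)."""
    name: str
    content: bytes = b""
    pointer: str = ""      # non-empty means the element lives elsewhere


@dataclass
class Package:
    """A package groups elements (e.g. manuscript, software, metadata)."""
    name: str
    elements: Dict[str, Element] = field(default_factory=dict)


class Bucket:
    """Hypothetical bucket: unique ID, packages, an internal log, defined methods."""

    def __init__(self, terms="public"):
        self.bucket_id = uuid.uuid4().hex   # stand-in for a globally unique handle
        self.terms = terms                  # per-bucket terms and conditions
        self._packages = {}                 # 0 or more packages
        self._log = []                      # internal log of actions

    # All interaction goes through the methods below, never through
    # the private attributes directly.
    def add_package(self, name):
        self._packages[name] = Package(name)
        self._log.append(("add_package", name))

    def add_element(self, package, element):
        self._packages[package].elements[element.name] = element
        self._log.append(("add_element", package + "/" + element.name))

    def list_packages(self):
        self._log.append(("list_packages", ""))
        return sorted(self._packages)

    def get_element(self, package, name):
        self._log.append(("get_element", package + "/" + name))
        return self._packages[package].elements[name]

    def log(self):
        return list(self._log)
```

Because every access is a method call, the bucket itself can log actions or enforce its own terms and conditions, which is the sense in which the object is "smart" and the archive can stay "dumb".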
                   Metadata for DL
Traditional Vs Bucket repository
[Figure: two stacks side by side. Traditional repository: User, Repository Interface, Repository intelligence, Archived objects. Bucket repository: User, Repository Interface, Optional intelligence, Archived Buckets.]

                            Metadata for DL


[Figure labels: User; Index holdings; Search/retrieve; Display holdings]

               Metadata for DL
• Author Tool
  –   Metadata
  –   Adds packages
  –   Adds elements to package
  –   Selects applicable clusters
  –   Terms and conditions

                       Metadata for DL
• Management Tool
  – Interface
  – Query and update buckets
• Bucket Matching System
  –   SDI
  –   Find similar works by different authors
  –   Arbitrary SDI
  –   Metadata scrubbing

             Metadata for DL
• Stanford Digital Library Project
• Search Engine Vendors

                   Metadata for DL
• Document Sources
  – Internal networks
  – Internet
• Source Contents
  – Hidden behind search interfaces
• Algorithms/Protocols are different

                    Metadata for DL

• Large number of resources
• Each resource consists of one or more sources
• A source is a collection of files
• A source accepts queries from clients and produces results
• Sources may be small or large
• Extract the source list from resources
  periodically
• Extract metadata and content summaries
  from sources periodically
• Send the query to a source at a resource
• Communicate with promising resources
• Results come from multiple sources; merge
  them & return them to the user

                   Metadata for DL
STARTS..Query language
• Filter expression
  – Boolean nature
  – Defines documents
• Ranking expression
  – Associates score with documents

                      Metadata for DL
STARTS..Query language
• L-strings
  – language-country
  – string behavior
• Atomic Terms
  – Fields
  – Modifiers
• Complex filter expression
  – and, or, and-not, prox etc
                     Metadata for DL
STARTS..Query language
• Complex ranking expressions
• Global settings
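
To make the split between filter and ranking expressions concrete, here is a minimal sketch in Python. It does not use the actual STARTS syntax: the nested-tuple representation, the function names and the weighted-sum scoring are assumptions made for illustration, and `prox`, modifiers and L-strings are left out.

```python
def term_matches(doc, field, word):
    """Atomic term: does `word` occur in the given field of the document?"""
    return word.lower() in doc.get(field, "").lower().split()


def eval_filter(doc, expr):
    """Evaluate a Boolean filter expression against one document.

    expr is either ("term", field, word) or (op, left, right)
    with op in {"and", "or", "and-not"}.
    """
    op = expr[0]
    if op == "term":
        return term_matches(doc, expr[1], expr[2])
    left = eval_filter(doc, expr[1])
    right = eval_filter(doc, expr[2])
    if op == "and":
        return left and right
    if op == "or":
        return left or right
    if op == "and-not":
        return left and not right
    raise ValueError("unknown operator: %r" % (op,))


def score(doc, ranking):
    """Ranking expression: sum the weights of the terms that occur in the document."""
    return sum(weight for (field, word, weight) in ranking
               if term_matches(doc, field, word))


def run_query(docs, filter_expr, ranking):
    """Keep documents that pass the filter, then order them by score (highest first)."""
    hits = [(score(d, ranking), d["id"]) for d in docs if eval_filter(d, filter_expr)]
    return sorted(hits, key=lambda h: (-h[0], h[1]))
```

The filter decides which documents qualify at all; the ranking expression only orders the documents that qualified.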

                 Metadata for DL
STARTS..Merging ranks
• Unnormalized score of the document for
  each query
• ID of the sources where document appears
• Statistics
  – Term-frequency, Term-weight, Document-
    frequency, Document-size, Document-count
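
Unnormalized scores from different sources are not directly comparable, which is why the statistics listed above travel with the results. The sketch below shows one way a metasearcher could use them: it recomputes a score for every hit from term frequency, document size, document frequency and document count. The tf-idf style formula, the input layout and the name `merge_results` are assumptions for illustration, not something STARTS prescribes.

```python
import math


def merge_results(source_results, query_terms):
    """Merge per-source result lists into one ranking.

    source_results maps a source ID to a dict with:
      "doc_count": number of documents in that source
      "doc_freq":  {term: number of documents in the source containing term}
      "hits":      list of {"id": ..., "size": ..., "tf": {term: count}}
    Scores are recomputed from the statistics, so the sources' own
    unnormalized scores are not needed here.
    """
    total_docs = sum(s["doc_count"] for s in source_results.values())
    global_df = {
        t: sum(s["doc_freq"].get(t, 0) for s in source_results.values())
        for t in query_terms
    }

    merged = []
    for source_id, s in source_results.items():
        for hit in s["hits"]:
            score = 0.0
            for t in query_terms:
                tf = hit["tf"].get(t, 0) / max(hit["size"], 1)
                idf = math.log((1 + total_docs) / (1 + global_df[t]))
                score += tf * idf
            merged.append((score, source_id, hit["id"]))
    # Highest score first; ties broken by source ID and document ID.
    merged.sort(key=lambda m: (-m[0], m[1], m[2]))
    return merged
```

Pooling the document frequencies across sources gives every hit the same notion of how rare a term is, so a small specialised source is not unfairly favoured or penalised.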

                   Metadata for DL
STARTS..Source metadata
• Properties of the source
  – Fields supported, score range, linkage etc.
• Content Summary of the source
  – List of words that appear in the source
  – statistics of each word listed
  – total documents in the list etc.
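
A content summary like the one described above lets a client pick promising sources before sending any query. The following is a rough sketch, assuming each summary carries a document count and per-word document frequencies; the scoring rule (the fraction of documents containing each query term, summed) and the names `rank_sources` and `select_sources` are hypothetical.

```python
def rank_sources(summaries, query_terms):
    """Order sources by how promising their content summary looks for a query.

    summaries maps a source ID to
      {"doc_count": int, "doc_freq": {word: documents containing word}}.
    The estimate is the sum, over query terms, of the fraction of the
    source's documents that contain the term.
    """
    ranked = []
    for source_id, summary in summaries.items():
        n = max(summary["doc_count"], 1)
        estimate = sum(summary["doc_freq"].get(t, 0) / n for t in query_terms)
        ranked.append((estimate, source_id))
    ranked.sort(key=lambda r: (-r[0], r[1]))
    return ranked


def select_sources(summaries, query_terms, k=2):
    """Return the IDs of the k most promising sources with a non-zero estimate."""
    return [sid for est, sid in rank_sources(summaries, query_terms)[:k] if est > 0]
```

Only the selected sources then need to evaluate the query, which keeps network traffic low when the number of sources is large.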

• General Search Engines
  – Gathers all documents on the network
  – Gathers metadata about collections
  – Selects small set of collections
  – Search & retrieve

                    Metadata for DL
• Alexandria Digital Library

                   Metadata for DL
• Text only

              Metadata for DL
Indian Digital Library..
•   Ancient & diverse culture
•   5,000-year-old culture
•   Largest democracy
•   Seventh-largest country
•   High population
•   Illiteracy
•   Important part of the world economy
                   Metadata for DL
Indian Digital Library..
•   World’s largest middle class
•   Poverty
•   Highly skilled manpower
•   Generates Research Oriented Information
•   Global interest
•   Major players in IT in the World
•   World is looking for ancient Indian Culture
                     Metadata for DL
Indian Scene..IT
• Content is lacking
• Control of Indian literature (both bibliographic
  and full text) is sketchy in almost all fields
• DL on Indian Heritage
• World Wide accord for Indian Heritage
• Internet Religion is the hot attraction
                    Metadata for DL
Indian Scene.. IT
• Western research has been done on the Vedas,
  Upanishads, Shastras, philosophy etc., but
  the soul is missing
• Protection, Preservation, Study, Research,
  Propagation for posterity
• Knowledge Presentation

                   Metadata for DL
Indian Scene.. IT
•   Speech recognition
•   OCR
•   Machine translation
•   NL interfaces
•   Text Processing through Index,
    Concordance, Thesauri, Dictionaries

                    Metadata for DL
Indian Scene.. IT
• National Integration, Guide Humanity,
  Conflicts, Aberrations, intolerance etc
• Value based system
• Historic priceless manuscripts

                   Metadata for DL
Indian Heritage
•   Indian Art
•   Indian Paintings
•   Indian Sculpture
•   Religion

                       Metadata for DL
Proposed Architecture….
• Background
  – User Group
     • Skilled & Illiterates
     • Oral tradition still exists
     • Multilingual
  – Information Sources
     • Content is lacking
     • Literature Control both Bibliographic and Text is
       very weak
                         Metadata for DL
Proposed Architecture….
    • Media
         – Computer Generated files to Palm leaf manuscripts
    •   Language
    •   lack of standards for communication
    •   Geographical boundaries
    •   Accessibility
    •   Reaching rural population
 – Publishing
    • Restricted to regional and local

                          Metadata for DL
Proposed Architecture….
   • National initiatives are yet to take off
   • Cooperative publishing is lacking
   • Unicode/Universal protocol are yet to make their impact
 – Network Resources
   • Communication infrastructure exists but not stable
   • Individuals, Organizations, local, regional are
     generators of sources
   • Loose networks - manpower & infrastructure
   • Lack of communication standards
   • Duplicate works
                     Metadata for DL
Proposed Architecture….
 – Need of Networked Information Sources
    • Much priceless knowledge has been lost or is being lost
    • Future generations are missing the value of life told by
    • Protection, Preservation, Study, Research,
      Propagation for posterity
 – Looking for future
    • NII
    • Better CCC, Computer, Communication, Content

                       Metadata for DL
Hybrid Architecture….
• Combination of SODA & STARTS
  – From SODA - Bucket Architecture
  – From STARTS - Search and Retrieval protocol
• Metadata - Dublin Core
  – For its simplicity and popularity

                     Metadata for DL
Bucket Architecture….
• Buckets are logically grouped
  – Language, Region, Content, Media, Images,
    etc. (any combination or together as intelligent)
• Large archives have buckets with many
  different functionalities
• Bucket may contain resources, packages,
  elements, metadata, pointers, etc.

                     Metadata for DL
Bucket Architecture….
• A bucket may be a unique entity, or many
  buckets may form an entity
• A bucket may be standalone with the content
• Many buckets may become a resource
• Each bucket is built with some
  degree of intelligence and functionality
• Includes an author tool and a management tool
                  Metadata for DL
Bucket Architecture….
• Similarly user’s buckets are also created
• Bucket matching may take place
• Interactions with packages or elements are
  made only through defined methods on a
• Bucket can initiate actions
• Buckets can exist inside or out of a
                   Metadata for DL
STARTS Architecture….
• Search, Retrieval and Browse within Bucket
• Resources, Sources, Elements, Packages,
  Pointers, etc. based on the Bucket definition
• Search query is made within the source
  defined in Bucket
• Query may be within the bucket or across
  the bucket based on the definition and
                   Metadata for DL
STARTS Architecture….
• Ranking is done within the source
• Matching is done with User’s Bucket
• Results displayed based on Ranking and
  user’s requirements
• Although STARTS uses Z39.50 for
  metadata & transfer protocol, we propose to
  use Dublin Core for metadata
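
The hybrid idea above (search within buckets, rank within the source, then match against the user's bucket) can be sketched as follows. This is an illustrative assumption about how the pieces might fit together, not the proposed system itself: the input layout, the word-count ranking and the language-based matching rule are all hypothetical.

```python
def search_buckets(buckets, word, user_profile):
    """Search the Dublin Core metadata of each bucket, then match against a user's bucket.

    buckets: list of {"id": str, "dc": {element: [values]}}
    user_profile: {"language": [preferred language codes]}  (hypothetical user's bucket)
    """
    results = []
    for bucket in buckets:
        dc = bucket["dc"]
        # Rank within the bucket: count occurrences of the word in title and subject.
        text = " ".join(dc.get("title", []) + dc.get("subject", [])).lower().split()
        score = text.count(word.lower())
        if score == 0:
            continue
        # Bucket matching: does the bucket share a language with the user's bucket?
        matches_user = bool(set(dc.get("language", []))
                            & set(user_profile.get("language", [])))
        results.append((matches_user, score, bucket["id"]))
    # Buckets that match the user's bucket come first, then higher scores.
    results.sort(key=lambda r: (not r[0], -r[1], r[2]))
    return [bucket_id for _, _, bucket_id in results]
```

Putting the language match ahead of the score reflects the multilingual user group described earlier; a different deployment could weight the two differently.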
                   Metadata for DL
New Protocol..
•   Need to create a standard for communication
•   Information processing and retrieval
•   The feel of a universal information source
•   Many sources converge as one resource
•   Global information resource
•   Universal accessibility through a unified protocol
•   Global access
                     Metadata for DL
New Protocol..
• The framework is just beginning

                  Metadata for DL
