					Coalition for Networked Information Task Force Meeting, April 2005, Washington, DC




       A Radioactive Metadata Record
       Approach for Interoperability Testing
       Use of Special Diagnostic Records in the
       Context of Z39.50 and Online Library Catalogs


                                                                                        William E. Moen
                                                                                        <wemoen@unt.edu>
                                                                School of Library and Information Sciences
                                                                       Texas Center for Digital Knowledge
                                                                                  University of North Texas
                                                                                        Denton, TX 76203
       Overview
        Interoperability
        Radioactive MARC records and their power
        MARC content designation utilization




Moen                CNI Task Force Meeting -- Washington, DC -- April 2005   2
                           IMLS funded projects
          Z39.50 Interoperability Testbed, Phases 1 & 2
              Improve Z39.50 semantic interoperability among libraries for
               information access and resource sharing
              Establish and operate a testbed for interop testing of Z39.50
               clients and servers with library catalogs (Phase 1)
              Explore alternative approach using radioactive MARC records
               (Phase 2)
          MARC Content Designation Utilization
              Provide empirical evidence of MARC content designation use
              Explore the evolution of MARC content designation
              Develop methodological approach to understand the factors
               contributing to current levels of MARC content designation use

       Factors affecting interoperability
          Multiple and disparate systems
              Information retrieval systems, search functionality, etc.
          Multiple protocols
              Z39.50, HTTP, SOAP, SRW/U, etc.
          Multiple data formats, syntax, metadata schemes
              MARC 21, UNIMARC, XML, ISBD/AACR2-based, Dublin Core
          Multiple vocabularies, ontologies, disciplines
              LCSH, MeSH, AAT
          Multiple languages, multiple character sets
          Indexing, word normalization, and word extraction policies
       Levels of Z39.50 interoperability
          Low-level protocol (syntactic)
              Do Z-clients and Z-servers exchange protocol messages
               according to the standard?
          High-level protocol (functional)
              Do Z-clients and Z-servers support the appropriate Z39.50
               services for user tasks?
          Semantic level
              Can Z-clients, Z-servers, and local IR systems preserve and act
               on the meaning of IR tasks?
          User Task level
              Do systems support IR tasks of one or more user groups?



       Threats to interoperability
        Differences in implementation of the standard
        Differences in local information retrieval systems
            Search functionality
            Indexing policies



          These threats can be addressed by
            Z39.50 specifications and configuration (e.g., profiles)
            Enhancing local information retrieval systems
            Recommendations for local indexing decisions


       Z-Interop Phase 1
          Test dataset
              Approximately 400,000 MARC 21 records from OCLC
          Z39.50 reference implementations
              Z-client, Z-server, information retrieval system
              Configured to the profile specifications
          Test scenarios & searches
              Searches with known result records from dataset
          Benchmarks
              Results of test searches against reference implementations

For more information, visit the project website: http://www.unt.edu/zinterop/
       Phase 1 interop testing
          Reference Z39.50 client
              Configured to support profile specifications
          Vendor Z39.50 server
              Configured by vendor for conformance to profile
          Test dataset
              Loaded by vendor or library
              Indexed by vendor according to vendor's specifications
          Test searches
              Retrieval results compared to retrieval benchmarks
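
The comparison of retrieval results to benchmarks can be sketched as a set comparison. This is a minimal illustration, not the project's tooling: the function name and the record identifiers are hypothetical, and it assumes each search returns a collection of record IDs.

```python
def compare_to_benchmark(benchmark_ids, result_ids):
    """Compare one search's result set with the benchmark result set."""
    benchmark, results = set(benchmark_ids), set(result_ids)
    return {
        # Records the reference implementation found but the vendor system did not
        "missing": sorted(benchmark - results),
        # Records the vendor system returned that the benchmark does not contain
        "unexpected": sorted(results - benchmark),
        "match": benchmark == results,
    }


# Hypothetical record IDs, for illustration only
outcome = compare_to_benchmark(["rec1", "rec2", "rec3"], ["rec2", "rec3", "rec9"])
print(outcome)
```

A mismatch alone does not say which layer caused it (profile configuration, IR system, or indexing); it only flags the search for closer inspection.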
       Z-Interop Phase 2
          Specially designed MARC records: Radioactive MARC (RadMARC)
           records
              Concept coined by Sebastian Hammer, Index Data
              Records will be publicly available, possibly through OCLC
          A set of test searches and an automatic testing script that
           issues searches, retrieves records, and reports on the
           search and retrieval results
              Developed by Index Data
              Will be released under GPL
          A database of MARC documentation that enables the
           automatic identification of types of searches to issue
              Developed by UNT
              MARCdocs Database
       Radioactive MARC records
        Specially designed diagnostic records
        Legitimate instances of the MARC record structure
        Fields/subfields contain content-rich tokens
              A token is a string of characters with a specific
               structure and semantics that serves as a “word” or
               other data value in a specific field/subfield.
          Multiple sets of RadMARC records, distinguished by
           the amount of content designation populated



       Structure of RadMARC tokens
          A single alpha character for left-hand padding.
               Value = r
          A single alpha character to indicate the format of the material being described or type
           of record
               Value = Selected values as defined in MARC Leader/06 – Type of Record or the
                Leader/07 – Bibliographic Level
          Three digits indicating the Field Tag
               Value = Defined in MARC 21 specifications
          A single digit indicating the occurrence number of the Field Tag
               Value = Sequential number starting with 1
          A single alpha character to indicate the Subfield Code
               Value = Defined in MARC 21 specifications
          A single digit indicating the offset within the subfield
               Value = Use the following scheme: 1 = first token in subfield, 2 = second token in
                subfield, 3 = third token in subfield, etc.
          A single alpha character for right-hand padding
               Value = r
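
The token layout above can be sketched as a small builder and parser. This is a minimal illustration that follows the seven parts listed on this slide. The names `RadToken`, `build_token`, and `parse_token` are hypothetical, and the pattern assumes lowercase letters and single digits exactly as the slide describes.

```python
import re
from typing import NamedTuple


class RadToken(NamedTuple):
    record_type: str  # Leader/06 or Leader/07 value, e.g. "a"
    tag: str          # three-digit field tag, e.g. "245"
    occurrence: int   # occurrence of the field in the record, starting with 1
    subfield: str     # subfield code, e.g. "a"
    offset: int       # position of the token within the subfield, starting with 1


# r + type + tag (3 digits) + occurrence + subfield code + offset + r
TOKEN_RE = re.compile(r"r([a-z])(\d{3})(\d)([a-z])(\d)r")


def build_token(t: RadToken) -> str:
    token = f"r{t.record_type}{t.tag}{t.occurrence}{t.subfield}{t.offset}r"
    if TOKEN_RE.fullmatch(token) is None:
        raise ValueError(f"parts do not fit the token layout: {t!r}")
    return token


def parse_token(text: str) -> RadToken:
    m = TOKEN_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a RadMARC token: {text!r}")
    record_type, tag, occurrence, subfield, offset = m.groups()
    return RadToken(record_type, tag, int(occurrence), subfield, int(offset))


print(build_token(RadToken("a", "245", 1, "a", 1)))
print(parse_token("ra2451a1r"))
```

Because every token encodes where it was placed, a record retrieved by a token search shows directly which field and subfield the target system indexed.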

       Token example
          ra2451a1r
              r - Left-hand padding
              a - Type of record; here, a book-type record
              245 - Field tag
              1 - First occurrence of field in record
              a - Subfield code
              1 - Offset within subfield, where 1 = first token in subfield
              r - Right-hand padding
          RadMARC example record



       Test scripts
        Automate interoperability testing and reporting
        Test searches defined by Bath Profile and US
         National Z39.50 Profile
        RadioMARC Perl module
            Automatically generates Z39.50 queries with tokens as
             search terms
            Sends searches to target servers known to contain
             copies of specific records
            Generates reports dependent on whether or not the
             expected records are present in the result set
          Sample output of testing
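
The generate, send, and report cycle described above can be sketched as follows. This is not the RadioMARC Perl module; it is a hedged Python illustration. The queries use Prefix Query Format with only the Bib-1 Use attribute (4 = title, 21 = subject heading, 1003 = author); the profiles define further attributes that are omitted here. The `search` callable, the record ID, and the stand-in server are assumptions so the example runs without a network connection.

```python
# Bib-1 Use attribute values for three common access points
USE_ATTRIBUTES = {"author": 1003, "title": 4, "subject": 21}


def pqf_query(access_point, token):
    """Build a PQF query that searches one access point for one token."""
    return f"@attr 1={USE_ATTRIBUTES[access_point]} {token}"


def run_tests(cases, search):
    """Issue one search per case and report whether the expected record came back.

    `search` is any callable that takes a query string and returns record IDs.
    """
    report = []
    for access_point, token, expected_id in cases:
        query = pqf_query(access_point, token)
        found = expected_id in search(query)
        report.append((query, "PASS" if found else "FAIL"))
    return report


def fake_search(query):
    # Stand-in server (hypothetical): indexes 245 $a for title searches only
    index = {"@attr 1=4 ra2451a1r": {"rad-0001"}}
    return index.get(query, set())


cases = [
    ("title", "ra2451a1r", "rad-0001"),
    ("author", "ra1001a1r", "rad-0001"),
]
for query, status in run_tests(cases, fake_search):
    print(status, query)
```

In a real run, `search` would wrap a Z39.50 client session against a target known to hold the RadMARC records.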
       MARCdocs database
          Pilot effort aimed at structuring MARC 21 documentation into
           a relational database
          Stores information about all content designation available in
           the MARC 21 Format for Bibliographic Data specifications
          Stores additional information about profile-defined searches
           necessary to the automatic test scripts
          Implementation uses MySQL and PHP
          Example display from MARCdocs
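
One way to picture such a database is the sketch below. The actual MARCdocs schema is not shown in these slides, so the table and column names are hypothetical, and Python's built-in sqlite3 stands in for the MySQL and PHP implementation. The three sample rows use MARC 21 field and subfield names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE content_designation (      -- hypothetical table
    tag      TEXT NOT NULL,
    subfield TEXT NOT NULL,
    name     TEXT NOT NULL,
    PRIMARY KEY (tag, subfield)
);
CREATE TABLE profile_search (           -- hypothetical table
    search_name TEXT NOT NULL,
    tag         TEXT NOT NULL,
    subfield    TEXT NOT NULL,
    FOREIGN KEY (tag, subfield) REFERENCES content_designation (tag, subfield)
);
""")
conn.executemany(
    "INSERT INTO content_designation VALUES (?, ?, ?)",
    [
        ("100", "a", "Personal name"),
        ("245", "a", "Title"),
        ("650", "a", "Topical term or geographic name entry element"),
    ],
)
conn.executemany(
    "INSERT INTO profile_search VALUES (?, ?, ?)",
    [("author", "100", "a"), ("title", "245", "a"), ("subject", "650", "a")],
)


def fields_for_search(search_name):
    """Return the fields/subfields a test script should probe for a search type."""
    rows = conn.execute(
        """SELECT cd.tag, cd.subfield, cd.name
           FROM profile_search ps
           JOIN content_designation cd
             ON cd.tag = ps.tag AND cd.subfield = ps.subfield
           WHERE ps.search_name = ?
           ORDER BY cd.tag, cd.subfield""",
        (search_name,),
    )
    return rows.fetchall()


print(fields_for_search("title"))
```

A lookup like this is what lets a test script decide automatically which type of search to issue for a token found in a given field/subfield.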




       Question space for Z-Interop2
        Profile conformance level: Addresses the
         interoperability between the Z-client and Z-server
        Information retrieval (IR) system level: Addresses
         the capability of the IR system underlying the online
         catalog application
        Metadata record level: Concerned with how the IR
         system indexes fields in the metadata record
        Data content level: Addresses normalization of
         data, hyphenated words, special characters and
         diacritics, etc.
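
The data content level can be illustrated with one possible normalization policy. This is an assumption for illustration only; local systems differ in how they treat case, diacritics, and hyphens, and those differences are what the diagnostic records are meant to expose.

```python
import unicodedata


def normalize(term):
    """Fold case, strip diacritics, and split hyphenated words into separate words."""
    decomposed = unicodedata.normalize("NFKD", term)
    # Drop combining marks (accents) left over after decomposition
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold().replace("-", " ").split()


print(normalize("\u00c9lan-Vital"))
```

A system that keeps the hyphenated form as a single word, or that retains the accent, would index the same data differently and could miss a search that another system satisfies.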
       RadMARC record sets
        What content designation should be populated in
         RadMARC records to support interoperability
         testing?
        MARC 21 defines approximately 2,000 structures for
         holding data
        Z-Interop2 approach
            Develop multiple RadMARC record sets
            Increasing amount of content designation populated

          Informed by MARC content designation analysis

       Z-Interop test dataset
        Approximately 1% sample of MARC records from
         OCLC’s WorldCat database
        419,657 total MARC records
        89% of records “full level” cataloging
        Formats represented in test dataset
              Books:                       91%
              Sound recordings:             4%
              Serials:                      3%
              Visual Materials:             1%
              Cartographic Materials:     < 1%
              Electronic resources:       < 1%
              Archival/Mixed Materials:   < 1%



       MARC 21 content designation
       MARC 21         Currently    Obsolete    Total    MARC 1972
       Field Groups    Defined                           (Books Format Only)

       00x                     6           1        7            3
       0xx                   238           7      245           28
       1xx                    66           1       67           40
       2xx                   137          32      169           15
       3xx                   109          32      141            4
       4xx                    69           0       69           37
       5xx                   323          38      361            8
       6xx                   184           5      189           66
       7xx                   452          47      499           41
       8xx                   141          20      161           36
       TOTAL                1725         183     1908          278
       Fields populated in Z-Interop dataset
       MARC 21         Currently    Obsolete    Unlikely    Total
       Field Groups    Defined                      Used

       00x                     6           0           0        6
       0xx                    96           1          33      130
       1xx                    49           0           2       51
       2xx                    81           0          19      100
       3xx                    23           6           0       29
       4xx                    10           0          30       40
       5xx                   128           1           3      132
       6xx                   104           1           7      112
       7xx                   205           0           5      210
       8xx                   105           3           8      116
       TOTAL                 807          12         107      926
       Occurrence summary
       Total number of fields/subfields occurring in dataset = 13,849,499

       Frequency             # of Fields/Subfields    % of All Occurrences
       > 600,000                                 1                    4.4%
       500,000 - 599,999                         0                      0%
       400,000 - 499,999                        13                   39.9%
       300,000 - 399,999                         6                   14.3%
       200,000 - 299,999                         6                   10.6%
       100,000 - 199,999                        10                   10.3%
       TOTAL                                    36                   79.5%

       Only 4% of all fields/subfields account for 80% of all occurrences
       or
       96% of all fields/subfields account for 20% of all occurrences
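
The arithmetic behind this summary can be checked from the slide's own figures. The sketch below assumes that the 926 fields/subfields populated in the Z-Interop dataset are the denominator for the "4%" figure; the slides do not state that denominator explicitly.

```python
# (frequency band, number of fields/subfields, % of all occurrences), from the slide
bands = [
    ("> 600,000", 1, 4.4),
    ("500,000 - 599,999", 0, 0.0),
    ("400,000 - 499,999", 13, 39.9),
    ("300,000 - 399,999", 6, 14.3),
    ("200,000 - 299,999", 6, 10.6),
    ("100,000 - 199,999", 10, 10.3),
]

top_fields = sum(count for _, count, _ in bands)
share_of_occurrences = round(sum(pct for _, _, pct in bands), 1)

POPULATED_FIELDS = 926  # assumption: total fields/subfields occurring in the dataset
share_of_fields = round(100 * top_fields / POPULATED_FIELDS, 1)

print(top_fields, share_of_occurrences, share_of_fields)
```

Under that assumption, 36 fields/subfields are about 3.9% of those populated and carry 79.5% of all occurrences, which the slide rounds to 4% and 80%.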
       Characteristics of top 36
        Most frequently occurring: 650 $a [Subject data]
        2nd most frequently occurring: 040 $d [Cataloging
         source]
        3rd & 4th most frequently occurring: 260 $a & $b
         [Publication information]
        5th most frequently occurring: 245 $a [Title]
        Contain data useful to end users: 28
        Contain control numbers, etc.: 5
        Contain data useful to catalogers: 3

       Indexing & MARC
        Indexing Guidelines to Support Z39.50 Profile
         Searches (available on Z-Interop website)
        Identified all MARC 21 fields/subfields that can
         contain author, title, or subject data
            Author-related fields/subfields:        119
            Author/title-related fields/subfields:   21
            Title-related fields/subfields:         253
            Subject-related fields/subfields:       144




       Occurrences in test dataset
          537 fields/subfields can contain author, title, subject data
          381 of these actually occur in Z-Interop dataset
          Total occurrences of the 381 = 4,397,712
          19 of the 381 (5%) account for 80% of all occurrences
              9 of 19 are subject-related
              5 of 19 are author-related
              5 of 19 are title-related
          Preliminary testing using only 19 indexed fields:
              95% - 100% of correct records retrieved!
          The 19 fields/subfields
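
Selecting a small set of fields/subfields that covers a target share of occurrences can be sketched as follows. This is a minimal illustration, not the project's method. The counts are toy values invented for the example (they are not taken from the Z-Interop dataset), and `smallest_covering_set` is a hypothetical name.

```python
def smallest_covering_set(counts, target=0.80):
    """Pick the most frequent fields until they cover `target` of all occurrences."""
    total = sum(counts.values())
    chosen, running = [], 0
    for field, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if running >= target * total:
            break
        chosen.append(field)
        running += n
    return chosen, running / total


# Toy occurrence counts for illustration only
toy_counts = {"650a": 520, "245a": 310, "100a": 110, "700a": 40, "246a": 15, "130a": 5}
chosen, coverage = smallest_covering_set(toy_counts)
print(chosen, coverage)
```

Applied to real occurrence counts, the same procedure would yield a short list such as the 19 of 381 fields/subfields reported on this slide.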

       MCDU Project
        Systematically analyze WorldCat records
        Provide empirical evidence of catalogers’ use of
         MARC content designation
        Contribute to community discussion about core
         elements in MARC bibliographic records – based on
         empirical evidence of actual use
        Inform future sets of RadMARC records



For more information, visit the project website: http://www.mcdu.unt.edu
       Initial RadMARC sets
          Set 1
              10 records
              Populate 19 most frequently occurring Author, Title, Subject fields
              Distinguished by types of materials cataloged
          Set 2
              4 records (100, 110, 111, 130 main entry fields)
              Populate the Author, Title, Subject fields occurring 1,000 or more
               times (approximately 50 fields/subfields populated)
          Set 3
              Records populated based on:
                • The LC Network Development and MARC Standards Office
                  recommendations for national level records
                • The Program for Cooperative Cataloging (2003) core record standards.

       Extensibility of RadMARC
          Records can be as simple or as complex as needed
          Custom records for a library that wants to assess specific
           indexing or other policies by interrogating system behavior
          Assess normalization of characters
          Testing transformation from one metadata scheme to
           another
              MARC Record
              MARCXML Transformation
              MODS Transformation
              DC Transformation
          Other metadata environments?
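
Testing a transformation with tokens can be sketched as a check that each token survives the crosswalk. This is a toy illustration under stated assumptions: the two-entry crosswalk (245 $a to title, 650 $a to subject) and the function names are hypothetical, and the MARCXML fragment is abbreviated (no leader or control fields). The namespace URIs are the standard MARCXML and Dublin Core element set namespaces.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Abbreviated MARCXML record carrying RadMARC-style tokens
MARCXML = f"""<record xmlns="{MARC_NS}">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">ra2451a1r ra2451a2r</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">ra6501a1r</subfield>
  </datafield>
</record>"""

# Toy crosswalk (assumption): (field tag, subfield code) -> Dublin Core element
CROSSWALK = {("245", "a"): "title", ("650", "a"): "subject"}


def marcxml_to_dc(xml_text):
    record = ET.fromstring(xml_text)
    dc = ET.Element("dc")
    for field in record.findall(f"{{{MARC_NS}}}datafield"):
        for sub in field.findall(f"{{{MARC_NS}}}subfield"):
            target = CROSSWALK.get((field.get("tag"), sub.get("code")))
            if target:
                ET.SubElement(dc, f"{{{DC_NS}}}{target}").text = sub.text
    return dc


def surviving_tokens(dc):
    """Map each Dublin Core element name to the tokens found in it."""
    return {el.tag.split("}")[1]: el.text.split() for el in dc}


print(surviving_tokens(marcxml_to_dc(MARCXML)))
```

Since each token names its source field and subfield, a token that lands in the wrong target element, or disappears, identifies the crosswalk rule at fault.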
       Concluding thoughts
        Exploring an innovative conceptual and technical
         approach for interoperability testing.
        Conducting a proof-of-concept for a radioactive
         record approach for diagnosing interoperability
         factors in an identified question space
        Extensible in terms of the current focus
              Creating different sets of RadMARC records to diagnose
               general or specific system and interoperability issues
          Extensible to other application environments,
           metadata schemes, and protocols.
       References
          Z39.50 Interoperability Testbed
              http://www.unt.edu/zinterop/
          MARC Content Designation Utilization Project
              http://www.mcdu.unt.edu/
          Assessing Metadata Utilization: An Analysis of
           MARC Content Designation Use
              http://www.unt.edu/wmoen/publications/MARCPaper_Final2003pdf.pdf
          Indexing Guidelines to Support Z39.50 Profile
           Searches
              http://www.unt.edu/zinterop/Documents/IndexingGuidelines1Feb2002.pdf
          MARCdocs Database (public interface)
              http://meta.lis.unt.edu/MARCdocs2
