An overview of data extraction methods
SHERPA Digital Preservation (http://www.sherpadp.org.uk)



An Overview of Data Extraction Methods
Project: SHERPA DP2

Author: Rishi Sharma

Draft/Version: Version 1

Date of edition completion: 19 September 2007

Change history

 Date           Version                                    Author
 01/08/2007     First draft                                Rishi Sharma
 19/09/2007     First version, revised on the basis of     Rishi Sharma
                feedback from the project team
 29/04/2009     First version, copied into template        Gareth Knight
                with minor corrections



Contents

   Introduction
      Metadata Standards
      Dublin Core
      MODS
   Export methods
      METS
      MPEG21-didl
   Implementation
      Base64
      RSS/ATOM
      Web Extraction
      FTP / SCP
   Conclusions
   Appendix
   Glossary




Page 1 of 9                                                            File: data-extraction-v1.doc
Created by Rishi Sharma                                                Created on 9/19/2007 4:11 PM




Introduction

The purpose of this document is to explore different types of metadata standard and the
methods available for extracting metadata, focusing mainly on the Dublin Core and MODS
metadata standards.

Metadata Standards
A number of metadata formats exist. These include:
    • Bibliographic Metadata Standards (MARC, MODS, ONIX)
    • Archival/Record Management Metadata Standards (ISAD (G), EAD, EAC)
    • Museum Metadata Standards (SPECTRUM)
    • Image Metadata Standards (VRA Core)
    • Government Metadata Standards (Dublin Core)
    • Learning Metadata Standards (LOM)
    • Multimedia Metadata Standards (MPEG–7)

The following sections provide an overview of the various formats.

Dublin Core

The Dublin Core metadata element set is a standard for cross-domain information resource
description. It provides a simple and standardized set of conventions for describing things
online in ways that make them easier to find. Dublin Core is widely used to describe digital
materials such as video, sound, image, text, and composite media like web pages.

The Dublin Core standard includes two levels: Simple and Qualified. Simple Dublin Core
comprises fifteen elements; Qualified Dublin Core includes three additional elements
(Audience, Provenance and Rights Holder), as well as a group of element refinements (also
called qualifiers) that refine the semantics of the elements in ways that may be useful in
resource discovery.

The Simple Dublin Core Metadata Element Set (the Dublin Core) is "a 15 element metadata
set that is primarily intended to aid resource discovery on the Web". The elements in the
Dublin Core are TITLE, SUBJECT, DESCRIPTION, CREATOR, PUBLISHER,
CONTRIBUTOR, DATE, TYPE, FORMAT, IDENTIFIER, SOURCE, LANGUAGE, RELATION,
COVERAGE and RIGHTS. The metadata elements fall into three groups, which roughly
indicate the class or scope of the information stored in them:

  1.   Elements that relate to the Content of the resource [Title, Subject, Description, Type,
       Source, Relation and Coverage].
  2.   Elements that relate to the resource when viewed as Intellectual Property [Creator,
       Publisher, Contributor and Rights].
  3.   Elements that relate to the Instantiation of the resource [Date, Format, Identifier and
       Language].

Implementations of Dublin Core typically make use of XML and are based on the Resource
Description Framework.
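To make the element set concrete, the following sketch (Python, purely illustrative; the sample values are invented, not drawn from any real repository) builds a Simple Dublin Core record as XML using the standard dc element namespace and reads one field back:

```python
import xml.etree.ElementTree as ET

# Namespace of the Simple Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Build a minimal record from (element, value) pairs; the values
# here are sample data only.
record = ET.Element("record")
for element, value in [
    ("title", "An Overview of Data Extraction Methods"),
    ("creator", "Rishi Sharma"),
    ("date", "2007-09-19"),
    ("format", "text/plain"),
]:
    child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
    child.text = value

xml_str = ET.tostring(record, encoding="unicode")

# The record round-trips: parse it and read the title element back.
title = ET.fromstring(xml_str).find(f"{{{DC_NS}}}title").text
print(title)
```

The same fifteen elements serve any resource type, which is precisely why Dublin Core is so widely used for cross-domain description.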

MODS
The Metadata Object Description Schema (MODS) is an XML-based bibliographic description
schema developed by the Library of Congress' Network Development and MARC Standards
Office. MODS was designed as a compromise between the complexity of the MARC format
used by libraries and the extreme simplicity of Dublin Core metadata. As an XML schema, it
is intended to carry selected data from existing MARC 21 records as well as to enable the
creation of original resource description records. It includes a subset of MARC fields and
uses language-based tags rather than numeric ones.






MODS could potentially be used as an extension schema to METS (Metadata Encoding and
Transmission Standard); to represent metadata for harvesting; for original resource
description in XML syntax; for representing a simplified MARC record in XML; or for
metadata in XML that may be packaged with an electronic resource.

MODS includes a subset of data from the MARC 21 Format for Bibliographic Data. As an
element set that represents data already in MARC-based systems, it is intended to allow the
conversion of core fields, although some specific data may be dropped. As an element set
for original resource description, it allows a simple record to be created, in some cases
using more general tags than those available in the MARC record.
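The contrast with MARC's numeric fields shows up clearly in even a minimal MODS record: the title lives under a readable titleInfo/title path rather than in field 245. A small parsing sketch (Python; the record content is invented):

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# A minimal, hand-written MODS record. In MARC the title would sit
# in numeric field 245 and the author in field 100; MODS uses
# language-based tags instead.
mods_xml = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>An Overview of Data Extraction Methods</title>
  </titleInfo>
  <name type="personal">
    <namePart>Rishi Sharma</namePart>
  </name>
  <typeOfResource>text</typeOfResource>
</mods>
"""

root = ET.fromstring(mods_xml)
title = root.find(f"{{{MODS_NS}}}titleInfo/{{{MODS_NS}}}title").text
author = root.find(f"{{{MODS_NS}}}name/{{{MODS_NS}}}namePart").text
print(title, "/", author)
```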

Export methods

A number of metadata standards expose metadata through different methods; the following
sections explore the most common methods used to extract metadata. Several standard
methods provide the metadata in XML form.

METS

The Metadata Encoding and Transmission Standard (METS) schema is a standard for
encoding descriptive, administrative, and structural metadata about objects within a digital
library, expressed using the XML Schema language of the World Wide Web Consortium.

METS is an XML schema designed to expose the hierarchical structure of digital library
objects by recording the names and locations of the files that comprise those objects.

A METS document typically consists of seven sections:

  1.   METS Header: This section contains metadata describing the METS document itself,
       including such information as creator, editor, etc.

  2.   Descriptive Metadata: This section may point to descriptive metadata external to the
       METS document or contain internally embedded descriptive metadata, or both. Multiple
       instances of both external and internal descriptive metadata may be included in the
       descriptive metadata section.

  3.   Administrative Metadata: This section describes how the files were created and
       stored, rights, and metadata regarding the original source object from which the digital
       library object derives, and information regarding the provenance of the files comprising
       the digital library object. As with descriptive metadata, administrative metadata may be
       either external to the METS document, or encoded internally.

  4.   File Section: The file section lists all files containing content, which comprise the
       electronic versions of the digital object. <File> elements may be grouped within
       <FileGrp> elements, to provide for subdividing the files by object version.

  5.   Structural Map: This section is the heart of a METS document. It outlines a
       hierarchical structure for the digital library object, and links the elements of that
       structure to content files and metadata that pertain to each element.

  6.   Structural Links: This section allows METS creators to record the existence of
       hyperlinks between nodes in the hierarchy outlined in the Structural Map. This is of
       particular value in using METS to archive Websites.

  7.   Behavioral: A behavior section can be used to associate executable behaviors with
       content in the METS object. Each behavior also has a mechanism element, which
       identifies a module of executable code that implements and runs the behaviors
       defined abstractly by the interface definition.
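The interplay between the file section (4) and the structural map (5) can be sketched as follows. This is a deliberately simplified METS-like document (a real METS file would use xlink:href on FLocat and carry further required attributes, and the file names here are invented); the point is how fptr elements in the structural map resolve to file locations:

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Simplified sketch of a two-page digital object: a fileSec listing
# content files and a structMap linking hierarchy nodes to them.
mets_xml = """
<mets xmlns="http://www.loc.gov/METS/">
  <fileSec>
    <fileGrp USE="original">
      <file ID="F1"><FLocat href="page1.tif"/></file>
      <file ID="F2"><FLocat href="page2.tif"/></file>
    </fileGrp>
  </fileSec>
  <structMap TYPE="physical">
    <div TYPE="document">
      <div TYPE="page"><fptr FILEID="F1"/></div>
      <div TYPE="page"><fptr FILEID="F2"/></div>
    </div>
  </structMap>
</mets>
"""

root = ET.fromstring(mets_xml)

# Index the file section by ID ...
locations = {f.get("ID"): f.find(f"{{{METS_NS}}}FLocat").get("href")
             for f in root.iter(f"{{{METS_NS}}}file")}

# ... then resolve every pointer in the structural map to a file.
pages = [locations[fptr.get("FILEID")]
         for fptr in root.iter(f"{{{METS_NS}}}fptr")]
print(pages)
```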

Page 3 of 9                                      File: data-extraction-v1.doc
Created by Rishi Sharma                          Created on 9/19/2007 4:11 PM
An overview of data extraction methods                                  SHERPA Digital Preservation
                                                                        http://www.sherpadp.org.uk




While METS is useful for transmitting the required metadata for each item to be exported,
there is also a need to identify the items to be exported.

METS also supports a complex object format. (Expressive formats that permit the
representation of digital objects have emerged from several communities, and are
commonly referred to as complex object formats.)

Complex object formats typically share the following core characteristics:

    •    Representation of a digital object by means of a wrapper XML document.
    •    The ability to represent both simple digital objects (consisting of a single
         datastream) and compound digital objects (consisting of multiple datastreams).
    •    The ability to unambiguously convey identifiers of the digital object and its
         constituent datastreams.
    •    The ability to include a datastream in two ways:

              1. By-Value: embedding a base64 encoding [Freed & Borenstein, 1996] of the
                 datastream inside the wrapper XML document.
              2. By-Reference: unambiguously embedding the network location of the
                 datastream inside the wrapper XML document. This approach is considered
                 fully equivalent to the By-Value approach.

    •    The ability to include a variety of secondary information pertaining to a datastream.

This includes descriptive metadata, rights information, technical metadata, etc. This
secondary information can likewise be provided by-value or by-reference.
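The By-Value and By-Reference options can be sketched with a generic wrapper document. The element names below are illustrative only, not the actual METS or DIDL tag names, and the payload and URL are invented:

```python
import base64
import xml.etree.ElementTree as ET

payload = b"\x25PDF-1.4 sample bytes"  # an arbitrary datastream

wrapper = ET.Element("object", id="oai:example.org:1")

# By-Value: embed a base64 encoding of the datastream itself.
by_value = ET.SubElement(wrapper, "datastream",
                         mimetype="application/pdf")
by_value.text = base64.b64encode(payload).decode("ascii")

# By-Reference: embed only the network location of the datastream.
ET.SubElement(wrapper, "datastream", mimetype="application/pdf",
              ref="http://repository.example.org/id/1.pdf")

# The by-value copy round-trips to the original octets, which is
# why the two approaches are considered equivalent.
decoded = base64.b64decode(wrapper[0].text)
print(decoded == payload)
```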

MPEG21-didl
MPEG-21 is a metadata standard devised for use by the audiovisual community to meet the
requirements of its characteristic resources. Its benefits include the ability to define complex,
multi-file, digital objects to which multiple rights typically pertain, that are played over a period
of time, and which typically require description at various levels of granularity from the digital
object as a whole down to the level of, say, individual frames or seconds. MPEG-21 is a
wrapper that can bring together a complicated array of metadata elements from a variety of
standards into a single machine-readable and navigable structure.

Implementation

Base64
Base64 is a positional notation using a base of 64. The first known use of Base64 encoding
for electronic data transfer was the Privacy-Enhanced Mail (PEM) protocol, proposed by
RFC 989 in 1987. PEM defines a "printable encoding" scheme that uses Base64 encoding to
transform an arbitrary sequence of octets into a format that can be expressed in short lines
of 7-bit characters, as required by transfer protocols such as SMTP.

Base64 encoding is typically used to transfer data safely via XML, to insert binary values
into a database, or to carry lengthy values in an HTTP environment. Hibernate, a database
persistence framework for Java objects, uses Base64 encoding to encode a relatively large
unique id into a string for use as an HTTP parameter in HTTP forms or HTTP GET URLs.
More generally, many applications need to encode binary data in a way that is convenient
for inclusion in URLs, including in hidden web form fields, and Base64 renders such data in
a compact, transport-safe (though not human-readable) format. Note that Base64 is an
encoding, not encryption; it provides no confidentiality.
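A short sketch of the URL-safe variant using the Python standard library (the identifier bytes are invented):

```python
import base64

# Arbitrary binary identifier, including bytes that would be unsafe
# in a URL or form field if sent raw.
unique_id = b"\x07\x12\xff record-4711"

# urlsafe_b64encode swaps '+' and '/' for '-' and '_', so the token
# needs no percent-escaping in URLs or hidden form fields.
token = base64.urlsafe_b64encode(unique_id).decode("ascii")
print(token)

# Base64 is reversible -- an encoding, not encryption.
assert base64.urlsafe_b64decode(token) == unique_id
```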

RSS/ATOM
RSS is a family of web feed formats used to publish frequently updated content such as blog
entries and news headlines. At its core an RSS feed is an XML document, and RSS modules
extend the basic XML schema. An RSS document, which is called a "feed", "web feed", or
"channel", contains either a summary of content from an associated web site or the full text.

The Atom brand is applied to two related standards: the Atom Syndication Format is an XML
language used for web feeds, while the Atom Publishing Protocol is a simple HTTP-based
protocol for creating and updating web resources.

Atom has a carefully designed payload container. Content must be explicitly labelled as one
of: plain text with no mark-up (the default); escaped HTML, as commonly used with RSS 2.0;
well-formed XHTML mark-up; another XML vocabulary; Base64-encoded binary content; or
a pointer to web content not included in the feed.

Although RSS and Atom are useful for tracking frequently updated data, there are a number
of problems that must be recognised when utilising them for the project.

  1.   There is no concept of revisioning or updates for items: since RSS does not support
       grouping by revision, the same item record may appear in a feed three or four times.
       The description has been updated, but the summary displayed to the RSS reader
       remains the same.

  2.   Feeds cannot contain other feeds: an RSS/Atom feed cannot contain a web feed
       obtained from a different source, preventing feed nesting.

RSS content can be read using software called a "feed reader" or an "aggregator." The
reader checks the user's subscribed feeds regularly for new content, downloading any
updates that it finds.

RSS can be used to extract metadata. EPrints 3 introduced a powerful feature called the
Export plugin, through which data can be extracted using the RSS export plugin.
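The extraction step itself is straightforward once a feed has been fetched: an RSS 2.0 document is plain XML, so item metadata can be pulled out directly. A sketch (Python; the feed content is invented and stands in for one retrieved over HTTP from a repository's export interface):

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 feed with two items.
rss = """
<rss version="2.0">
  <channel>
    <title>Repository updates</title>
    <item>
      <title>New e-print deposited</title>
      <description>Metadata for the latest deposit.</description>
    </item>
    <item>
      <title>Record updated</title>
      <description>Revised description.</description>
    </item>
  </channel>
</rss>
"""

# RSS 2.0 uses no namespace, so plain tag names suffice.
channel = ET.fromstring(rss).find("channel")
items = [(item.findtext("title"), item.findtext("description"))
         for item in channel.findall("item")]
print(items)
```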

Web Extraction
Web data extraction is a method of collecting data from multiple web sites using automated
tools. The process of extracting data is also referred to as web scraping or web data mining.
To derive metadata from the collected data, it uses well-established techniques and
technologies for text/XML manipulation such as XSLT, XQuery, XML Schema and regular
expressions.
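Of these techniques, regular expressions are the simplest to demonstrate. The sketch below (Python; the page content is invented) pulls a title and an embedded Dublin Core meta tag out of raw HTML; for anything less predictable, a proper HTML or XML parser is preferable:

```python
import re

html = """
<html><head>
  <title>An Overview of Data Extraction Methods</title>
  <meta name="DC.creator" content="Rishi Sharma">
</head><body>...</body></html>
"""

# Non-greedy match (re.S lets '.' cross newlines) for the title.
title = re.search(r"<title>(.*?)</title>", html, re.S).group(1)

# Grab the content attribute of one specific meta tag.
creator = re.search(
    r'<meta name="DC\.creator" content="([^"]*)"', html).group(1)

print(title, "/", creator)
```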

A number of tools exist to perform web extraction. Web-Harvest is an open-source web data
extraction tool written in Java. It offers a way to collect desired web pages and extract useful
data from them, using XSLT, XQuery and regular expressions. Web-Harvest can be used
both from the command line, as an executable jar file, and from Java code. It may be used
on any platform with a Java 2 runtime environment installed; although it should work even
with Java 1.3, version 1.4 or higher is recommended.

FTP / SCP

File Transfer Protocol (FTP) is used to transfer data from one computer to another over the
Internet or through a network, and could be used to extract metadata directly from a server
over the Internet. A variety of FTP software is available.

SCP is a secure equivalent of FTP. The rsync or SCP utility can be used to synchronise data
between an EPrints repository and the preservation system. A nightly crontab entry can be
set up on institutional repositories to initiate the data transfer to the preservation system
automatically using SCP. The export interface can be used to extract the metadata. The
content extraction tool, which will group all the documents of an e-print together, has to be
implemented separately.
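The FTP route can be sketched with Python's standard ftplib. The host, credentials and paths below are placeholders, and the function is only an illustration of the download step, not part of any EPrints tooling:

```python
from ftplib import FTP
from pathlib import Path

def fetch_metadata(host: str, user: str, password: str,
                   remote_path: str, local_dir: str) -> Path:
    """Download one exported metadata file over FTP.

    As noted in the text, FTP sends credentials and content in
    clear text, so SCP or rsync is preferable for anything
    sensitive.
    """
    local = Path(local_dir) / Path(remote_path).name
    with FTP(host) as ftp:  # FTP(host) connects immediately
        ftp.login(user, password)
        with open(local, "wb") as out:
            ftp.retrbinary(f"RETR {remote_path}", out.write)
    return local

# The SCP/rsync alternative is typically a nightly cron entry, e.g.:
#   0 2 * * *  rsync -az repo.example.ac.uk:/eprints/export/ /preservation/
```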

Though FTP has a number of advantages, it also has disadvantages. Passwords and file
content are sent in clear text, so an eavesdropper can easily intercept them. FTP uses
multiple TCP/IP connections, so firewall software needs to be configured accordingly. FTP
also offers no way to transfer data in encrypted form, so under most network configurations
an observer on the same network can view user names, passwords, FTP commands and
file names, although doing so requires a protocol analyzer.

Conclusions

METS, MPEG21-didl, RSS/Atom, web extraction and FTP/SCP all provide mechanisms for
exporting metadata. METS is quite popular, as it exposes a great deal of information,
including complex object formats. RSS and Atom support Dublin Core metadata, are both
capable of handling complex object formats, and implement Base64 text encoding. Every
method has advantages and disadvantages, and it is difficult to say which should be used:
it depends on the kind of data involved and the level of detail required.








Appendix

Web-Harvest Command line usage

Syntax for command line use is the following:

java -jar <whjar> config=<file> workdir=<workpath> [debug=yes|no] [proxyhost=<proxyHost>
[proxyport=<proxyPort>]] [proxyuser=<proxyUser> proxypassword=<proxyPassword>]

Here the available parameters are briefly described:

1. <whjar> is the file path to the Web-Harvest jar (check the Download page for the latest
version).
2. <file> is the path to the Web-Harvest configuration file.
3. <workpath> is the path to the working directory. All directories and files created during
configuration execution will be created below the working directory.
4. The optional debug parameter turns on debug output. By default, debugging is turned off.
5. The optional proxyhost and proxyport parameters specify the HTTP proxy configuration.
6. The optional proxyuser and proxypassword parameters specify HTTP proxy credentials.

Note 1: If the dependent jar files are not on the classpath, it is necessary to include them
with the -classpath or -cp option of the java command line tool.

Note 2: If the names of the dependent jars are not the same as in the download section, it is
also necessary to adjust the names in <whjar>/META-INF/MANIFEST.MF.








Glossary

Archive
A term for an EPrints-based repository, which stores research papers and the metadata
related to them.

E-print
A record or document submitted by an author to the EPrints software. It may consist of
several computer files.

Harvest
A method of collecting metadata from repositories by issuing OAI-PMH requests. A
harvester is a client application that issues such requests to server repositories.

METS
METS is a metadata framework, which provides an XML document format for encoding
metadata necessary for both management of digital library objects within a repository and
exchange of such objects between repositories.

SOAP/WSDL
SOAP, Simple Object Access Protocol is a lightweight framework for exchanging XML-based
information in a decentralized, distributed environment.
WSDL (Web Service Description Language) is an XML-based language for describing
network services that communicate using SOAP.

Servlet
An interactive web interface implemented using Java technology that receives requests and
generates responses based on them.

FTP/SCP
File Transfer Protocol and Secure Copy Protocol are utilities that enable file transfer over
the network.

Rsync
The rsync remote-update protocol transfers just the differences between two sets of files
across a network connection, using an efficient checksum search.

Metadata
Metadata is data about data. An item of metadata may describe an individual datum, or
content item, or a collection of data including multiple content items.

MARC
MARC is an acronym for MAchine-Readable Cataloging. The MARC standards consist of
the MARC formats, which are standards for the representation and communication of
bibliographic and related information in machine-readable form, together with related
documentation. The format was developed by Henriette Avram at the Library of Congress
beginning in the 1960s. It provides the protocol by which computers exchange, use, and
interpret bibliographic information, and its data elements make up the foundation of most
library catalogs used today.









