									Semantic Extensions to Defuddle: Inserting GRDDL into XML
Robert E. McGrath
July 28, 2008
1. Introduction
The overall goal is to enable automatic extraction of semantic metadata from arbitrary data. Our
approach is to accomplish this in two steps: first, convert the data into an XML data model via
DFDL/Defuddle; second, process the XML to produce RDF metadata. The second step can be
done via XSL.
The GRDDL standard defines annotations to associate an XSL with an XML document. This
standard can be used to invoke XSL to extract RDF.
Combining these technologies, we propose to extend DFDL/Defuddle to define GRDDL
statements which will be inserted into the XML output. The Defuddle parser will generate XML
with optional GRDDL statements, and call GRDDL-aware software (e.g., GRDDL.py) to extract
RDF.
2. Description of the Product
The product of this task is to add GRDDL as a post-processing stage to Defuddle. GRDDL fires
zero or more XSL stylesheets, which are expected to create RDF. Thus, this is a mechanism to specify
how to extract RDF metadata from the input files. (GRDDL operates on the XML generated by
Defuddle, so, technically, the extractors will only work on data that is in the DFDL-generated
XML. Data in the original that is not put into XML will not be available.)
2.1. What it does:
GRDDL is a standard for declaring ‘transformations’ that create RDF from XML. This enables
the creation of standard XSL to create semantic metadata (in RDF) from XML that conforms to
given schemas. The XSL is a declarative representation of the extraction.
The extraction is quite specific to the schema, and the XSL must be written by hand. But once the
stylesheets are available, GRDDL lets them be applied automatically.
The Defuddle extension applies GRDDL as a post-processing stage. This enables users to insert
GRDDL annotations into the (DFDL-annotated) schema, which cause RDF to be extracted automatically.
Thus, the annotated schema plus the XSL transforms describe how to extract semantic
metadata from ‘arbitrary’ data files.
2.2. What it is not:
Defuddle does not generate the extraction logic. The extraction process requires creating XSL to
extract RDF. We have examples, but this is a case-by-case task.
The result is a single RDF graph. This is, presumably, suitable for use in semantic infrastructure
such as Tupelo. Defuddle itself does not use the RDF or interpret it (at least not at this time).
3. Basic Technique
The GRDDL standard defines simple statements that are inserted in various forms of XML or
related languages [1]. Since Defuddle works with XSD and XML, we will focus on XML.
The GRDDL assertions define the GRDDL namespace and then list one or more
“transformations” to be applied. Each transformation is a URI that points to a description of the
transformation. For our purposes, the transformations are XSL stylesheets.
While XSL can produce any output, GRDDL and Defuddle are only interested in cases that
produce RDF. GRDDL engines will likely fail and/or produce nothing if the XSL is not
producing RDF.
The GRDDL assertions may be added in two forms: in an individual document, or in an XML
schema (namespace). The former applies to that document only, the latter to all documents that
refer to that namespace. Multiple transforms may be specified for a document, both in the
document and in its namespaces.
3.1. GRDDL annotations
GRDDL markup may be added to the root element of an XML document. In this case, there are two
statements: one to define the namespace, and one to provide a list of transformations, which are
URLs for XSL stylesheets. There may be more than one transformation in the list. Figure 1
illustrates GRDDL in an unqualified XML element.
        <workflow id="120" name="part1"
        xmlns:grddl="http://www.w3.org/2003/g/data-view#"
        grddl:transformation="http://vesta.ncsa.uiuc.edu/GRDDL/xsl/vistrails2rdf.xsl">
        …
                                                       Figure 1
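As a minimal sketch of how a processor might locate the markup in Figure 1, the transformation list can be read from the root element with the Python standard library. The function name find_transformations is ours for illustration and is not part of GRDDL.py or Defuddle. The sketch does not resolve relative URIs against the document's base URI, which a real GRDDL engine would need to do.

```python
import xml.etree.ElementTree as ET

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


def find_transformations(xml_text):
    """Return the transformation URIs listed on the root element, in order."""
    root = ET.fromstring(xml_text)
    # ElementTree exposes namespaced attributes as "{namespace-uri}localname".
    value = root.get("{%s}transformation" % GRDDL_NS, "")
    # The attribute holds a whitespace-separated list of URIs.
    return value.split()


# The root element from Figure 1, closed here so that it parses on its own.
sample = (
    '<workflow id="120" name="part1" '
    'xmlns:grddl="http://www.w3.org/2003/g/data-view#" '
    'grddl:transformation="http://vesta.ncsa.uiuc.edu/GRDDL/xsl/vistrails2rdf.xsl"/>'
)

transformations = find_transformations(sample)
```

A document without GRDDL markup simply yields an empty list, which corresponds to the "zero transformations" case.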
Alternatively, a schema may be annotated to indicate that GRDDL should be applied to all
documents in that namespace. The markup is indirect: a GRDDL transform is applied to the
schema document, which extracts RDF (embedded in the schema) that defines a transform to
be applied to the namespace. Figure 2 shows an illustration from the GRDDL documentation. The
“embeddedRDF.xsl” is a standard XSL that extracts embedded RDF from XML (see footnote 1). The
association of transformations with documents is indicated in an XML annotation containing RDF
that states the relationship (namespace, namespaceTransformation, transformation).
            <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                        xmlns="http:.../Order-1.0"
                        targetNamespace="http:.../Order-1.0"
                        ...
                        xmlns:data-view="http://www.w3.org/2003/g/data-view#"
                        data-view:transformation="http://www.w3.org/2003/g/embeddedRDF.xsl" >
                <xsd:element name="Order" type="OrderType">
                <xsd:annotation>
                  <xsd:documentation>This element is the root element.</xsd:documentation>
                </xsd:annotation>
                              ...
              <xsd:annotation>
                <xsd:appinfo>
                  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
                      <rdf:Description rdf:about="http://www.w3.org/2003/g/po-ex">
                        <data-view:namespaceTransformation
                            rdf:resource="grokPO.xsl" />
                      </rdf:Description>
                  </rdf:RDF>
                </xsd:appinfo>
              </xsd:annotation>
            ...
                                                       Figure 2
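The indirection in Figure 2 can be illustrated with a short sketch that pulls the namespaceTransformation statements directly out of the embedded RDF. This is a shortcut under stated assumptions: a real GRDDL engine would first run the schema's own transformation (embeddedRDF.xsl) and then read the resulting RDF, whereas this sketch walks the XML tree itself. The function name and the example.org target namespace are hypothetical; the rdf:about and rdf:resource values are taken from Figure 2.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


def find_namespace_transformations(schema_text):
    """Return (subject, transformation) pairs declared in RDF embedded in a schema."""
    root = ET.fromstring(schema_text)
    found = []
    # Look at every rdf:Description anywhere in the schema document.
    for desc in root.iter("{%s}Description" % RDF_NS):
        about = desc.get("{%s}about" % RDF_NS)
        for prop in desc.findall("{%s}namespaceTransformation" % GRDDL_NS):
            found.append((about, prop.get("{%s}resource" % RDF_NS)))
    return found


# A reduced, well-formed version of Figure 2 (target namespace is a placeholder).
schema = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:data-view="http://www.w3.org/2003/g/data-view#"
    targetNamespace="http://example.org/Order-1.0"
    data-view:transformation="http://www.w3.org/2003/g/embeddedRDF.xsl">
  <xsd:annotation>
    <xsd:appinfo>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about="http://www.w3.org/2003/g/po-ex">
          <data-view:namespaceTransformation rdf:resource="grokPO.xsl"/>
        </rdf:Description>
      </rdf:RDF>
    </xsd:appinfo>
  </xsd:annotation>
</xsd:schema>"""

pairs = find_namespace_transformations(schema)
```

The relative reference "grokPO.xsl" is returned as written; resolving it against the schema's location is left out of the sketch.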


Footnote 1: This XSL is obsolete and should be replaced by ‘inlineRDF.xsl’.
3.2. GRDDL Processing
There are several GRDDL-aware processors available, notably an add-on for Jena
(http://jena.sourceforge.net/grddl/), and a Python implementation, GRDDL.py
(http://www.w3.org/2001/sw/grddl-wg/td/GRDDL.py).
The GRDDL standard defines correct behavior, but leaves considerable leeway for
implementations. In particular, the GRDDL engine may execute none, some, or all of the
transformations, in any order. The result of each transform will be an RDF model; all the results
are merged into a single RDF model. The model can be output in various serializations.
4. Proposed Implementation
Defuddle starts from a DFDL-annotated schema and non-XML data, and generates an XML
instance that follows the schema. This is done in two steps: creating a parser if needed, and then
parsing the data file into XML (see footnote 2). A third step applies XSL to clean up the XML
(see footnote 3).
We would add a fourth step to apply GRDDL. This step would run a GRDDL engine. The Jena GRDDL
reader would be the first choice since it is written in Java; GRDDL.py could be used as an
alternative. Figure 3 sketches the processing steps, with the GRDDL step added.
[Figure 3: Defuddle processing steps, with the GRDDL step added]
The GRDDL step needs the GRDDL markup, which specifies transforms to apply.
This could be done in one of two ways:
       1. The DFDL-annotated schema could have GRDDL annotations in it. If the resulting XML
          references this schema as its namespace, then that is all that would be needed: GRDDL
          will find the schema and pull out the transforms.
       2. The XML created in steps 2 and 3 could have GRDDL annotations. These will work whether
          the XML refers to the schema or not.
          These could be inserted when Defuddle generates the XML. If the schema has GRDDL
          annotations, Defuddle would have to analyze the GRDDL and insert attributes in the root
          element. This is straightforward.
Footnote 2: Query: is the result written out as fully qualified XML?
Footnote 3: Need more info on this step.
A third alternative would be to create some new DFDL annotations. Unless we have other
reasons to do so, it is not necessary to create new DFDL.
Alternative 1 is simple and elegant, and the schema must exist in all cases. However, it is not
clear if the output always references the schema, or if the schema will be available to the
GRDDL stage.
Alternative 2 will work in less restricted circumstances (GRDDL only needs to find the
transforms, not recur through namespaces). It is less elegant, but doable.
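Alternative 2 amounts to writing the GRDDL namespace declaration and the transformation list onto the root element of the generated XML. The following is a minimal sketch of that insertion using the Python standard library; it is not Defuddle code, and the function name add_grddl_markup and the example.org URLs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


def add_grddl_markup(xml_text, transformation_uris):
    """Return xml_text with the given transformations listed on the root element."""
    # Serialize the GRDDL namespace with the conventional "grddl" prefix.
    ET.register_namespace("grddl", GRDDL_NS)
    root = ET.fromstring(xml_text)
    key = "{%s}transformation" % GRDDL_NS
    # Keep any transformations already present, then append the new ones.
    combined = root.get(key, "").split()
    for uri in transformation_uris:
        if uri not in combined:
            combined.append(uri)
    if combined:
        root.set(key, " ".join(combined))
    return ET.tostring(root, encoding="unicode")


annotated = add_grddl_markup(
    '<workflow id="120" name="part1"/>',
    ["http://example.org/a.xsl", "http://example.org/b.xsl"],  # placeholder URLs
)
```

Because the attribute is only set when the list is non-empty, XML produced without any GRDDL annotations is left unchanged apart from reserialization.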
4.1. Output
In either case, the output is RDF. Technically, each transform creates an RDF model, and all the
models are “merged” a la the RDF specification. The result is a single RDF model. The GRDDL
engines will write this out in one of several serializations.
Note that GRDDL produces exactly one output model, which is the union of all the RDF models
produced.
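The merge can be sketched as a set union, with one caveat from the RDF specification: blank nodes from different models must be kept distinct even if they happen to share a label. The sketch below is an illustration only, not what Jena or GRDDL.py do internally. It assumes triples are represented as (subject, predicate, object) tuples of strings and that blank nodes are written "_:label"; both conventions are ours.

```python
def _rename(term, index):
    """Make a blank-node label unique to the graph it came from."""
    if term.startswith("_:"):
        return "_:g%d_%s" % (index, term[2:])
    return term


def rdf_merge(graphs):
    """Merge an iterable of graphs (iterables of triples) into one set of triples."""
    merged = set()
    for index, graph in enumerate(graphs):
        for s, p, o in graph:
            # Predicates are never blank nodes, so only s and o are renamed.
            merged.add((_rename(s, index), p, _rename(o, index)))
    return merged


# Two transform outputs that reuse the blank-node label "_:b".
g1 = [("_:b", "ex:name", '"part1"'), ("ex:w", "ex:id", '"120"')]
g2 = [("_:b", "ex:name", '"part2"'), ("ex:w", "ex:id", '"120"')]
merged = rdf_merge([g1, g2])
```

The shared ground triple appears once in the result, while the two blank nodes remain separate, so the merged model holds three triples.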
4.2. Usage
The transforms must be XSL, and should generate RDF, but they do not need to be managed by
Defuddle. The user must write or find them, and insert the GRDDL annotations.
The results must be emitted by Defuddle, presumably as serialized RDF. If we want the user to
be able to specify options (notably, to select a serialization for the RDF), we might need to add
parameters to Defuddle.
5. Software Test Plan
The GRDDL standard defines an extensive set of test cases [2]. Since the GRDDL
implementations we build on already pass these tests, we do not need to repeat them.
The Defuddle-GRDDL extension needs a few test cases to ensure that the wrapper works. In general,
the presence of GRDDL markup should not affect the XML produced by Defuddle, so any existing
tests and examples should work as before.
The following cases should be tested:
       1. GRDDL markup not present. Defuddle should produce normal output, GRDDL phase
          should produce no output (or not execute at all).
       2. Correct GRDDL is present. Defuddle should produce normal output, GRDDL phase
          should produce correct output.
             a. GRDDL transform is in local XSL file.
             b. GRDDL transform is in remote XSL.
             c. Multiple GRDDL transforms are present. The result of the GRDDL phase should
                be the RDF merge of the individual outputs.


           Note: these tests require construction of test XSLT appropriate to test schemas.
       3. Incorrect GRDDL. Defuddle should produce normal output, GRDDL phase should
          produce error messages.
               a. Bad syntax in GRDDL markup. This may cause Defuddle to report errors.
               b. Transform not found (XSL is a bad URL).
               c. Faulty XSL in transform. Errors from the GRDDL processing must be caught
                  and reported.
               d. Output of transform is not RDF.
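A few of the cases above (1, 2c, and 3c) can be sketched as unit tests against a stand-in for the GRDDL stage. Everything here is hypothetical scaffolding: grddl_stage is not the Defuddle wrapper, and fake_transform stands in for applying a real XSL stylesheet, which the Python standard library cannot do. The point is only the shape of the checks: no markup gives no output, multiple transforms give the union of their results, and a faulty transform is caught and reported.

```python
import unittest
import xml.etree.ElementTree as ET

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


def grddl_stage(xml_text, run_transform):
    """Run each transformation named on the root element; return (triples, errors)."""
    root = ET.fromstring(xml_text)
    uris = root.get("{%s}transformation" % GRDDL_NS, "").split()
    triples, errors = set(), []
    for uri in uris:
        try:
            # Plain set union; adequate here because the fake triples
            # contain no blank nodes.
            triples.update(run_transform(uri, xml_text))
        except Exception as exc:
            errors.append((uri, str(exc)))
    return triples, errors


def fake_transform(uri, xml_text):
    """Stand-in for applying an XSL stylesheet (assumption for testing only)."""
    if uri == "bad.xsl":
        raise ValueError("faulty XSL")
    return {("ex:w", "ex:from", uri)}


def doc(transformations):
    return '<w xmlns:g="%s" g:transformation="%s"/>' % (GRDDL_NS, transformations)


class GrddlStageTests(unittest.TestCase):
    def test_no_markup_produces_no_output(self):       # case 1
        self.assertEqual(grddl_stage("<w/>", fake_transform), (set(), []))

    def test_multiple_transforms_are_merged(self):     # case 2c
        triples, errors = grddl_stage(doc("a.xsl b.xsl"), fake_transform)
        self.assertEqual(triples, {("ex:w", "ex:from", "a.xsl"),
                                   ("ex:w", "ex:from", "b.xsl")})
        self.assertEqual(errors, [])

    def test_faulty_transform_is_reported(self):       # case 3c
        triples, errors = grddl_stage(doc("a.xsl bad.xsl"), fake_transform)
        self.assertEqual(triples, {("ex:w", "ex:from", "a.xsl")})
        self.assertEqual(errors, [("bad.xsl", "faulty XSL")])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(GrddlStageTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Cases 2a, 2b, 3b, and 3d depend on real stylesheets and network access and are not covered by this sketch.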
6. Follow-on: Semantic Descriptions
The Defuddle-GRDDL capability extends Defuddle to generate RDF from arbitrary data sources.
This opens the way to extract semantic descriptions of the data (e.g., as suggested in [3, 4]). This
would be accomplished by creating a DFDL annotated XML schema plus one or more XSLT
transformations to extract semantic descriptions. Ideally, some of the XSLT can be reused (for
example, a common transformation, rdf-ntriples.xsl, was used for all provenance challenge data).
However, the extraction requires case-by-case coding of the relationship of XML structures to
the desired RDF, i.e., in XPath and other XSL.
Once the XSLT is in hand, Defuddle will automatically generate the RDF defined in the
transforms. The resulting RDF should be suitable for semantic queries, and can mix with
metadata from many sources. Presumably the RDF would be passed to semantic infrastructure
such as Tupelo (http://tupeloproject.org).
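To give a feel for the case-by-case mapping from XML structures to RDF, the following sketch expresses one such mapping in Python rather than XSL. It is an analogy only: in the proposed design this logic would live in an XSL stylesheet named by the GRDDL markup. The vocabulary URI, the base URI, and the "module" child element are invented for the example; only the "workflow" element with its id and name attributes comes from Figure 1.

```python
import xml.etree.ElementTree as ET

EX = "http://example.org/terms#"          # hypothetical vocabulary
BASE = "http://example.org/workflow/"     # hypothetical base for resource URIs


def workflow_to_triples(xml_text):
    """Map a workflow document to (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    subject = BASE + root.get("id")
    # The name attribute becomes a literal-valued property of the workflow.
    triples = [(subject, EX + "name", root.get("name"))]
    # Each (hypothetical) module child becomes a linked resource.
    for module in root.findall("module"):
        triples.append((subject, EX + "hasModule", BASE + "module/" + module.get("id")))
    return triples


triples = workflow_to_triples('<workflow id="120" name="part1"><module id="7"/></workflow>')
```

Every choice in the function (which attribute identifies the subject, which elements become properties) is specific to this one schema, which is why such extractors have to be written per schema.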
References
[1]    D. Connolly, "Gleaning Resource Descriptions from Dialects of Languages (GRDDL),"
       W3C Recommendation, 11 September 2007.
[2]    C. Ogbuji (ed.), "GRDDL Test Cases," W3C Recommendation, 11 September 2007.
[3]    P. Watry, "Digital Preservation Theory and Application: Transcontinental Persistent
       Archives Testbed Activity," The International Journal of Digital Curation, vol. 2, pp. 41-
       68, 2007.
[4]    V. Heydegger, J. Neumann, J. Schnasse, and M. Thaller, "Planets: Basic design for the
       extensible characterisation languages," HKI, Cologne, October 31, 2006.



