   The graph-based data model:
Storing and manipulating data in
       distributed graphs

    (Using RDF and Jena to put the
     SparQL in your smile, and the
          Twinkle in your eye
            …and D2R too)


              Michael Grobe

       Biomedical Applications Group
           Research Technologies
 University Information Technology Services
               Indiana University

                  This presentation in perspective
This is actually one of a series of presentations on Linked Data Web
and graph database technologies:
- Introduction to ontologies
- This presentation on RDF, Jena, SparQL, etc
- OWL and inference over ontologies
- Using graph technologies in bioinformatics research

In general, these topics appear “simple”, but are fraught with
complications, limitations, and qualifications…especially when the
casual user attempts to compare them with relational data
approaches to the same or similar problems.
In addition, this is a pretty big elephant surrounded by a lot of “blind
men”.

As a result, this presentation is a survey of concrete examples of
basic components used to manipulate data stored as graphs,
or “appearing to have been” stored in graphs. It will use the Gene
Ontology for some of these examples.
                  Table of Contents
Using graphs to represent data
Using RDF to represent graphs
Jena – a Java class library for manipulating RDF
Using SparQL graph templates to query RDF
Using Twinkle to make SparQL queries
Using iSPARQL graphical graph templates to query RDF
Exposing relational data as RDF
Thinking of SparQL queries as SQL

                Table of Non-contents
OWL and inference over ontologies
Using the Semantic Web in bioinformatics research
             Using graphs to represent data

Here are 2 graphs that represent 2 kinds of information associated
with 4 different persons.

  Graph #1: Person ages         Graph #2: Favorite Friends




             Using graphs to represent data
Here the 2 graphs are combined using named edges to represent
2 kinds of information associated with the same 4 persons.

Graph #3: Person ages (:age) and favorite friends (:fav)




Read these links as “Smith has age 21” or “Jones has favorite
friend Smith” to make them more “sentence-like”. Each arc is like
the “predicate” of a sentence, connecting a “subject” with an
“object”. (Note that a subject may have >= 0 arcs of each type.)
             Using graphs to represent data
Data is sometimes represented using so-called “blank nodes” to
help cluster attributes together.
    Graph #4: Blank nodes linking a name, an age, and a favorite
friend via arcs named :name, :age, and :fav, as follows:




Blank nodes are useful for specifying lists of items, but are
discouraged within the Semantic Web. Use (dereferenceable)
URIs (like “http://www.iu.edu/”) whenever possible.


                   Using URIs and URLs to represent data
Now, if it hadn’t already happened, someone could come up with the idea of using
URLs to point to Web documents that describe the “exact” meaning of each edge.

For example, some popular magazine could publish their definition of “favorite
friend” on a page like

   http://CelebrityMagazine.com/fav

and other documents could define “BFF”, “long-time-friend”, “family-friend”, etc. And,
in fact, these definitions could themselves refer to other definitions, like some
“superset” of relationships such as:

   http://SomeCelebrityMagazine.com/personal_relationships

or the personal_relationships file itself could include a collection of definitions,
including “favorite friend”, or “fav”, that we might refer to as:

   http://SomeCelebrityMagazine.com/personal_relationships#fav

using the # convention for targeting a specific location within a URL.

Of course, for a lot of applications this would all be unnecessary; some URI could
just be used to indicate an edge type known to the file creator.
                   Using RDF-XML to serialize graphs
Graphs can be “serialized” or represented in a textual format. When graphs are
serialized, each connection is represented by 3 components, a so-called RDF
“triple”. Each triple is composed of a “subject”, “predicate” and “object” where
each edge between each pair of entities becomes a named “predicate”.
Each subject is represented as:
         - a blank node, such as “_:b2”, or
         - a URI, like http://fake.host.edu/smith
Each object is represented as:
         - a blank node,
         - a literal value, such as “value”^^type where type is some URI
           that defines a data type, as in “21”^^xsd:integer, or
         - a URI
(Standard RDF does not allow a literal value to appear as a subject.)
Each predicate is represented as:
         - a URI, like http://fake.host.edu/contact-schema#fav, or an
           abbreviated URI like example:age, which represents a URI that will
           be expanded by substituting a value for the string “example:”. If the
           specified URI is dereferenceable to a URL, that URL may identify
           a text file that defines the meaning (or semantics) of the predicate.

(Some writers speak of “object”, “property”, and “property value”.)
               Graph #3 as a set of 12 triples (3 for each person)

           |------------------------------------|
           | Subject  | Predicate    | Object   |
           ======================================
           | “Blake”  | example:fav  | “Blake”  |
           | “Blake”  | example:age  | "12"     |
           | “Blake”  | example:name | "Blake"  |
           |                                    |
           | “Jones”  | example:fav  | “Smith”  |
           | “Jones”  | example:age  | "35"     |
           | “Jones”  | example:name | "Jones"  |
           |                                    |
           | “George” | example:fav  | “Smith”  |
           | “George” | example:age  | "21"     |
           | “George” | example:name | "George" |
           |                                    |
           | “Smith”  | example:fav  | “Jones”  |
           | “Smith”  | example:age  | "21"     |
           | “Smith”  | example:name | "Smith"  |
           --------------------------------------
Here the abbreviation “example:” stands for
        http://fake.host.edu/example-schema#
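
As a rough illustration, the 12 triples above can be held in memory as a simple list and filtered by subject and predicate. The sketch below uses plain Java rather than Jena; the names GraphThreeTriples, Triple, and objectsOf are made up for this example and are not part of any library.

```java
import java.util.List;

// Hypothetical sketch: Graph #3 held as a plain list of (subject, predicate, object) triples.
class GraphThreeTriples {
    // One edge of the graph; all three parts are kept as strings for simplicity.
    record Triple(String subject, String predicate, String object) {}

    // The 12 triples from the Graph #3 table (3 per person).
    static final List<Triple> GRAPH = List.of(
        new Triple("Blake", "example:fav", "Blake"),
        new Triple("Blake", "example:age", "12"),
        new Triple("Blake", "example:name", "Blake"),
        new Triple("Jones", "example:fav", "Smith"),
        new Triple("Jones", "example:age", "35"),
        new Triple("Jones", "example:name", "Jones"),
        new Triple("George", "example:fav", "Smith"),
        new Triple("George", "example:age", "21"),
        new Triple("George", "example:name", "George"),
        new Triple("Smith", "example:fav", "Jones"),
        new Triple("Smith", "example:age", "21"),
        new Triple("Smith", "example:name", "Smith"));

    // Return the objects of every triple with the given subject and predicate.
    static List<String> objectsOf(String subject, String predicate) {
        return GRAPH.stream()
            .filter(t -> t.subject().equals(subject) && t.predicate().equals(predicate))
            .map(Triple::object)
            .toList();
    }

    public static void main(String[] args) {
        System.out.println(objectsOf("Jones", "example:fav")); // prints [Smith]
        System.out.println(objectsOf("Smith", "example:age")); // prints [21]
    }
}
```

Because a subject may have zero or more arcs of each type, objectsOf returns a list rather than a single value.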
  Two ways to represent the Graph #3 triples using RDF-XML

Properties encoded as XML elements:
<rdf:RDF
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:example="http://fake.host.edu/example-schema#">
     <example:Person>
         <example:name>Smith</example:name>
         <example:age>21</example:age>
         <example:fav>Jones</example:fav>
     </example:Person>
</rdf:RDF>

Properties encoded as XML attributes:
<rdf:RDF
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:example="http://fake.host.edu/example-schema#">
     <rdf:Description example:name="Smith"
                      example:age="21"
                      example:fav="Jones">
     </rdf:Description>
</rdf:RDF>

                                    Representing URIs
In work with RDF you will see URIs abbreviated in several ways, using namespace, PREFIX, and
ENTITY definitions, depending on the context:

           xmlns:lib="http://some.host.edu/directory"
or
           PREFIX lib: <http://some.host.edu/directory>
or
           <!ENTITY lib "http://some.host.edu/directory">
If the namespace abbreviations in the first (element-based) example above get “expanded”, then Smith is
conceptually being represented as follows (the expanded form is shown for illustration only;
it is not legal XML):
<rdf:RDF
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
     <http://fake.host.edu/example-schema#Person>
         <http://fake.host.edu/example-schema#name>
             Smith
         </http://fake.host.edu/example-schema#name>
         <http://fake.host.edu/example-schema#age>
             21
         </http://fake.host.edu/example-schema#age>
         <http://fake.host.edu/example-schema#fav>
             Jones
         </http://fake.host.edu/example-schema#fav>
     </http://fake.host.edu/example-schema#Person>
</rdf:RDF>
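
The expansion step itself is simple string substitution. Here is a minimal Java sketch of it, assuming a prefix table with the single “example” entry used in these slides; the class and method names (PrefixExpander, expand) are hypothetical.

```java
import java.util.Map;

// Hypothetical sketch: expand an abbreviated URI such as "example:age" into a full URI.
class PrefixExpander {
    // Prefix table; the single entry is the "example" namespace used in these slides.
    static final Map<String, String> PREFIXES =
        Map.of("example", "http://fake.host.edu/example-schema#");

    // Expand "prefix:local" to a full URI; return the input unchanged if the prefix is unknown.
    static String expand(String name) {
        int colon = name.indexOf(':');
        if (colon < 0) {
            return name; // no prefix at all
        }
        String namespace = PREFIXES.get(name.substring(0, colon));
        return namespace == null ? name : namespace + name.substring(colon + 1);
    }

    public static void main(String[] args) {
        System.out.println(expand("example:age"));
        // prints http://fake.host.edu/example-schema#age
    }
}
```

A full URI such as http://fake.host.edu/smith passes through unchanged, because “http” is not a registered prefix in this table.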
      Graph #3 using “resources” to represent each person

Persons are modeled as “resources” by replacing the strings for each node
identifier with URIs:
   -------------------------------------------------------------------------------
   |            Subject            | Predicate    |           Object             |
   ===============================================================================
   | <http://fake.host.edu/blake> | example:fav | <http://fake.host.edu/blake> |
   | <http://fake.host.edu/blake> | example:age | "12"                           |
   | <http://fake.host.edu/blake> | example:name | "Blake"                       |
   |                                                                             |
   | <http://fake.host.edu/jones> | example:fav | <http://fake.host.edu/smith> |
   | <http://fake.host.edu/jones> | example:age | "35"                           |
   | <http://fake.host.edu/jones> | example:name | "Jones"                       |
   |                                                                             |
   | <http://fake.host.edu/george> | example:fav | <http://fake.host.edu/smith> |
   | <http://fake.host.edu/george> | example:age | "21"                          |
   | <http://fake.host.edu/george> | example:name | "George"                     |
   |                                                                             |
   | <http://fake.host.edu/smith> | example:fav | <http://fake.host.edu/jones> |
   | <http://fake.host.edu/smith> | example:age | "21"                           |
   | <http://fake.host.edu/smith> | example:name | "Smith"                       |
   -------------------------------------------------------------------------------

Here the abbreviation “example:” stands for
        http://fake.host.edu/example-schema#

These URIs need not be dereferenceable.
             Representing entries in Graph #3 as “resources”

                                      Format 1
<rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:example="http://fake.host.edu/example-schema#">
   <example:Person rdf:about="http://fake.host.edu/smith">
       <example:name>Smith</example:name>
       <example:age>21</example:age>
       <example:fav rdf:resource="http://fake.host.edu/jones" />
   </example:Person>
</rdf:RDF>
       - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

                                      Format 2
<rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:example="http://fake.host.edu/example-schema#">
   <rdf:Description rdf:about="http://fake.host.edu/smith"
            example:name="Smith" example:age="21">
      <example:fav rdf:resource="http://fake.host.edu/jones" />
   </rdf:Description>
</rdf:RDF>

Note that the resource URI references in this example are not “real” documents; they
are not “dereferenceable”.
                A person record using FOAF (from Obitko)
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns="http://www.example.org/~joe/contact.rdf#">

 <foaf:Person rdf:about= "http://www.example.org/~joe/contact.rdf#joesmith">
     <foaf:mbox rdf:resource="mailto:joe.smith@example.org"/>
     <foaf:homepage rdf:resource="http://www.example.org/~joe/"/>
     <foaf:family_name>Smith</foaf:family_name>
     <foaf:givenname>Joe</foaf:givenname>
 </foaf:Person>

</rdf:RDF>




      RDF summary and implications for the Semantic Web

A graph may be represented as a collection of triples.
RDF-XML representations of graphs will contain URIs that:
 - serve to identify and/or reference syntactic elements (they define
   tag names), and
 - identify and/or name “resources”: subjects, predicates and/or objects.

Such URIs may be “imaginary” or provide addresses of actual,
“dereferenceable”, web documents, in possibly remote locations.

This can result in a “Giant Global Graph”, usually known as the Linked
Data Web or the “Semantic Web,” with RDF as one of W3C’s Semantic
Web architectural levels.
“If HTML and the Web make all online documents look like one huge book,
RDF, schema, and inference languages will make all the data in the world
look like one huge database.” TimBL
[Editor’s note: Here TimBL is using the term “schema” to refer to an RDF
schema that defines RDF triples much more loosely than a relational
database schema defines a collection of tables in a database.]

                 RDF graphs may be interrogated:

- by physical inspection (for anyone willing to read XML)

- by writing programs that read RDF files, construct the
       represented graphs internally, and then
     - access graph triples in sequential order,
     - select triples according to specified content, and/or
     - apply SparQL queries and access results in sequential order

- using command-line tools that apply SparQL queries

- using GUI interfaces accepting SparQL queries
     - written in text, or
     - represented graphically

- via URLs carrying form data, or SOAP requests to SparQL endpoints
        How to query an RDF graph using Jena
The Java-based Jena package from HP Labs allows users to
manipulate and query graphs, and import/export RDF, etc.

You can write a program that uses Jena classes to
 - retrieve and parse an RDF file containing a graph or a
   collection of graphs,
 - store it in memory, and then
 - examine each triple in turn, examine one component (say,
   the subject) of each triple in turn, or examine only triples that
   meet specified criteria.

For example, one might examine each stored triple searching for a
specific reference URI, or for a specific literal value.

One might look for persons of a specific age, “21”^^xsd:integer, in the
object portion of each triple.

Jena also provides support for inference using rule sets and for
querying via SparQL.
                              Jena example
In Jena, RDF nodes can have type “Resource”, “URI Resource”, “literal”, or
anonymous (slight extension to standard RDF).
A Jena model is created by a factory:
         Model m = ModelFactory.createDefaultModel();

A Jena ontological model is a model along with a “reasoner” (sic):
         OntModel m = ModelFactory.createOntologyModel();

Jena can
  - read in an RDF serialized graph (from a file, URL, etc.),
  - write a serialized model to a file or STDOUT,
  - perform standard operations on the model. For example, given the
           populated models m and n, Jena can then do:
        Model x = m.union( n );             // Union
        Model y = m.difference( n );        // Set difference
        Model z = m.intersection( n );      // Set intersection
    (m.add( n ) and m.remove( n ) also exist, but they modify m in place
     rather than building a new model), and
  - save and retrieve model data in/from a database
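
Since a model is in effect a set of triples, the three operations behave like ordinary set operations. The following plain-Java sketch (not Jena code; ModelSetOps and its methods are names invented for this illustration) shows the idea on two tiny “models”.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a "model" treated as a set of triples, with the three set operations.
class ModelSetOps {
    record Triple(String s, String p, String o) {}

    // All triples that appear in m, in n, or in both.
    static Set<Triple> union(Set<Triple> m, Set<Triple> n) {
        Set<Triple> result = new HashSet<>(m);
        result.addAll(n);
        return result;
    }

    // Triples that appear in m but not in n.
    static Set<Triple> difference(Set<Triple> m, Set<Triple> n) {
        Set<Triple> result = new HashSet<>(m);
        result.removeAll(n);
        return result;
    }

    // Triples that appear in both m and n.
    static Set<Triple> intersection(Set<Triple> m, Set<Triple> n) {
        Set<Triple> result = new HashSet<>(m);
        result.retainAll(n);
        return result;
    }

    public static void main(String[] args) {
        Set<Triple> m = Set.of(new Triple("smith", "example:age", "21"),
                               new Triple("smith", "example:fav", "jones"));
        Set<Triple> n = Set.of(new Triple("smith", "example:fav", "jones"),
                               new Triple("jones", "example:age", "35"));
        System.out.println(union(m, n).size());        // prints 3
        System.out.println(difference(m, n).size());   // prints 1
        System.out.println(intersection(m, n).size()); // prints 1
    }
}
```

Each method copies its first argument before changing anything, so neither input set is modified.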
                 Reading and writing a model in Jena

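// Jena 2.x package names (Jena 2.5.7 is used in these examples).
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.util.FileManager;
import java.io.InputStream;

String inputFileName = "Some-GO-entries-diddled.rdf";

Model model = ModelFactory.createDefaultModel();
InputStream in = FileManager.get().open( inputFileName );
if( in == null )
{
   throw new IllegalArgumentException( "File not found: " + inputFileName );
}

model.read( in, "" );       // "" is the base URI used to resolve relative URIs.

model.write( System.out, "N-TRIPLE" );   // or "RDF/XML" or "RDF/XML-ABBREV"

    // which will yield N-TRIPLE, RDF/XML, or RDF/XML-ABBREV records.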



        Canonical process to examine each triple in a model
StmtIterator iterator = model.listStatements(); // Each statement is one triple
while( iterator.hasNext() )
{
    Statement statement = iterator.nextStatement();
    Resource subject = statement.getSubject();
    Property predicate = statement.getPredicate();
    // Get the object, which in this example may be a Resource or just a string, so
    // it is kept in an RDFNode, a supertype of Resource and Literal.
    RDFNode object = statement.getObject();
    // Now process the triple; here it is just printed.
    System.out.print( subject.toString() );
    System.out.print( " " + predicate.toString() );
    if( object instanceof Resource )
    {   // It's a resource.
        System.out.print( " " + object.toString() );
    }
    else
    {   // It's a literal that will be printed with surrounding quotes.
        System.out.print( " \"" + object.toString() + "\"" );
    }
    System.out.println( " ." );   // End the triple and start a new line.
}
      Statement iterators for accessing selected components
There are several methods for creating iterators over a model:

 - Some simply list the components of each triple:
        - model.listSubjects();
        - model.listObjects();

 - Some compare a specific component with a specified value, as in:
         model.listSubjectsWithProperty( Property p, RDFNode o );
           (which will get you a collection of subjects possessing
                  property/predicate p and specific value o)

 - Some compare all components against specific values in 2 steps:
        - define a “selector” possessing specific values s, p and o,
          where null or (RDFNode) null matches anything:

                    Selector selector = new SimpleSelector( subject,
                                        predicate, object );

          - and then build the statement list:

                    model.listStatements( selector );
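
The “null matches anything” rule can be pictured with a small stand-alone sketch. This is plain Java, not Jena’s SimpleSelector; the Triple and Selector types and the listStatements helper below are hypothetical stand-ins written only to show the matching logic.

```java
import java.util.List;

// Hypothetical sketch of selector-style matching, where null acts as a wildcard.
class SelectorSketch {
    record Triple(String subject, String predicate, String object) {}

    // A pattern over (subject, predicate, object); a null component matches any value.
    record Selector(String subject, String predicate, String object) {
        boolean matches(Triple t) {
            return (subject == null || subject.equals(t.subject()))
                && (predicate == null || predicate.equals(t.predicate()))
                && (object == null || object.equals(t.object()));
        }
    }

    // Keep only the triples accepted by the selector.
    static List<Triple> listStatements(List<Triple> model, Selector selector) {
        return model.stream().filter(selector::matches).toList();
    }

    public static void main(String[] args) {
        List<Triple> model = List.of(
            new Triple("george", "example:age", "21"),
            new Triple("smith", "example:age", "21"),
            new Triple("jones", "example:age", "35"));
        // Any subject whose example:age is "21".
        Selector age21 = new Selector(null, "example:age", "21");
        System.out.println(listStatements(model, age21).size()); // prints 2
    }
}
```

A selector with all three components set to null accepts every triple, which is the same as listing all statements.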

           SparQL: a graph-based query language
SparQL is a language that lets users query RDF graphs . . . using graph
patterns (written in N3) containing variables.
The query engine will return an exhaustive list of the variable bindings
(“solutions”) for which the pattern matches triples in the graph, found
through value substitution (aka “query by example”, QBE).
This process is not always intuitive, and/or “SQL has perverted the minds of
a generation of programmers” (J. Random Guy somewhere on the Web).
SparQL is implemented in Jena through the ARQ package, and queries may
be made from within Java programs (McCarthy, 2005), or via a SparQL client
distributed with Jena. The process to make a query is:

         - build a query in a .rq file, and
         - execute the query using:

                   sparql --query filename.rq
or
                   sparql.bat --query filename.rq

SparQL does not do inference (except when used within Jena against an
ontological model).
                        A SparQL example
This SparQL example query simply asks for a list of up to 10 triples from the
file specified in the FROM clause (no ORDER BY is given, so which 10 is up to
the query engine):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX example: <http://fake.host.edu/example-schema#>

select $s $p $o
from <http://kongo.uits.iupui.edu:8546/rdf-example-1.rdf>
where
{
  $s $p $o .
}
LIMIT 10

$s, $p, and $o are variable names that will each be assigned a value as the
query is “satisfied.” Variable names may also start with “?”.
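
To make “assigned a value as the query is satisfied” more concrete, here is a minimal Java sketch of matching a single triple pattern against a list of triples. It is only an illustration of the idea, not how ARQ is implemented; PatternMatchSketch, bind, and select are hypothetical names, and the sample data uses shortened identifiers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: match one triple pattern with variables against a list of triples.
class PatternMatchSketch {
    record Triple(String s, String p, String o) {}

    // Terms beginning with '$' or '?' are variables; anything else must match exactly.
    static boolean isVariable(String term) {
        return term.startsWith("$") || term.startsWith("?");
    }

    // Try to bind one pattern term to one value; returns false on a mismatch.
    static boolean bind(String term, String value, Map<String, String> bindings) {
        if (!isVariable(term)) {
            return term.equals(value);
        }
        // A variable used twice in the pattern must take the same value both times.
        String previous = bindings.putIfAbsent(term.substring(1), value);
        return previous == null || previous.equals(value);
    }

    // Return one binding map (a "solution") per matching triple, stopping at 'limit'.
    static List<Map<String, String>> select(List<Triple> graph, Triple pattern, int limit) {
        List<Map<String, String>> solutions = new ArrayList<>();
        for (Triple t : graph) {
            if (solutions.size() >= limit) {
                break;
            }
            Map<String, String> bindings = new LinkedHashMap<>();
            if (bind(pattern.s(), t.s(), bindings)
                    && bind(pattern.p(), t.p(), bindings)
                    && bind(pattern.o(), t.o(), bindings)) {
                solutions.add(bindings);
            }
        }
        return solutions;
    }

    public static void main(String[] args) {
        List<Triple> graph = List.of(
            new Triple("smith", "example:name", "Smith"),
            new Triple("smith", "example:age", "21"),
            new Triple("smith", "example:fav", "jones"));
        // Similar in spirit to: select $s $o where { $s example:age $o . } LIMIT 10
        System.out.println(select(graph, new Triple("$s", "example:age", "$o"), 10));
        // prints [{s=smith, o=21}]
    }
}
```

The pattern ($s, $p, $o) matches every triple, which is why the query on this slide simply lists triples until the LIMIT is reached.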
                  SparQL: a graph-based query language
The basic, partial syntax of a SparQL query is based on N3(/turtle) and similar to:

 BASE <some URI from which relative FROM and PREFIX entries will be offset>
 PREFIX prefix_abbreviation: < some_URI >
 SELECT some_variable_list
 FROM <some_RDF_source >
 WHERE
 {
    { some_triple_pattern . } .
 }


Notes:

 - the “<“ and “>” characters are required literals,
 - the BASE and PREFIX entries are optional and BASE applies to relative
           URIs appearing in either PREFIX or FROM clauses,
 - SELECT may be replaced by CONSTRUCT, ASK, or DESCRIBE,
 - * is a valid variable list, and may be preceded by DISTINCT,
 - there may be multiple FROM clauses, which will be treated as a single store,
 - the keyword WHERE is optional, and
 - while this syntax resembles SQL, it has very different semantics.
         Querying Graph #3 format 1 using SparQL

Here’s a reminder of one of the representations used to store Graph #3,
here stored in a file named rdf-example-1.rdf:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:example="http://fake.host.edu/example-schema#"
>
  <example:Person rdf:about="http://fake.host.edu/smith">
    <example:name>Smith</example:name>
    <example:age>21</example:age>
    <example:fav rdf:resource="http://fake.host.edu/jones" />
  </example:Person>

</rdf:RDF>



         A SparQL query against the first data representation

C:\Jena-2.5.7\Jena-2.5.7\bat> cat query-example-1.rq
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX example: <http://fake.host.edu/example-schema#>

select *
from <http://kongo.uits.iupui.edu:8546/smith-format-1.rdf>
where
{
  $s $p $o .
}

C:\Jena-2.5.7\Jena-2.5.7\bat> sparql.bat --query query-example-1.rq
------------------------------------------------------------------------------
| s                            | p            | o                            |
==============================================================================
| <http://fake.host.edu/smith> | example:fav | <http://fake.host.edu/jones> |
| <http://fake.host.edu/smith> | example:age | "21"                          |
| <http://fake.host.edu/smith> | example:name | "Smith"                      |
| <http://fake.host.edu/smith> | rdf:type     | example:Person               |
-------------------------------------------------------------------------------

           (Were you expecting a graph? CONSTRUCT will produce one.)
           Querying Graph #3 format 2 using SparQL

Here’s a reminder of the other representation of Graph #3 stored in a file
named rdf-example-2.rdf:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:example="http://fake.host.edu/example-schema#"
>
  <rdf:Description rdf:about="http://fake.host.edu/smith"
        example:name="Smith"
        example:age="21">
     <example:fav rdf:resource="http://fake.host.edu/jones" />
  </rdf:Description>

</rdf:RDF>
   The same SparQL query against the second data representation

C:\Jena-2.5.7\Jena-2.5.7\bat> cat query-example-2.rq
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX example: <http://fake.host.edu/example-schema#>

select *
from <http://kongo.uits.iupui.edu:8546/smith-format-2.rdf>
where
{
  $s $p $o .
}

C:\Jena-2.5.7\Jena-2.5.7\bat> sparql.bat --query query-example-2.rq
------------------------------------------------------------------------------
| s                             | p            | o                           |
==============================================================================
| <http://fake.host.edu/smith> | example:fav | <http://fake.host.edu/jones> |
| <http://fake.host.edu/smith> | example:age | "21"                          |
| <http://fake.host.edu/smith> | example:name | "Smith"                      |
------------------------------------------------------------------------------




  A “distributed” SparQL query against 4 separate RDF files

The next query searches 4 dereferenceable files holding “live” data
in the first representation format above:
C:\Jena-2.5.7\Jena-2.5.7\bat> cat query-example-all.rq

PREFIX   rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX   example: <http://fake.host.edu/example-schema#>

select *
from <http://kongo.uits.iupui.edu:8546/smith>
from <http://kongo.uits.iupui.edu:8546/jones>
from <http://kongo.uits.iupui.edu:8546/george>
from <http://kongo.uits.iupui.edu:8546/blake>
where
{
  $s $p $o .
}

If the data were all in one file, only one FROM clause would have
been required.
                         Results of the distributed SparQL query
C:\Jena-2.5.7\Jena-2.5.7\bat> sparql.bat --query query-example-all.rq

-------------------------------------------------------------------------------------------
| s                                    | p            | o                                   |
============================================================================================
| <http://kongo.uits.iupui.edu/blake> | example:fav | <http://kongo.uits.iupui.edu/blake>|
| <http://kongo.uits.iupui.edu/blake> | example:age | "12"                                  |
| <http://kongo.uits.iupui.edu/blake> | example:name | "Blake"                              |
| <http://kongo.uits.iupui.edu/blake> | rdf:type      | example:Person                      |
| <http://kongo.uits.iupui.edu/jones> | example:fav | <http://kongo.uits.iupui.edu/smith>|
| <http://kongo.uits.iupui.edu/jones> | example:age | "35"                                  |
| <http://kongo.uits.iupui.edu/jones> | example:name | "Jones"                              |
| <http://kongo.uits.iupui.edu/jones> | rdf:type      | example:Person                      |
| <http://kongo.uits.iupui.edu/george> | example:fav | <http://kongo.uits.iupui.edu/smith>|
| <http://kongo.uits.iupui.edu/george> | example:age | "21"                                 |
| <http://kongo.uits.iupui.edu/george> | example:name | "George"                            |
| <http://kongo.uits.iupui.edu/george> | rdf:type     | example:Person                      |
| <http://kongo.uits.iupui.edu/smith> | example:fav | <http://kongo.uits.iupui.edu/jones>|
| <http://kongo.uits.iupui.edu/smith> | example:age | "21"                                  |
| <http://kongo.uits.iupui.edu/smith> | example:name | "Smith"                              |
| <http://kongo.uits.iupui.edu/smith> | rdf:type      | example:Person                      |
--------------------------------------------------------------------------------------------


In this query all 4 files were searched as if they were in a single file.

Note that the URI contents are different in this live example.

                          The magic of “ontologies”

There are many definitions of “ontology”, but in very general terms, an “ontology”
may be thought of as a “taxonomy” of objects (or concepts) based on a particular
relationship between pairs of those objects (or concepts).

A common example of a taxonomy is an “evolutionary tree” in which individual
species are related on the basis of evolutionary descent.

That is, one species of each pair connected by an edge descended from the other.
(Actually, it’s the members of the species who evolve, but . . .)

Within such structures no member is considered to have descended from more than
one immediate species.

Within an ontology, however, an object or concept may have more than one
immediate “parent,” and no circular sub-graphs are allowed, so the resulting
structure is a Directed Acyclic Graph (DAG).

An ontology can be represented by a “special” RDF graph: It is special in that the
predicates convey transitivity:

if A is a descendant of B, and B is a descendant of C, then A is a descendant of C.

IS_A and PART_OF relationships are commonly used to build ontologies.
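
Transitivity means that all ancestors of a term can be collected by repeatedly following its direct is_a edges. The Java sketch below does this for a tiny, made-up DAG; the terms A, B, C, D and the names IsAClosure and ancestors are assumptions chosen for illustration, with A given two parents to show that the structure is a DAG rather than a tree.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: collect every ancestor of a term by following direct is_a edges.
class IsAClosure {
    // Return all terms reachable from 'start' through one or more is_a edges.
    static Set<String> ancestors(Map<String, List<String>> isA, String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>(isA.getOrDefault(start, List.of()));
        while (!todo.isEmpty()) {
            String term = todo.pop();
            // Visit each term once, even when several paths lead to it.
            if (seen.add(term)) {
                todo.addAll(isA.getOrDefault(term, List.of()));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Direct edges only (child -> parents): A is_a B, A is_a D, B is_a C.
        Map<String, List<String>> isA = Map.of(
            "A", List.of("B", "D"),
            "B", List.of("C"));
        // C is reported as an ancestor of A although only A->B and B->C were stated.
        System.out.println(ancestors(isA, "A").contains("C")); // prints true
    }
}
```

An inference engine applying a transitivity rule would, in effect, add the derived edge (A is_a C) to the graph so that ordinary queries can find it.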
Here is a portion of the GO is_a DAG (Ashburner, 2004) for
molecular function (example: “chromatin binding” is_a “DNA
binding”)




(Note that this diagram shows some genes, but the Gene Ontology is actually a taxonomy of
terms that can be used to describe or annotate genes, rather than a taxonomy of genes. )
Here’s the first entry (of the ~26K) in the GO text
version (with all three parts intermixed)
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including
the mitochondrial genome, into daughter cells after
mitosis or meiosis, mediated by interactions between
mitochondria and the cytoskeleton." [GOC:mcc,
PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

You can also get the GO as RDF XML, or as a MySQL
database. A portion of the molecular function extract on the
previous page is shown in RDF XML on the next page:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:go="http://www.geneontology.org/dtds/go.dtd#">
 <go:term rdf:about="http://www.geneontology.org/go#all"> <!-- Note: “all” is like “root”. -->
  <go:accession>all</go:accession>
  <go:name>all</go:name>
  <go:definition>This term is the most general term possible</go:definition>
 </go:term>

<go:term rdf:about="http://www.geneontology.org/go#GO:0003674">
  <go:accession>GO:0003674</go:accession>
  <go:name>molecular_function</go:name>
  <go:synonym>GO:0005554</go:synonym>
  <go:synonym>molecular function</go:synonym>
  <go:definition>Elemental activities, such as catalysis or binding, describing the actions of a gene
    product at the molecular level. A given gene product may exhibit one or more molecular
    functions.</go:definition>
  <go:is_a rdf:resource="http://www.geneontology.org/go#all" />
 </go:term>
 <go:term rdf:about="http://www.geneontology.org/go#GO:0005488">
  <go:accession>GO:0005488</go:accession>
  <go:name>binding</go:name>
  <go:synonym>ligand</go:synonym>
  <go:definition>The selective, often stoichiometric, interaction of a molecule with one or more
    specific sites on another molecule.</go:definition>
  <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0003674" />
 </go:term>
</rdf:RDF>
      Find parents of GO:0004003 in the example GO subset
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX go: <http://www.geneontology.org/dtds/go.dtd#>
select *
from <http://discern.uits.iu.edu:8421/Some-GO-entries-diddled.rdf>
where
{
  <http://www.geneontology.org/go#GO:0004003> go:is_a $parent .
}
Result:
C:\Jena-2.5.7\bat> sparql.bat --query GO-paths-from-4003.rq
-----------------------------------------------
| parent                                      |
===============================================
| <http://www.geneontology.org/go#GO:0008094> |
| <http://www.geneontology.org/go#GO:0008026> |
| <http://www.geneontology.org/go#GO:0003678> |
-----------------------------------------------
  (This query takes about 30 seconds on the whole GO RDF-XML file, using the
         java parameters -Xss4m -Xms30m -Xmx1500m)
      Find all 3-element paths up from GO:0004003


PREFIX go: <http://www.geneontology.org/dtds/go.dtd#>

select *

from
  <http://discern.uits.iu.edu:8421/Some-GO-entries-diddled.rdf>

where
{
  <http://www.geneontology.org/go#GO:0004003> go:is_a $a .
  $a go:is_a $b .
  $b go:is_a $c .
}

Note that given a table showing the GO DAG, you can get this result
within SQL using multiple joins, but you can’t find paths of arbitrary
length N in either language (unless you use inference within SparQL).
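One common workaround is to do the iteration in host code rather than in the query language. The Python sketch below is only an illustration under that assumption: it keeps a handful of is_a triples in memory (the edges shown on the earlier slides, with URIs abbreviated to GO IDs for brevity) and repeats the single-hop pattern match N times, or until no new ancestors turn up. The function names are invented for this example.

```python
from collections import deque

IS_A = "go:is_a"

# (child, predicate, parent) triples taken from the slides above,
# with full URIs abbreviated to GO IDs.
TRIPLES = [
    ("GO:0004003", IS_A, "GO:0008094"),
    ("GO:0004003", IS_A, "GO:0008026"),
    ("GO:0004003", IS_A, "GO:0003678"),
    ("GO:0005488", IS_A, "GO:0003674"),
    ("GO:0003674", IS_A, "all"),
]


def parents(triples, node):
    """Match the pattern `node go:is_a $parent` and return the bindings of $parent."""
    return [o for s, p, o in triples if s == node and p == IS_A]


def paths_up(triples, start, n):
    """Return every n-element path of is_a hops leading up from start."""
    paths = [(start,)]
    for _ in range(n):
        # Extend each partial path by one hop; paths with no parent drop out.
        paths = [path + (parent,) for path in paths
                 for parent in parents(triples, path[-1])]
    return [path[1:] for path in paths]


def ancestors(triples, start):
    """Return all nodes reachable from start by any number of is_a hops."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents(triples, node):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(paths_up(TRIPLES, "GO:0005488", 2))
print(sorted(ancestors(TRIPLES, "GO:0005488")))
```

Each pass of the loop in paths_up plays the role of one more `$x go:is_a $y .` line in the query, which is why a fixed query text can only cover a fixed N.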
    Find all 3-element paths up from GO:0004003 using Twinkle




(A similar query found that 19 GO classes have the maximum hop count of 15 to the GO root.)
             Query dbpedia for entries about “Goethe”
                (using Virtuoso iSparql text query)




Note that the predicate “bif:contains” is a Virtuoso “Built-In Function”
that searches back-end text indexes. It might be possible to search
using a standard SparQL regex FILTER, but it would be much slower.
   The same query using the iSparql “graphical” QBE interface

Here is the same query in graphical form as constructed using the
iSparql QBE interface:




Components can be dragged-and-dropped from the menu at the top of
the window. The whole interactive window is shown on the next page.
The same query within the whole iSparql QBE window




Results from the iSparql text and/or QBE queries




                     Possible applications for ontologies

Suppose uniprot.org provides a list of 89K proteins, their mappings to NCBI Gene IDs,
and their GO annotations (which it does), and perhaps a small subset looks like:

     XXX        GO:0003682
     YYY        GO:0003682
     ZZZ        GO:0008026
     AAA        GO:0008096

And suppose geneontology.org links GO IDs with GO category names, which it does,

And suppose I have a list of researchers and their various areas of interest, like

     Smith studies gene XXX
     Jones studies nucleic acid binding
     etc.

Then . . . what kinds of questions can I ask that would have been difficult before, like:

     What research interests do Smith and Jones have in common?
     Who might be interested in collaborating on DNA helicase activity?
                      Optional clauses in SparQL queries

SparQL has more features than presented so far.

Here are some clauses permitted following the “where” clause:



order by [DESC|ASC| ] ( variable_list )

limit n: print up to n return values.

offset n: skip the first n return values, so output starts with value n+1.
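For readers more used to general-purpose languages, these three modifiers behave roughly like a sort followed by a slice. The Python sketch below shows that analogy on a list of hypothetical result rows; the row values and the helper name order_limit_offset are placeholders invented for illustration, not part of any SparQL library.

```python
# Hypothetical result rows (variable bindings); the names are placeholders.
rows = [{"name": "delta"}, {"name": "alpha"}, {"name": "charlie"}, {"name": "bravo"}]


def order_limit_offset(rows, key, descending=False, limit=None, offset=0):
    """Mimic `order by`, `offset` and `limit` applied in that order."""
    ordered = sorted(rows, key=lambda row: row[key], reverse=descending)
    end = None if limit is None else offset + limit
    # offset skips rows from the front; limit caps how many are returned.
    return ordered[offset:end]


# Roughly: order by ASC($name) limit 2 offset 1
page = order_limit_offset(rows, "name", limit=2, offset=1)
print([row["name"] for row in page])
```

Without an order by clause the order of results is not guaranteed, so limit and offset are usually only meaningful for paging when combined with it.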




                            Optional clauses in SparQL queries

Permitted within “where” clauses:
FILTER: restricts the variable matches of the enclosing group of triple patterns to those satisfying a filter expression, as in:

     { $s $p $date FILTER ( $date > "2005-01-01T00:00:00Z"^^xsd:dateTime ) }
or
     { $s $p $d FILTER
             ( xsd:dateTime( $d ) < xsd:dateTime( "2005-01-01T00:00:00Z" ) ) }
or
     { ?s ?p ?name FILTER regex( ?name, "^smi", "some_flag" ) }


UNION: “where” clauses may be constructed as
     { triple_pattern_1 } UNION { triple_pattern_2 }
and any RDF element matching either of these triples will be included in the resulting output.

OPTIONAL { triple_pattern }: identifies a triple that will be matched if it appears in an RDF target, but whose
absence will not prohibit a pattern match.


GRAPH: directs a query to apply a triple to a specified RDF source, possibly defined by a variable:
     GRAPH      some_RDF_source { another_triple_pattern . } .
or
     GRAPH some_variable { yet_another_triple_pattern . } .
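To make the FILTER examples above a little more concrete, here is a small Python sketch that applies the same two kinds of test (a case-insensitive regular expression and a dateTime comparison) to variable bindings from an in-memory triple list. Everything in it is an assumption made for illustration: the ex: names and values are placeholders, the helper names are invented, and "i" is used as the regex flag in place of the unspecified flag shown above.

```python
import re
from datetime import datetime, timezone

# Hypothetical (subject, predicate, object) triples; all values are placeholders.
TRIPLES = [
    ("ex:p1", "ex:name", "Smith"),
    ("ex:p2", "ex:name", "Jones"),
    ("ex:p3", "ex:name", "smithers"),
    ("ex:d1", "ex:created", "2004-06-30T12:00:00Z"),
    ("ex:d2", "ex:created", "2006-03-15T08:30:00Z"),
]


def parse_datetime(text):
    """Parse an xsd:dateTime value written in UTC with a trailing Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def match(triples, predicate, keep):
    """Bind $s and $o for `$s predicate $o`, keeping rows whose $o passes the filter."""
    return [(s, o) for s, p, o in triples if p == predicate and keep(o)]


# Roughly: { ?s ex:name ?name FILTER regex( ?name, "^smi", "i" ) }
names = match(TRIPLES, "ex:name",
              lambda o: re.search("^smi", o, re.IGNORECASE) is not None)

# Roughly: { $s ex:created $date FILTER ( $date > "2005-01-01T00:00:00Z"^^xsd:dateTime ) }
cutoff = parse_datetime("2005-01-01T00:00:00Z")
recent = match(TRIPLES, "ex:created", lambda o: parse_datetime(o) > cutoff)

print(names)
print(recent)
```

In both cases the triple pattern produces candidate bindings first and the filter then discards the ones that fail the test.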

     “A relational view of the Semantic Web” (Newman, 2007)

Relaxing certain requirements normally imposed upon SQL (specifically
type constraints on joined fields), there are strong similarities among
operations applied to relational and graph-based models. For example:
 - triple_pattern . triple_pattern
     approximates an “untyped” join, as demonstrated on the next slide
 - filter
     approximates an SQL conditional
 - union
    approximates an outer union
 - optional
    approximates a left outer join( R, S ), which is a
     join( R, S ) unioned with an anti-join( R, S ), where an anti-join is a
     difference with a semi-join, and a semi-join is a
     join and a projection.
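The decomposition of the left outer join can be checked on a toy example. The Python sketch below builds a join, a semi-join (a join projected back onto R's columns), an anti-join (R minus the semi-join) and then the left outer join as their union. The relations R and S hold hypothetical placeholder rows, and the function names are chosen for this example only.

```python
# Relations as lists of dicts; the rows are hypothetical placeholders.
R = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
S = [{"id": 1, "label": "x"}, {"id": 3, "label": "y"}, {"id": 3, "label": "z"}]


def join(r, s, key):
    """Inner join: merge every pair of rows that agree on key."""
    return [{**a, **b} for a in r for b in s if a[key] == b[key]]


def semi_join(r, s, key):
    """Rows of r with at least one match in s: a join projected onto r's columns."""
    columns = list(r[0].keys()) if r else []
    unique = []
    for row in join(r, s, key):
        projected = {c: row[c] for c in columns}
        if projected not in unique:  # projection removes duplicates
            unique.append(projected)
    return unique


def anti_join(r, s, key):
    """Rows of r with no match in s: r minus the semi-join."""
    matched = semi_join(r, s, key)
    return [row for row in r if row not in matched]


def left_outer_join(r, s, key):
    """Join unioned with anti-join; unmatched rows simply lack s's columns."""
    return join(r, s, key) + anti_join(r, s, key)


for row in left_outer_join(R, S, "id"):
    print(row)
```

The row for id 2 survives with no label, which mirrors an unbound variable in the result of an optional pattern.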



                 “A relational view of the Semantic Web” (Newman, 2007)

Here we look at the triple pattern used to find the 3-hop paths towards the GO root node,

     select $a $b $c where
    { <http://www.geneontology.org/go#GO:0004003>
              go:is_a $a .
         $a go:is_a $b .
         $b go:is_a $c .
     }

Which is roughly equivalent to the following SQL query:

       select
           a.parent_id, b.parent_id, c.parent_id
       from
           GO.molecular_function_DAG a
       join
           GO.molecular_function_DAG b
       on
           a.parent_id = b.child_id
       join
           GO.molecular_function_DAG c
       on
           b.parent_id = c.child_id
       where
           a.child_id like 'GO:0004003'
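The SQL form can be tried out with Python's built-in sqlite3 module. The sketch below is a minimal stand-in, not the actual GO database: the table is created without the GO schema prefix, the edges use placeholder node names (n0, n1, ...) rather than real GO terms, the comparison uses = instead of like, and an order by is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE molecular_function_DAG (child_id TEXT, parent_id TEXT)")

# Hypothetical child -> parent edges; placeholder names, not real GO terms.
conn.executemany(
    "INSERT INTO molecular_function_DAG VALUES (?, ?)",
    [("n0", "n1"), ("n0", "m1"), ("n1", "n2"), ("m1", "n2"), ("n2", "n3")],
)

# Same shape as the query above: one self-join per additional hop.
QUERY = """
select a.parent_id, b.parent_id, c.parent_id
from molecular_function_DAG a
join molecular_function_DAG b on a.parent_id = b.child_id
join molecular_function_DAG c on b.parent_id = c.child_id
where a.child_id = ?
order by a.parent_id
"""

rows = conn.execute(QUERY, ("n0",)).fetchall()
for row in rows:
    print(row)
```

Each alias (a, b, c) corresponds to one triple pattern in the SparQL version, and each join condition corresponds to a variable shared between two patterns.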

   Publishing relational data as “virtual” RDF stores
So far we have accessed RDF presented mostly from free-standing
files. However, legacy relational databases can be published as RDF
stores on the Semantic Web by using gateways like D2R and
Virtuoso (commercial).
The D2R approach requires 2 steps:
 - interrogate the database via JDBC using generate-mapping to
build a configuration (mapping) file from the relational table
definitions, and then
 - start the D2R server with the mapping file.
Notes:
 - Each table row becomes a separate resource/graph.
 - Primary keys (if any) become resource identifiers, and
 - rows in linked tables identified by foreign keys may be
       merged into the entity (?).
The D2R utility dump-rdf can also convert an entire table into RDF
form for access in a single SparQL query.

        Accessing data via a “SparQL Endpoint”

Since the D2R server makes a “SparQL endpoint” available, one can
execute queries via HTTP requests like

 http://kongo.uits.iupui.edu:6700/sparql?query=
         select ?s ?p ?o where{ ?s ?p ?o .} limit 10
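In practice the query text has to be URL-encoded before it is sent, since it contains spaces, braces and question marks. The following Python sketch only builds the request URL for the endpoint shown above and does not contact the server; the helper name build_request_url is invented for this example, and whether the server is still reachable is not assumed.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://kongo.uits.iupui.edu:6700/sparql"
QUERY = "select ?s ?p ?o where { ?s ?p ?o . } limit 10"


def build_request_url(endpoint, query):
    """Return endpoint?query=... with the query text percent-encoded."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)

# Decoding the query string recovers the original query text.
print(parse_qs(urlsplit(url).query)["query"][0])
```

The resulting URL could then be fetched with any HTTP client, for example urllib.request.urlopen, if the endpoint is available.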

The D2R server also provides a Web form that can be used to
interrogate its content using SparQL. This interface is based on an
AJAX component called SNORQL, and is available at

 http://kongo.uits.iupui.edu:6700/sparql

The D2R server also provides an interface for users to browse its
backend data. To use it you just “Web in” to:

 http://kongo.uits.iupui.edu:6700


                      Portion of a D2R-server mapping file for CLSD
@prefix map: <file:/C:/d2r-server-0.4/mapping-clsd2-GO-DGN.n3#> .
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
map:database a d2rq:Database;
     d2rq:jdbcDriver "com.ibm.db2.jcc.DB2Driver";
     d2rq:jdbcDSN "jdbc:db2://libra45.uits.iu.edu:50000/clsd2";
     d2rq:username "account";
     d2rq:password "password";
     .
# Table DISEASE_GENE_NET.GENES
map:DISEASE_GENE_NET_GENES a d2rq:ClassMap;
     d2rq:dataStorage map:database;
     d2rq:uriPattern "DISEASE_GENE_NET.GENES/@@DISEASE_GENE_NET.GENES.GENE_ID@@";
     d2rq:class vocab:DISEASE_GENE_NET_GENES;
     .
map:DISEASE_GENE_NET_GENES__label a d2rq:PropertyBridge;
     d2rq:belongsToClassMap map:DISEASE_GENE_NET_GENES;
     d2rq:property rdfs:label;
     d2rq:pattern "DISEASE_GENE_NET.GENES #@@DISEASE_GENE_NET.GENES.GENE_ID@@";
     .
For each field in the table there is an entry like:
     map:GENES_GENE_ID a d2rq:PropertyBridge;
     d2rq:belongsToClassMap map:DISEASE_GENE_NET_GENES;
     d2rq:property vocab:GENES_GENE_ID;
     d2rq:column "DISEASE_GENE_NET.GENES.GENE_ID";
     d2rq:datatype xsd:long;
     .
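The @@TABLE.COLUMN@@ placeholders in d2rq:uriPattern and d2rq:pattern are filled in from each table row. The Python sketch below imitates that substitution to show how one row could turn into a resource identifier and a label. It is an illustration only and does not reproduce D2R's own implementation (for example, any encoding of values or resolution against a base URI); the row value 1234 and the function name expand_pattern are hypothetical.

```python
import re

# Patterns copied from the mapping file above.
URI_PATTERN = "DISEASE_GENE_NET.GENES/@@DISEASE_GENE_NET.GENES.GENE_ID@@"
LABEL_PATTERN = "DISEASE_GENE_NET.GENES #@@DISEASE_GENE_NET.GENES.GENE_ID@@"


def expand_pattern(pattern, row):
    """Replace each @@TABLE.COLUMN@@ placeholder with the row's value for that column."""
    def lookup(match):
        return str(row[match.group(1)])
    return re.sub(r"@@(.+?)@@", lookup, pattern)


# Hypothetical row from DISEASE_GENE_NET.GENES, keyed by fully qualified column name.
row = {"DISEASE_GENE_NET.GENES.GENE_ID": 1234}

print(expand_pattern(URI_PATTERN, row))
print(expand_pattern(LABEL_PATTERN, row))
```

This is why a primary key column is a natural choice for the URI pattern: it gives every row a distinct resource identifier.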
                               Triple stores

There exist so-called “triple stores” that can use backend data storage
engines, like MySQL, to house RDF data, and process queries.

For example, Sesame is a triple store that can use several different kinds
of backends: DBMS (originally PostgreSQL), simple RDF files, and/or
other, network-accessed triple stores, like Sesame itself.
Sesame also demonstrates a “generic architecture” for RDF and RDFS
storage and query processing, and does not require keeping the whole
graph in memory, when processing requests.

Jena can also employ back-end database management systems.

There are also some graph based data management systems, like Neo4j,
that can be used to store raw graph structured data. In fact, Neo4j has at
least one overlay product that uses Neo4j to manage RDF.
Neo4j may work well for data collections running into the billions of nodes,
since it does not require its whole graph to be memory-contained (although
it works better with larger memory), and is quite fast.

                                             References
Ashburner, M., et al., “Gene ontology: a tool for the unification of biology”, Nature Genetics, 2000.
Berners-Lee, Tim, “Linked Data”, 2006. http://www.w3.org/DesignIssues/LinkedData.html
Bizer, Chris, “The D2RQ Plattform - Treating Non-RDF Databases as Virtual RDF Graphs”,
http://www4.wiwiss.fu-berlin.de/bizer/d2rq/
Bizer, Chris, Richard Cyganiak, Tom Heath, “How to Publish Linked Data on the Web”, 2007.
http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
Cyganiak, Richard, “A Relational Algebra for SPARQL”, HP Labs, 2005.
http://www.hpl.hp.com/techreports/2005/HPL-2005-170.pdf
Davis, Ian, “An Introduction to RDF”, http://research.talis.com/2005/rdf-intro/
Dodds, Leigh, “Introducing SparQL: Querying the Semantic Web”, 2005. http://www.xml.com/lpt/a/1628
McBride, Brian, “An Introduction to RDF and the Jena RDF API”, 2007.
http://jena.sourceforge.net/tutorial/RDF_API/index.html
McCarthy, Philip, “Search RDF data with SPARQL”, 2005.
http://www.ibm.com/developerworks/xml/library/j-sparql/
McCarthy, Philip, “Introduction to Jena”, 2004. http://www.ibm.com/developerworks/xml/library/j-jena//

Nic, Miloslav, “RDF Tutorial”, http://www.zvon.org/xxl/RDFTutorial/General/book.html
Newman, Andrew, “A relational view of the Semantic Web”, 2007. http://www.xml.com/lpt/a/1695
Obitko, Marek, “Introduction to ontologies and semantic web”, 2007.
http://www.obitko.com/tutorials/ontologies-semantic-web
W3C, “SparQL query language for RDF”, 2008. http://www.w3.org/TR/rdf-sparql-query/

								