             “Politehnica” University of Bucharest, Romania
           Faculty of Automatic Control and Computer Science
                      Computer Science Department




                Semantic Web Search
               Diploma Thesis in Computer Science

                       By Raluca PAIU




Supervisors:

    Prof. Dr. Techn. Wolfgang NEJDL
    L3S Research Center, Hannover

    Prof. Dr. Ing. Valentin CRISTEA
    “Politehnica” University of Bucharest

    Dipl. Ing. Paul-Alexandru CHIRITA
    L3S Research Center – Ph.D. Student


                         23rd of August, 2004
        Table of Contents

CHAPTER 1. INTRODUCTION. ...................................................................................... 4
1.1.    L3S RESEARCH CENTER ..................................................................................... 4
1.2.    THE UNIVERSITY OF HANNOVER ....................................................................... 5


CHAPTER 2. EDUTELLA................................................................................................. 6
2.1.    OVERVIEW ......................................................................................................... 6
2.2.    MOTIVATION FOR EDUTELLA ............................................................................. 6
2.3.    BACKGROUND .................................................................................................... 7
  2.3.1. THE JXTA P2P FRAMEWORK ......................................................................................7
  2.3.2. EDUCATIONAL CONTEXT ............................................................................................9
2.4.    EDUTELLA SERVICES ........................................................................................ 10
2.5.    EDUTELLA QUERY SERVICE .............................................................................. 10
  2.5.1. QUERY EXCHANGE ARCHITECTURE .............................................................................. 12
  2.5.2. DATALOG SEMANTICS FOR THE EDUTELLA COMMON DATA MODEL (ECDM) .............................. 13
  2.5.3. EDUTELLA COMMON DATA AND QUERY EXCHANGE MODEL .................................................. 14
  2.5.4. RDF-QEL-I LANGUAGE LEVELS ................................................................................. 15
    2.5.4.1. RDF-QEL Syntax ......................................................................................... 15
    2.5.4.2. RDF-QEL-1 ................................................................................................. 17
    2.5.4.3. RDF-QEL-2 ................................................................................................. 18
    2.5.4.4. RDF-QEL-3 ................................................................................................. 18
    2.5.4.5. Further RDF-QEL-i levels.............................................................................. 18
  2.5.5. REPRESENTING COMPLEX PROPERTY SEMANTICS ............................................................. 18
  2.5.6. QUERYING SCHEMA INFORMATION .............................................................................. 19
  2.5.7. RESULT FORMATS ................................................................................................. 20
    2.5.7.1. Standard Result Set Syntax ......................................................................... 20
    2.5.7.2. RDF Graph Answers .................................................................................... 20
2.6.    REGISTRATION SERVICE AND QUERY MEDIATORS .......................................... 21
  2.6.1. SIMPLE WRAPPING MEDIATORS ................................................................................. 22
  2.6.2. MEDIATOR PEERS HANDLE DISTRIBUTED QUERIES ........................................................... 22


CHAPTER 3. THE TAP SYSTEM. ................................................................................... 23
3.1.    INTRODUCTION ............................................................................................... 23
3.2.    THE SEMANTIC WEB ........................................................................................ 23
3.3.    QUERY INTERFACES FOR THE SEMANTIC WEB ................................................. 25
  3.3.1. RELATED WORK ................................................................................................... 25
  3.3.2. DEPLOYABILITY VERSUS EXPRESSIVENESS ..................................................................... 26
  3.3.3. GETDATA ........................................................................................................... 27

3.4.    SEMANTIC NEGOTIATION ................................................................................ 28
  3.4.1. MOTIVATION ....................................................................................................... 28
  3.4.2. PROBLEMS WITH GLOBAL NAMES................................................................................ 29
  3.4.3. BOOTSTRAPPING ................................................................................................... 31
  3.4.4. DESCRIPTIONS ..................................................................................................... 32
    3.4.4.1. Discriminant Descriptions ............................................................................. 32
    3.4.4.2. Database Keys ........................................................................................... 33
    3.4.4.3. Using Discriminant Descriptions .................................................................... 34
  3.4.5. GETDATA AND SEMANTIC NEGOTIATION ....................................................................... 34
  3.4.6. QUERY COMPLEXITY ............................................................................................... 35
  3.4.7. FAILURE MODES ................................................................................................... 36
  3.4.8. SCOPE OF SEMANTIC NEGOTIATION ............................................................................ 36
  3.4.9. RELATED WORK ................................................................................................... 36
3.5.    REGISTRIES ..................................................................................................... 37
3.6.    TRUST .............................................................................................................. 38
3.7.    THE TAP SYSTEM .............................................................................................. 38
3.8.    SEMANTIC SEARCH .......................................................................................... 39
  3.8.1. CHOOSING A DENOTATION ....................................................................................... 41
  3.8.2. DETERMINING WHAT TO SHOW ................................................................................. 41


CHAPTER 4. INTEGRATING THE TAP SYSTEM WITH EDUTELLA. ................................. 44
4.1.    THE EDUTELLA WRAPPER................................................................................. 44
  4.1.1. PARSING A QEL QUERY .......................................................................................... 44
4.2.    QEL – GETDATA WRAPPER ............................................................................... 45
  4.2.1. CREATING THE N-TREES ......................................................................................... 46
  4.2.2. TRAVERSING THE N-TREES ...................................................................................... 47
    4.2.2.1. Top – Down Traversal.................................................................................. 47
    4.2.2.2. Bottom – Up Traversal ................................................................................. 48
  4.2.3. SENDING THE GETDATA QUERIES, APPLYING THE RESTRICTIONS AND BINDING THE RESULTS ........ 48
  4.2.4. UNIFYING THE RESULTS ........................................................................................... 49
  4.2.5. PRESENTING THE RESULTS TO THE USER ...................................................................... 49
4.3.    AN EXAMPLE .................................................................................................... 49


CHAPTER 5. ADDING PERSONALIZATION TO SEMANTIC WEB SEARCH. ..................... 52
5.1.    HOW TAP APPROACH CAN BE USED IN AN ELEARNING SCENARIO .................. 52
5.2.    IMPLEMENTATION OF THE SCENARIO .............................................................. 53
  5.2.1. THE GLOSSARY OF SEMANTIC WEB TERMS .................................................................... 56
  5.2.2. FINDING RESOURCES THAT MATCH THE KEYWORDS ......................................................... 57
    5.2.2.1. Approach 1 – Single Words .......................................................................... 57
    5.2.2.2. Approach 2 – Single Words Plus Stemmer ...................................................... 57
    5.2.2.3. Approach 3 – Meaningful Expressions ............................................................ 58
  5.2.3. FINDING RESOURCES THAT MATCH THE TITLE ................................................................ 59
    5.2.3.1. Approach 1 – Single Words .......................................................................... 59
    5.2.3.2. Approach 2 – Single Words Plus Stemmer ...................................................... 59
    5.2.3.3. Approach 3 – Meaningful Expressions ............................................................ 60
    5.2.3.4. Approach 4 – Title Reduced to a List of Semantic Web Terms ........................... 60

  5.2.4. EVALUATING THE RESULTS OF THE PROGRAM ................................................................. 61


CHAPTER 6. CONCLUSIONS AND FUTURE WORK. ....................................................... 66


CHAPTER 7. BIBLIOGRAPHY. ..................................................................................... 67


CHAPTER 8. ANNEX A. .................................................................................................. 0
  SOME CODE FROM THE QEL-GETDATA WRAPPER. .................................................................0


CHAPTER 9. ANNEX B. ................................................................................................ 53
  SOME CODE FROM THE PERSONALIZED SEMANTIC WEB SEARCH. ............................................ 53


CHAPTER 10. ANNEX C. .............................................................................................. 82
  THE XML FILE CONTAINING THE LECTURES (JUST ONE LECTURE). ......................................... 82
  THE XSL FILE USED TO CONVERT THE XML FILE, CONTAINING THE LECTURE, INTO RDF. ........... 84




CHAPTER 1. INTRODUCTION.


1.1. L3S Research Center

Since its foundation in 2001, the L3S Research Center has focused on theoretical and applied
research in the innovative areas of information, learning, and knowledge technologies, as well
as on training and continuing-education concepts for academia and industry.
The L3S was created as a coordinating institute for the University of Hannover, the Technical
University Carolo-Wilhelmina at Brunswick, and the Brunswick School of Art. Professors from
the universities of Karlsruhe, Mannheim and Kassel are also actively involved in the L3S
Research Center.
The L3S has established itself nationally and internationally: as coordinator of the Network of
Excellence PROLEARN within the 6th EU research programme; as core partner in the
KnowledgeWeb and REWERSE networks, working in the field of the Semantic Web; and as
partner in the EURON (Robotics) network and in the Wallenberg Global Learning Network
(WGLN). It is also active in the organization of the most important conferences in these fields.
Today, the tasks of the L3S include research, consulting, and technology transfer in the areas
named above; provision of infrastructure and support in the area of innovative teaching and
learning technologies; and collaboration with both German and international standardization
bodies.
The creation of the L3S Research Center was made possible by funding from the federal state
of Lower Saxony and the Federal Ministry for Education and Research (BMBF) in 2001-2003.
Today, over 80 women and men are employed by the L3S in four research and development
areas:
• eLearning
• Semantic Web and Digital Libraries
• Industrial Informatics
• Mobile / Distributed Computing and Networks
A further area (Grid Computing) is currently being developed.
In the short period of less than three years, the L3S Research Center has established itself as a
nationally and internationally acknowledged center of research. Equally, the L3S Research
Center has provided important service functions during the establishment and development of
new infrastructures for higher education in Lower Saxony. In addition, the L3S attaches a great
deal of importance to the development of applied research and to the creation of sustainable
networks with regional businesses. These activities all ensure that the L3S is an important
location factor for Hanover, Brunswick and Lower Saxony.




1.2. The University of Hannover

The heart of the University of Hannover beats in the idyllic Welfenschloss, the Guelph Palace.
Originally founded in 1831, the university now has around 27,000 students, and some 2,000
academics and scientists work at the university in 17 faculties, with around 160 departments
and institutes.




CHAPTER 2. EDUTELLA.


2.1. Overview

While metadata are useful and important in the client/server-based environment of the World
Wide Web, in Peer-to-Peer (P2P) environments they are absolutely crucial. Information
resources in P2P networks are no longer organized in navigable hypertext-like structures;
instead, they are stored on numerous peers, waiting to be queried, provided we know what we
want to retrieve and which peer is able to provide that information. Querying peers requires
metadata describing the resources managed by these peers, which is easy to provide for
specialized cases but non-trivial for general applications.




2.2. Motivation for Edutella


P2P applications have been successful for special cases like exchanging music files. However,
retrieving “all recent songs by Madonna” requires neither complex query languages nor complex
metadata, so special-purpose formats have been sufficient for these P2P applications. In other
scenarios, such as exchanging educational resources, queries are more complex and have to be
built upon standards like IEEE-LOM/IMS metadata, with up to 100 metadata entries, which
might even be complemented by domain-specific extensions.
Furthermore, by concentrating on domain-specific formats, current P2P implementations appear
to be fragmenting into niche markets instead of developing unifying mechanisms for future P2P
applications. There is indeed a great danger that the unifying interfaces and protocols
introduced by the World Wide Web get lost in the forthcoming P2P arena.
The Edutella project addresses these shortcomings of current P2P applications by building on
the W3C metadata standard RDF. The project is a multi-staged effort to scope, specify,
architect and implement an RDF-based metadata infrastructure for P2P networks based on the
JXTA framework.




2.3. Background


     2.3.1. The JXTA P2P Framework

JXTA is an Open Source project supported and managed by Sun Microsystems. In essence,
JXTA is a set of XML-based protocols covering typical P2P functionality. It provides a Java
binding offering a layered approach for creating P2P applications (core, services, and
applications). Project JXTA is creating a P2P system by identifying a small set of basic functions
necessary to support P2P applications and providing them as building blocks for higher-level
functions (Figure 1). At the core, capabilities must exist to create and delete peer groups, to
advertise them to potential members, and to enable others to find, join, or leave them.
At the next level, these capabilities can be used to create a set of peer services, including
indexing, searching and file sharing. Peer applications can be built using these facilities. In
addition, peer commands and a peer shell have been created as a window into the JXTA
technology-based network.




                                   Figure 1. JXTA Layers



JXTA Core
The JXTA Core provides core support for peer-to-peer services and applications. In a multi-
platform, secure execution environment, the mechanisms of peer group, peer pipes, and peer
monitoring are provided:
• Peer groups establish a set of peers and naming within a peer group with mechanisms to
create policies for creation and deletion, membership, advertising and discovery of other peer
groups and peer nodes, communication, security, and content sharing.
• Peer pipes provide communication channels among peers. Messages sent in peer pipes are
structured with XML, and support transfer of data, content, and code in a protocol-independent
manner — allowing a range of security, integrity, and privacy options.
• Peer monitoring enables control of the behaviour and activity of peers in a peer group and can
be used to implement peer management functions including access control, priority setting,
traffic metering, and bandwidth balancing.

The core layer supports choices such as anonymous vs. registered users and encrypted vs.
clear-text content, without imposing specific policies on developers. Policy choices are made,
and when necessary implemented, at the service and application layers. For example,
administration services such as accepting or rejecting a peer’s membership in a peer group can
be implemented using functions in the core layer.
JXTA Services
Just as the various libraries in UNIX operating systems support higher-level functions than the
kernel, JXTA Services expand upon the capabilities of the core and facilitate application
development. Facilities provided in this layer include mechanisms for searching, sharing,
indexing, and caching code and content to enable cross-application bridging and translation of
files.
Searching capabilities include distributed, parallel searches across peer groups, facilitated by
matching an XML representation of the query to be processed with representations of the
responses each peer can provide. These facilities can be used for anything from simple
searches — like searching a peer’s repository — to complex searches of dynamically generated
content that is unreachable by conventional search engines.
P2P searches can be conducted across a company’s intranet, quickly locating relevant
information within a secure environment. By exercising tight control over peer group
membership and enabling encrypted communication between peers, a company can extend
this capability to its extranet, including business partners, consultants, and suppliers as peers.
The same mechanisms that facilitate searches across the peer group can be used as a bridge to
incorporate Internet search results, and to include data outside of the peer’s own repository,
for example by searching a peer’s disk.
The peer services layer can be used to support other custom, application-specific functions —
for example a secure peer messaging system could be built to allow anonymous authorship and
a persistent message store. The peer services layer provides the mechanisms to create such
secure tools; specific tool policies are determined by the application developers themselves.
JXTA Shell
Straddling the boundary between peer services and applications is the JXTA Shell, which
enables both developers and users to experiment with the capabilities of JXTA technology,
prototype applications, and control the peer environment.
The peer shell contains both built-in functions that facilitate access to core-level functions
through a command line interface, and external commands which can be assembled using pipes
to accomplish more complex functions — just like a UNIX operating system shell, but extended
into the peer-to-peer environment. For example, a shell command peers lists the accessible
peers in the group. Other facilities include command-line access to peer discovery, joining and
leaving peer groups, and sending messages between peers. In the future, shell-level commands
will enable administrative control of peer groups, including who can join a given group and what
resources they can access.
JXTA Applications
JXTA Applications are built using peer services as well as the core layer. The project’s
philosophy is to support the fundamental levels broadly, and rely on the P2P development
community to provide additional peer services and applications.
Peer applications enabled by both the core and peer services layers include P2P auctions that
link buyers and sellers directly — with buyers able to program their bidding strategies using a
simple scripting language. Resource-sharing applications like SETI@home can be built more
quickly and easily, with heterogeneous, world-wide peer groups supported from day one.
Instant messaging, mail, and calendaring services can facilitate communication and
collaboration within peer groups that are secure and independent of service provider-hosted
facilities.
This layered approach fits very nicely into the application scenarios defined for Edutella:
Edutella Services (described in web service languages like DAML-S or WSDL) complement the
JXTA Service Layer, building upon the JXTA Core Layer, and Edutella Peers live on the
Application Layer, using the functionality provided by these Edutella services as well as possibly
other JXTA services.
On the Edutella Service layer, we define data exchange formats and protocols (how to exchange
queries, query results and other metadata between Edutella Peers), as well as APIs for
advanced functionality in a library-like manner. Applications such as repositories, annotation
tools or GUI interfaces connected to and accessing the Edutella network are implemented on
the Application layer.




     2.3.2. Educational Context


Every single university usually already has a large pool of educational resources distributed over
its institutions. These are under the control of single entities or individuals, and it is unlikely that
these entities will give up their control, which explains why all approaches for the distribution of
educational media based on central repositories have failed so far. Furthermore, setting up and
maintaining central servers is costly. The costs are hardly justifiable, since a server distributing
educational material would not directly benefit the sponsoring university.
In order to really facilitate the exchange of educational media, approaches based on metadata-
enhanced peer-to-peer (P2P) networks are necessary.
In a typical P2P-based e-Learning scenario, each university acts not only as content provider,
but also as content consumer, including local annotation of resources produced at other sites.
As content providers in a P2P network, universities will not lose control over their learning
resources, but still provide them for use within the network. As content consumers, both
teachers and students benefit from having access not only to a local repository, but to a whole
network, using queries over the metadata distributed within the network to retrieve the
required resources.
P2P networks have already been quite successful for exchanging data in heterogeneous
environments, and have been brought into focus by services like Napster and Gnutella,
providing access to distributed resources like MP3-coded audio data. However, pure Napster-
and Gnutella-like approaches are not suitable for the exchange of educational media. For
example, the metadata in Gnutella is limited to a file name and a path. While this might work
for files with titles like “Madonna – Like a Virgin”, it certainly does not work for “Introduction to
Algebra – Lecture 23”. Furthermore, these special-purpose services lead to fragmented
communities which use special-purpose clients to access their service.
The educational domain is in need of a much richer metadata markup of resources, a markup
that is often highly domain- and resource-type specific. In order to facilitate interoperability and
reusability of educational resources, we need to build a system supporting a wide range of such
resources. This places high demands on the interchange of protocols and metadata schemata
used in such a system, as well as on the overall technical structure. Also, we do not want to
create yet another special-purpose solution which becomes outdated as soon as metadata
requirements and definitions change.
This metadata-based peer-to-peer system must therefore be able to integrate heterogeneous
peers (using different repositories, query languages and functionalities) as well as different
kinds of metadata schemas. Two very important assumptions underlie Edutella: all resources
maintained in the Edutella network can be described in RDF, and all functionality in the Edutella
network is mediated through RDF statements and queries on them. For the local user, the
Edutella network transparently provides access to distributed information resources, and
different clients/peers can be used to access these resources. Each peer will be required to
offer a number of basic services and may offer additional advanced services.




2.4. Edutella Services


Edutella connects highly heterogeneous peers (heterogeneous in their uptime, performance,
storage size, functionality, number of users, etc.). However, each Edutella peer can make its
metadata information available as a set of RDF statements. The goal is to make the distributed
nature of the individual RDF peers connected to the Edutella network completely transparent by
specifying and implementing a set of Edutella services. Each peer will be characterized by the
set of services it offers.
Query Service. The Edutella query service is the most basic service within the Edutella
network. Peers register the queries they may be asked through the query service, i.e. by
specifying supported metadata schemas (e.g. “this peer provides metadata according to the
LOM 6.1 or DCMI standards”), or by specifying individual properties or even values for these
properties (e.g. “this peer provides metadata of the form dc_title(X, Y)” or “this peer provides
metadata of the form dc_title(X, ’Artificial Intelligence’)”). Queries are sent through the Edutella
network to the subset of peers that have registered with the service as interested in this kind
of query. The resulting RDF statements / models are sent back to the requesting peer.
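The registration-based routing described above can be sketched in a few lines of Python. This is a hypothetical illustration only; the class and function names are invented for the example and are not part of Edutella's actual API:

```python
# Hypothetical sketch of capability-based query routing: peers register the
# query patterns they can answer, and a query is forwarded only to matching
# peers. All names are invented for illustration; this is not the Edutella API.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Registration:
    predicate: str                   # e.g. "dc_title"
    value: Optional[str] = None      # None means "any value", as in dc_title(X, Y)

@dataclass
class Peer:
    name: str
    registrations: List[Registration] = field(default_factory=list)

    def accepts(self, predicate: str, value: str) -> bool:
        # A peer accepts a query if it registered the predicate, either for
        # any value or for this specific value.
        return any(r.predicate == predicate and r.value in (None, value)
                   for r in self.registrations)

def route(peers: List[Peer], predicate: str, value: str) -> List[str]:
    """Return the names of the peers the query should be forwarded to."""
    return [p.name for p in peers if p.accepts(predicate, value)]

peers = [
    Peer("peer-A", [Registration("dc_title")]),                            # dc_title(X, Y)
    Peer("peer-B", [Registration("dc_title", "Artificial Intelligence")]),
    Peer("peer-C", [Registration("dc_creator")]),
]

print(route(peers, "dc_title", "Artificial Intelligence"))  # ['peer-A', 'peer-B']
print(route(peers, "dc_title", "Prolog"))                   # ['peer-A']
```

Note how the second query reaches only peer-A: peer-B registered dc_title with a specific value, so it is skipped for other titles, exactly the selectivity the registration examples above describe.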
Edutella Replication. This service complements local storage by replicating data on additional
peers to achieve data persistence / availability and workload balancing while maintaining data
integrity and consistency. Since Edutella is mainly concerned with metadata, replication of
metadata is the initial focus. Replication of the data itself might be an additional possibility
(though this complicates synchronization of updates).
Edutella Mapping, Mediation, Clustering. While groups of peers will usually agree on using a
common schema (e.g. SCORM or IMS/LOM for educational resources), extensions or variations
might be needed in some locations. The Edutella Mapping service will be able to manage
mappings between different schemata and use these mappings to translate queries over one
schema X into queries over another schema Y. Mapping services will also provide interoperation
between RDF- and XML-based repositories. Mediation services actively mediate access between
different services; clustering services use semantic information to set up semantic routing and
semantic clusters.




2.5. Edutella Query Service


The Edutella Query Service is intended to be a standardized query exchange mechanism for
RDF metadata stored in distributed RDF repositories. It is meant to serve both as a query
interface for individual RDF repositories located at single Edutella peers and as a query
interface for distributed queries spanning multiple RDF repositories. An RDF repository (or
knowledge base) consists of RDF statements (or facts) and describes metadata according to
arbitrary RDFS schemas.
One of the main purposes is to abstract from various possible RDF storage layer query
languages (e.g. SQL) and from different user level query languages (e.g. RQL, TRIPLE). The
Edutella Query Exchange Language and the Edutella common data model provide the syntax
and semantics for an overall standard query interface across heterogeneous peer repositories
for any kind of RDF metadata. The Edutella network uses the query exchange language family
RDF-QEL-i (based on Datalog semantics and subsets thereof) as its standardized query
exchange format, transmitted in RDF/XML.

[Figure 2 depicts the example knowledge base as an RDF graph: the resources
http://www.xyz.com/sw.html (rdf:type http://www.lit.edu/types#Book, dc:title “Software
Engineering”) and http://www.xyz.com/ai.html (rdf:type http://www.lit.edu/types#Book,
dc:title “Artificial Intelligence”), and the resource http://www.xyz.com/pl.html (rdf:type
http://www.lit.edu/types#AI-Book, dc:title “Prolog”).]

                        Figure 2. Knowledge Base as RDF Graph


We will start with a simple RDF knowledge base and a simple query on this knowledge base as
depicted in Figure 2, with the following XML serialization (we use lib as namespace shorthand
for ‘http://www.lit.edu/types#’):
<lib:Book about="http://www.xyz.com/sw.html">
      <dc:title>Software Engineering</dc:title>
</lib:Book>

<lib:Book about="http://www.xyz.com/ai.html">
      <dc:title>Artificial Intelligence</dc:title>
</lib:Book>
<lib:AI-Book about="http://www.xyz.com/pl.html">
      <dc:title>Prolog</dc:title>
</lib:AI-Book>


Evaluating the following query (in plain English)
“Return all resources that are a Book having the title ‘Artificial Intelligence’ or that are an
AI-Book”

we get the query results shown in Figure 3, depicted as RDF-graph.


[Figure 3 depicts the query results as an RDF graph: http://www.xyz.com/ai.html (rdf:type
http://www.lit.edu/types#Book, dc:title “Artificial Intelligence”) and
http://www.xyz.com/pl.html (rdf:type http://www.lit.edu/types#AI-Book).]

                      Figure 3. Query Results as RDF Graph
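The semantics of this disjunctive query can be reproduced over the example knowledge base with a plain Python set of triples. This is only an illustration of the query's meaning; the helper names are invented, and this is not how Edutella evaluates RDF-QEL:

```python
# Illustration of the example query's semantics over the Figure 2 knowledge
# base, using a plain Python set of triples (helper names invented; this is
# not how Edutella evaluates RDF-QEL).
LIB = "http://www.lit.edu/types#"
RDF_TYPE = "rdf:type"
DC_TITLE = "dc:title"

triples = {
    ("http://www.xyz.com/sw.html", RDF_TYPE, LIB + "Book"),
    ("http://www.xyz.com/sw.html", DC_TITLE, "Software Engineering"),
    ("http://www.xyz.com/ai.html", RDF_TYPE, LIB + "Book"),
    ("http://www.xyz.com/ai.html", DC_TITLE, "Artificial Intelligence"),
    ("http://www.xyz.com/pl.html", RDF_TYPE, LIB + "AI-Book"),
    ("http://www.xyz.com/pl.html", DC_TITLE, "Prolog"),
}

def matches(resource: str) -> bool:
    # Disjunctive query: (Book AND title "Artificial Intelligence") OR AI-Book.
    is_ai_titled_book = ((resource, RDF_TYPE, LIB + "Book") in triples and
                         (resource, DC_TITLE, "Artificial Intelligence") in triples)
    is_ai_book = (resource, RDF_TYPE, LIB + "AI-Book") in triples
    return is_ai_titled_book or is_ai_book

resources = sorted({s for (s, _, _) in triples})
results = [r for r in resources if matches(r)]
print(results)  # ai.html and pl.html, exactly the resources in Figure 3
```

The sw.html resource is excluded because it satisfies neither disjunct: it is a Book, but its title is not “Artificial Intelligence”, and it is not an AI-Book.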




     2.5.1. Query Exchange Architecture


Edutella peers are highly heterogeneous in terms of the functionality (i.e. services) they offer. A simple peer has RDF storage capability only: it provides some kind of local storage for RDF triples (e.g. a relational database) as well as some kind of local query language (e.g. SQL). In addition, a peer might offer more complex services, such as annotation, mediation or mapping.
To enable a peer to participate in the Edutella network, Edutella wrappers translate queries and results between the Edutella query and exchange format and the local format of the peer, and connect the peer to the Edutella network through a JXTA-based P2P library.
To handle queries, the wrapper uses the common Edutella query exchange format and data model for query and result representation. For communication with the Edutella network, the wrapper translates the local data model into the Edutella Common Data Model (ECDM) and vice versa, and connects to the network using the JXTA P2P primitives, transmitting queries based on the ECDM in RDF/XML form (Figure 4).




                        Figure 4. Query Processing in Edutella




In order to handle different query capabilities, we define several RDF-QEL-i exchange language levels, describing which kinds of queries a peer can handle (conjunctive queries, relational algebra, transitive closure, etc.). The same internal data model is used for all levels.




     2.5.2. Datalog Semantics for the Edutella Common
          Data Model (ECDM)


Datalog is a non-procedural query language based on Horn clauses without function symbols. A Horn clause is a disjunction of literals with at most one positive (non-negated) literal. A Datalog program can be expressed as a set of rules / implications (where each rule consists of one positive literal in the consequent of the rule (the head) and one or more negative literals in the antecedent of the rule (the body)), a set of facts (single positive literals) and the actual query literals (a rule without a head, i.e. one or more negative literals).
Literals are predicate expressions describing relations between any combination of variables and constants, such as title(http://www.xyz.com/book.html, 'Artificial Intelligence'). Each rule is divided into head and body, with the head being a single literal and the body being a conjunction of any number of positive literals (including conditions on variables). Disjunction is expressed as a set of rules with identical heads. A Datalog query then is a conjunction of query literals plus a possibly empty set of rules.
Datalog shares with relational databases and with RDF the central feature that data are conceptually grouped around properties (in contrast to object-oriented systems, which group information within objects usually having object identity). Therefore, Datalog queries map easily to relations and relational query languages like relational algebra or SQL. In terms of relational algebra, Datalog is capable of expressing selection, union, join and projection, and hence is a relationally complete query language. Additional features include transitive closure and other recursive definitions.
The example knowledge base in Datalog reads:
title(http://www.xyz.com/ai.html, 'Artificial Intelligence').
type(http://www.xyz.com/ai.html, Book).
title(http://www.xyz.com/sw.html, 'Software Engineering').
type(http://www.xyz.com/sw.html, Book).
title(http://www.xyz.com/pl.html, 'Prolog').
type(http://www.xyz.com/pl.html, AI-Book).


In RDF any statement is considered to be an assertion. Therefore we can view an RDF
repository as a set of ground assertions either using binary predicates as shown above, or as
ternary statements “s(S,P,O)”, if we include the predicate as an additional argument. In the
following examples, we use the binary surface representation, whenever our query does not
span more than one abstraction level.
In (binary) Datalog notation, our example query is:
aibook(X) :- title(X, 'Artificial Intelligence'), type(X, Book).
aibook(X) :- type(X, AI-Book).
?- aibook(X).


Since our query is a disjunction of two (purely conjunctive) subqueries, its Datalog representation is composed of two rules with identical heads. The literals in the rules’ bodies directly reflect RDF statements, with their subjects being the variable X and their objects being bound to constant values such as ‘Artificial Intelligence’. Literals used in the heads of rules denote derived predicates (not necessarily binary ones).
In our example, the query expression ‘aibook(X)’ asks for all bindings of X which conform to the Datalog rules and the knowledge base to be queried, with the results:
aibook(http://www.xyz.com/ai.html)
aibook(http://www.xyz.com/pl.html)
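The evaluation of this query can be sketched in a few lines of Python. This is a hypothetical illustration, not part of Edutella: facts are stored as (predicate, subject, object) tuples, each rule body is evaluated as a conjunction (intersecting the bindings of X), and the disjunction arises from taking the union over the two rules.

```python
# Minimal sketch of evaluating the two-rule aibook query (hypothetical,
# not Edutella code). Facts are (predicate, subject, object) tuples.
facts = {
    ("title", "http://www.xyz.com/ai.html", "Artificial Intelligence"),
    ("type",  "http://www.xyz.com/ai.html", "Book"),
    ("title", "http://www.xyz.com/sw.html", "Software Engineering"),
    ("type",  "http://www.xyz.com/sw.html", "Book"),
    ("title", "http://www.xyz.com/pl.html", "Prolog"),
    ("type",  "http://www.xyz.com/pl.html", "AI-Book"),
}

def match(body):
    """Return all bindings of X satisfying a conjunctive rule body.

    Each body literal is (predicate, object); the subject is the single
    shared variable X, so conjunction is set intersection."""
    results = None
    for pred, obj in body:
        xs = {s for (p, s, o) in facts if p == pred and o == obj}
        results = xs if results is None else results & xs
    return results or set()

# aibook(X) :- title(X, 'Artificial Intelligence'), type(X, Book).
# aibook(X) :- type(X, AI-Book).
rules = [
    [("title", "Artificial Intelligence"), ("type", "Book")],
    [("type", "AI-Book")],
]

# Disjunction: union of the bindings produced by each rule.
aibook = set().union(*(match(body) for body in rules))
print(sorted(aibook))
# -> ['http://www.xyz.com/ai.html', 'http://www.xyz.com/pl.html']
```

Each rule contributes one of the two results above, mirroring the two rules with identical heads in the Datalog program.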




     2.5.3. Edutella Common Data and Query Exchange
      Model


Internally, Edutella peers use a Datalog-based model to represent queries and their results. Figure 5 describes a UML model of QEL queries and results. This model forms the basis for the construction of the RDF syntax for QEL. Our Java binding relies on JXTA and makes extensive use of the Stanford RDF API. The implementation of all classes shown in Figure 5 can be found in the Java package net.jxta.edutella.util.datamodel. All classes whose names start with RDF represent standard RDF concepts and correspond to their equivalent counterparts within the Stanford RDF API.




      Figure 5. Edutella Common Data and Query Exchange Model (ECDM)

Each query is represented as an instance of Query which aggregates an arbitrary number of
Rules and QueryLiterals objects. A Rule consists of a QueryLiteral as its head and an
arbitrary number of QueryLiterals as its body. In the body of a rule, literals or outer join
literals can occur. QueryLiterals are instances of StatementLiterals. Results can be
expressed as a set of RDFNodes or as a sequence of Variables.
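As an illustration only (the actual binding is the Java package net.jxta.edutella.util.datamodel, whose exact signatures are not reproduced here), the aggregation structure just described can be sketched with Python dataclasses:

```python
from dataclasses import dataclass, field
from typing import List, Union

# Illustrative sketch of the ECDM aggregation structure described above;
# class and field names follow the prose, not the actual Java binding.

@dataclass
class Variable:
    label: str

@dataclass
class QueryLiteral:
    predicate: str
    arguments: List[Union["Variable", str]]   # variables or constants

@dataclass
class Rule:
    head: QueryLiteral
    body: List[QueryLiteral] = field(default_factory=list)

@dataclass
class Query:
    rules: List[Rule] = field(default_factory=list)
    query_literals: List[QueryLiteral] = field(default_factory=list)

# The running example: two rules with the identical head aibook(X).
X = Variable("X")
q = Query(
    rules=[
        Rule(head=QueryLiteral("aibook", [X]),
             body=[QueryLiteral("type", [X, "Book"]),
                   QueryLiteral("title", [X, "Artificial Intelligence"])]),
        Rule(head=QueryLiteral("aibook", [X]),
             body=[QueryLiteral("type", [X, "AI-Book"])]),
    ],
    query_literals=[QueryLiteral("aibook", [X])],
)
```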




     2.5.4. RDF-QEL-i Language Levels


In the definition of the Edutella query exchange language, several important design criteria have been formulated:
Standard semantics of the query exchange language, as well as a sound RDF serialization. Simple and standard semantics are important, as transformations to and from this language have to be performed in the Edutella peer wrappers, which have to preserve the semantics of the query in the original query language. A sound encoding of the queries in RDF, to be shipped around between Edutella peers, has to be provided.
Expressiveness of the language. We want to interface to simple graph-based query engines as well as SQL query engines and even inference engines. It is important that the language allows expressing simple queries in a form that simple query providers can directly use, while allowing advanced peers to fully use its expressiveness.
Adaptability to different formalisms. The query language has to be neutral with respect to different representation semantics: it should be able to use any predicates with predefined semantics (like rdfs:subClassOf), but not have their semantics built in, in order to be applicable to the different formalisms used by the Edutella peers. It should easily connect to simple RDF repositories, relational databases and object-relational ones, as well as to inference systems, with all their different base semantics and capabilities.
Transformability of the query language. The basic query exchange language model must be easy to translate to many different query languages (both for importing and exporting), allowing easy implementation of Edutella peer wrappers.
Edutella follows a layered approach to defining the query exchange language. Currently, the language levels RDF-QEL-1, -2, -3, -4 and -5 are defined, differing in their expressiveness. All language levels can be represented through the same internal data model (Figure 5). A query representation in RDF is also specified, using reified RDF statements to describe triple patterns. The simplest language level, RDF-QEL-1, can also be expressed as an unreified RDF graph, which simplifies query formulation.
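Assuming the levels are ordered by increasing expressiveness, capability checking reduces to an ordering test: a peer advertising level i can answer any query formulated at a level ≤ i. A minimal sketch (hypothetical function and names, not part of the Edutella API):

```python
# Hypothetical sketch: deciding whether a peer can evaluate a query,
# assuming RDF-QEL levels are ordered by increasing expressiveness.
QEL_LEVELS = ["RDF-QEL-1", "RDF-QEL-2", "RDF-QEL-3", "RDF-QEL-4", "RDF-QEL-5"]

def can_answer(peer_level: str, query_level: str) -> bool:
    """A peer supporting level i handles all queries up to level i."""
    return QEL_LEVELS.index(query_level) <= QEL_LEVELS.index(peer_level)

print(can_answer("RDF-QEL-3", "RDF-QEL-1"))  # True: conjunctive query
print(can_answer("RDF-QEL-1", "RDF-QEL-4"))  # False: recursion unsupported
```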




              2.5.4.1. RDF-QEL Syntax


As with our internal Datalog model, the RDF representation of a query is modeled as a set of rules and query literals. A construct for each ECDM query literal type is defined. Reifying an RDF statement involves creating a resource of type rdf:Statement that models the RDF triple; this resource has as properties the subject, the predicate and the object of the modeled triple. Such reified statements are the building blocks of each query. The example query expressed in RDF-QEL-3 resembles the internal Datalog model described above:
<edu:QEL3Query rdf:about='#AI_Book_Query'>
      <edu:hasRule rdf:resource='#r1'/>
      <edu:hasRule rdf:resource='#r2'/>
      <edu:hasQueryLiteral rdf:resource='#l1'/>
</edu:QEL3Query>

<edu:Variable rdf:about="#X" rdfs:label="X"/>

<edu:Rule rdf:about='#r1'>
      <edu:hasHead>
            <edu:StatementLiteral>
                  <edu:predicate rdf:resource='#aibook'/>
                  <edu:arguments>
                        <rdf:Seq>
                              <rdf:li rdf:resource='#X'/>
                        </rdf:Seq>
                  </edu:arguments>
            </edu:StatementLiteral>
      </edu:hasHead>
      <edu:hasBody>
            <edu:RDFReifiedStatement>
                  <rdf:subject rdf:resource='#X'/>
                  <rdf:predicate rdf:resource='&rdf;type'/>
                  <rdf:object rdf:resource='&lit;Book'/>
            </edu:RDFReifiedStatement>
      </edu:hasBody>
      <edu:hasBody>
            <edu:RDFReifiedStatement>
                  <rdf:subject rdf:resource='#X'/>
                  <rdf:predicate rdf:resource='&dc;title'/>
                  <rdf:object>Artificial Intelligence</rdf:object>
            </edu:RDFReifiedStatement>
      </edu:hasBody>
</edu:Rule>

<edu:Rule rdf:about='#r2'>
      <edu:hasHead>
            <edu:StatementLiteral>
                  <edu:predicate rdf:resource='#aibook'/>
                  <edu:arguments>
                        <rdf:Seq>
                              <rdf:li rdf:resource='#X'/>
                        </rdf:Seq>
                  </edu:arguments>
            </edu:StatementLiteral>
      </edu:hasHead>
      <edu:hasBody>
            <edu:RDFReifiedStatement>
                  <rdf:subject rdf:resource='#X'/>
                  <rdf:predicate rdf:resource='&rdf;type'/>
                  <rdf:object rdf:resource='&lit;AI-Book'/>
            </edu:RDFReifiedStatement>
      </edu:hasBody>
</edu:Rule>

<edu:StatementLiteral rdf:about='#l1'>
      <edu:predicate rdf:resource='#aibook'/>
      <edu:arguments>
            <rdf:Seq>
                   <rdf:li rdf:resource='#X'/>
            </rdf:Seq>
      </edu:arguments>
</edu:StatementLiteral>




             2.5.4.2. RDF-QEL-1


RDF-QEL-1 is restricted to conjunctive formulas only. While it is possible to express them using the default RDF-QEL notation, we have designed a special RDF-QEL-1 syntax following the QBE (Query by Example) paradigm: queries are represented as ordinary RDF graphs having exactly the same structure as the answer graph, with additional annotations to denote variables and constraints on them. Any such RDF graph query can be interpreted as a logical (conjunctive) formula that is to be proven from a knowledge base.
Since disjunction cannot be expressed in the RDF-QEL-1 syntax, our example query has to be split into two separate subqueries (Figure 6):




           Figure 6. Example Query in RDF-QEL-1, Unreified Format


<edu:QEL1Query rdf:about="#AI_Query_1">
      <edu:hasVariable rdf:resource="#X"/>
</edu:QEL1Query>

<edu:Variable rdf:about="#X" rdfs:label="X">
      <rdf:type rdf:resource="&lit;AI-Book"/>
</edu:Variable>

<edu:QEL1Query rdf:about="#AI_Query_2">
      <edu:hasVariable rdf:resource="#Y"/>
</edu:QEL1Query>

<edu:Variable rdf:about="#Y" rdfs:label="Y">
      <rdf:type rdf:resource="&lit;Book"/>
      <dc:title>Artificial Intelligence</dc:title>
</edu:Variable>




              2.5.4.3. RDF-QEL-2


Extending RDF-QEL-1 with disjunction leads to RDF-QEL-2. Queries of this type can be transformed into an AND-OR tree of reified statements, allowing for a very user-friendly visualization. Queries can be stored and reused later; thus we can work with a library of queries that can be combined to form new queries. These queries can be used as-is or as templates, where substrings, numerical values, etc. are filled in. Details of subqueries can be suppressed by hiding them in detail maps that can be presented hierarchically.




              2.5.4.4. RDF-QEL-3


Going a step further, we arrive at the full Datalog semantics with conjunction, disjunction and
negation of literals. As long as queries are non-recursive this approach is relationally complete.




              2.5.4.5. Further RDF-QEL-i levels


RDF-QEL-4: RDF-QEL-4 allows recursion to express transitive closure and linear recursive
query definitions, compatible with SQL3 capabilities. So a relational query engine with full
conformance to the SQL3 standard will be able to support the RDF-QEL-4 query level.
RDF-QEL-5: Further levels allow arbitrary recursive definitions in stratified or dynamically
stratified Datalog, guaranteeing one single minimal model and thus unambiguous query results.
RDF-QEL-i-A: Support for the usual aggregation functions as defined by SQL92 (e.g. COUNT, AVG, MIN, MAX) will be denoted by appending “-A” to the query language level, i.e. RDF-QEL-1-A, RDF-QEL-2-A, etc. RDF-QEL-i-A includes these aggregation functions as edu:count, edu:avg, edu:min, etc. Additional “foreign” functions like edu:substring, to be used in conditions, might be useful as well, but have not yet been included in RDF-QEL-i-A.
RDF-MEL: RDF-MEL is an extension of RDF-QEL-3 by constructs to modify knowledge bases on
other peers. It provides commands similar to the SQL INSERT, DELETE or UPDATE statements.




     2.5.5. Representing Complex Property Semantics


RDFS already comes with predefined semantics for certain properties (e.g. the transitivity of rdfs:subClassOf, inheritance via rdf:type). Whenever a query includes these predefined predicates, we presume their predefined semantics. This is valid for DAML+OIL predefined predicates and their semantics as well, i.e. if we use definitions like

<daml:TransitiveProperty rdf:ID="hasAncestor"/>

then transitivity of hasAncestor is assumed:
hasAncestor(X, Y) :- hasAncestor(X, Z), hasAncestor(Z, Y).
without having to be specified explicitly in the query.
If we want to specify something else, we in principle have to specify its semantics as Datalog rules and ship them with the query. However, we can add special annotations to properties, such as edu:transitive_closure_of (denoting the transitive closure of a property), edu:inherited_version_of (inheritance of a property along the subClassOf hierarchy) or edu:reflexive_version_of (the reflexive version of a property), which can be used directly by the Edutella peer wrapper (whenever it knows what these edu: properties mean). This has the advantage that the wrapper does not have to infer the correct semantics from the corresponding Datalog rules, but can use the predefined semantics for these edu: properties directly. This keeps the clear semantics of RDF-QEL-i but allows abbreviations which make it easier to write Edutella peer wrappers. Also, while it is possible to axiomatize quite a lot of specific operators in Datalog (including the ones described above), Datalog also has its limitations. Datalog (and its extensions) do overlap with description logic fragments of first order logic (e.g. DAML+OIL), but usually cannot axiomatize them completely (and the same is true in the other direction).
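For example, a wrapper that recognizes edu:transitive_closure_of can materialize the transitive closure of the annotated property directly, instead of interpreting the Datalog rule. A minimal sketch (hypothetical code, naive fixpoint iteration over the hasAncestor facts):

```python
# Hypothetical sketch of how a wrapper might materialize the transitive
# closure of a property such as hasAncestor, by iterating to a fixpoint.
def transitive_closure(pairs):
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (x, z) in list(closure):
            for (z2, y) in list(closure):
                # hasAncestor(X, Y) :- hasAncestor(X, Z), hasAncestor(Z, Y).
                if z == z2 and (x, y) not in closure:
                    closure.add((x, y))
                    changed = True
    return closure

ancestors = {("anna", "bert"), ("bert", "carl")}
print(sorted(transitive_closure(ancestors)))
# -> [('anna', 'bert'), ('anna', 'carl'), ('bert', 'carl')]
```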




     2.5.6. Querying Schema Information


As is already apparent from the RDFS schema definition, and as discussed in more detail in the recent RDF model theory, RDFS does not distinguish between data and schema level, and represents all information uniformly as a graph. There is no difference in principle between entities at different modeling levels (i.e. objects, classes and meta-classes are represented in a uniform way), and queries over an RDFS schema should not be more difficult than queries over RDFS data.
In order to express Datalog-like queries ranging over different abstraction levels, instead of writing properties as binary predicates we have to switch to a triple syntax using a ternary predicate “s”, i.e. instead of writing “title(X, ‘Artificial Intelligence’)”, we write “s(X, title, ‘Artificial Intelligence’)”. If we enforce the restriction that the predicate symbol “s” always denotes this special ternary predicate, we can also mix this notation with the binary one used so far in our examples. Generalizing the query from our running example a bit, we now want to ask for any additional properties our AI books might have, yielding the query:
aibook(X) :- title(X, 'Artificial Intelligence'), type(X, Book).
aibook(X) :- type(X, AI-Book).

book_property(P) :- s(P, rdfs:domain, Book).
ai_book_property(P) :- s(P, rdfs:domain, AI-Book).

ai_book_attribute(X, P, V) :- aibook(X), book_property(P), s(X, P, V).
ai_book_attribute(X, P, V) :- aibook(X), ai_book_property(P), s(X, P, V).

?- ai_book_attribute(X, P, V).



     2.5.7. Result Formats

             2.5.7.1. Standard Result Set Syntax


As a default, we represent query results as a set of tuples of variables with their bindings.
Referring to our example, there are two bindings for a single variable:
<edu:ResultSet rdf:about='#AI_Results'>
      <edu:hasResult>
            <edu:TupleResult>
                  <edu:hasBinding>
                        <edu:VariableBinding>
                              <edu:bindsVariable rdf:resource='#X'/>
                              <rdf:value rdf:resource='http://www.xyz.com/ai.html'/>
                        </edu:VariableBinding>
                  </edu:hasBinding>
            </edu:TupleResult>
      </edu:hasResult>

      <edu:hasResult>
            <edu:TupleResult>
                  <edu:hasBinding>
                        <edu:VariableBinding>
                              <edu:bindsVariable rdf:resource='#X'/>
                              <rdf:value rdf:resource='http://www.xyz.com/pl.html'/>
                        </edu:VariableBinding>
                  </edu:hasBinding>
            </edu:TupleResult>
      </edu:hasResult>
</edu:ResultSet>




             2.5.7.2. RDF Graph Answers

Another possibility, which has been explored recently in Web-related languages focusing on querying semi-structured data, is the ability to create objects as query results. In the simple case of RDF-QEL-1, we can return as answer object the graph representing the RDF-QEL-1 query itself, with all Edutella-specific statements removed and all variables instantiated. The result can be interpreted as the relevant subgraph of the RDF graph we are running our queries against. In other words, the answer graph contains sufficient information that running the query using only the data in the answer graph returns the same result as running the query against the original database.
<lib:Book rdf:about="http://www.xyz.com/ai.html">
      <dc:title>Artificial Intelligence</dc:title>
</lib:Book>


<lib:AI-Book rdf:about="http://www.xyz.com/pl.html"/>

When we use general RDF-QEL-i queries, we assume the structure of the answer graph to be
defined by the query literals (provided they are all binary predicates). All variables used in the
query literals are assumed to be existentially quantified, so if they are not instantiated during
the query evaluation, they are represented as anonymous nodes in the RDF graph.
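The construction of such an answer graph (instantiating the query's triple patterns with the computed bindings and keeping only triples actually asserted in the queried graph) can be sketched as follows. This is a hypothetical illustration; the '?'-prefixed variable convention is an assumption of this sketch, not Edutella syntax.

```python
# Hypothetical sketch: building an RDF graph answer by instantiating the
# query's triple patterns with each result binding.
def answer_graph(patterns, bindings, data):
    """patterns: triples whose terms may be variables such as '?X';
    bindings: list of dicts mapping variables to values;
    data: the queried triple set. Returns the matching subgraph."""
    graph = set()
    for binding in bindings:
        for s, p, o in patterns:
            triple = (binding.get(s, s), binding.get(p, p), binding.get(o, o))
            if triple in data:   # keep only triples actually asserted
                graph.add(triple)
    return graph

data = {
    ("http://www.xyz.com/ai.html", "rdf:type", "lib:Book"),
    ("http://www.xyz.com/ai.html", "dc:title", "Artificial Intelligence"),
    ("http://www.xyz.com/pl.html", "rdf:type", "lib:AI-Book"),
}
# The triple patterns of both disjuncts of the running example.
patterns = [("?X", "rdf:type", "lib:Book"),
            ("?X", "dc:title", "Artificial Intelligence"),
            ("?X", "rdf:type", "lib:AI-Book")]
bindings = [{"?X": "http://www.xyz.com/ai.html"},
            {"?X": "http://www.xyz.com/pl.html"}]
print(len(answer_graph(patterns, bindings, data)))  # -> 3
```

Running the query against this three-triple answer graph would return the same two bindings as running it against the full knowledge base.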




2.6. Registration Service and Query Mediators


The wrapper-mediator approach divides the functionality of a data integration system into two
kinds of subsystems. The wrappers provide access to the data in the data sources using a
common data model (CDM) and a common query language. The mediators provide coherent
views of the data in the data sources by performing semantic reconciliation of the CDM data
representations provided by the wrappers. Both the common data model (ECDM) and the common query language for the Edutella network have already been defined.
To mediate distributed data sources we use a two-layered approach: simple ’wrapping’ mediators distribute queries to the appropriate peer, with the restriction that each query can be answered completely by one Edutella peer; complex ’integrating’ mediators are able to mediate distributed queries over multiple repositories. The query syntax will be identical for both kinds of mediators. Additionally, each peer announces to the hub which query level it can handle (RDF-QEL-1, RDF-QEL-2, etc.). Whenever the hub receives queries, it uses these registrations to forward queries to the appropriate peers, merges the results, and sends them back as one result set.
The          packages       net.jxta.edutella.peer,   net.jxta.edutella.provider,
net.jxta.edutella.hub, net.jxta.edutella.consumer contain interfaces to handle the
distributed query mechanisms.
A schematic representation of the wrapper-mediator approach can be seen in Figure 7:




                             Figure 7. Query Mediator Wrapper




     2.6.1. Simple Wrapping Mediators


The first layer of functionality for distributed queries in the Edutella network will be based on
simple query hubs and wrapping mediation. While query hubs might have some wrapping
capability, our prototype peers will use them only as registration and query distribution peers
using the Edutella common data and query model, and implement wrapping capability (to and
from the common model) locally within the Edutella peer wrappers. Thus, each Edutella peer
offers a common query interface based on the common model (possibly at different levels as
defined by RDF-QEL-i) to the network.
Registration of peer query capabilities is based on (instantiated) property statements and schema information, telling the network which kind of schema the peer uses, with some possible value constraints (select conditions). These registration messages have the same syntax as RDF-QEL-1 queries and are sent from the peer to the registration / query distribution hub. Additionally, the peer announces to the hub which query level it can handle (RDF-QEL-1, RDF-QEL-2, etc.). Whenever the hub receives queries, it uses these registrations to forward queries to the appropriate peers, merges the results, and sends them back as one result set.
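A minimal sketch of this registration and routing behavior (hypothetical code; real Edutella registrations are RDF-QEL-1 messages, not plain property sets): peers register the schema properties they support, and the hub forwards a query only to peers whose registration covers all of the query's properties.

```python
# Hypothetical sketch of a query hub: peers register the schema
# properties they use; the hub routes a query to every peer whose
# registration covers the query's predicates.
class Hub:
    def __init__(self):
        self.registrations = {}   # peer name -> set of properties

    def register(self, peer, properties):
        self.registrations[peer] = set(properties)

    def route(self, query_properties):
        """Return the peers able to answer the query completely."""
        needed = set(query_properties)
        return [peer for peer, props in self.registrations.items()
                if needed <= props]

hub = Hub()
hub.register("peerA", {"dc:title", "rdf:type"})
hub.register("peerB", {"dc:creator"})
print(hub.route({"dc:title", "rdf:type"}))  # -> ['peerA']
```

The hub would then union the result sets returned by the selected peers into one result set, as described above.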




     2.6.2. Mediator Peers handle Distributed Queries


The second layer introduces query mediators or query hubs. These mediators bring in the extra
intelligence required to assemble distributed and heterogeneous queries. These more complex
mediators submit sub-queries to different repositories that might be able to answer them,
collect the sub-results, join and reconcile them, and again return the outcome to the client.
Several mediator servers will be available communicating through JXTA. Each mediator peer
has its own mediator meta-data schema and accesses meta-data from other mediators or data
sources. The views provided through the integrating mediators are transparently queryable using RDF-QEL-i.
Mediators can cooperate by being defined in terms of other mediators, i.e. the mediators are
composable. The composition of mediators allows for modularity and reuse of the view
definitions while avoiding the administrative and performance bottleneck of having a single
mediator system with a global schema. Different interconnecting topologies can be used to
compose mediator servers depending on the integration requirements. Queries to mediator
peers are decomposed into optimized distributed query plans.




CHAPTER 3. THE TAP SYSTEM.


3.1. Introduction


Activities such as XML Web Services and the Semantic Web are working to create a distributed
web of machine readable data. This Web is envisioned to be analogous to the current World
Wide Web (WWW), except it will consist of machine readable data targeted at programs, unlike
the WWW which consists of human readable pages targeted at humans.
The Semantic Web has the potential for having as big an impact as the human readable web.
However, given the differences between machines and humans, the Semantic Web will not be a
simple replacement of human readable formats (such as HTML and images) with machine
readable formats (such as XML and RDF). A number of problems will need to be solved before
the vision can be realized.




3.2. The Semantic Web


The Semantic Web is an extension of the current Web in which information is given well-defined
meaning, enabling programs to understand it. Given the rather broad scope of the definition of
the Semantic Web, it is important to be explicit about what we mean by the term Semantic
Web. In this section, we briefly describe some of the salient and distinctive features of the
Semantic Web that we assume.
The Semantic Web will contain resources corresponding not just to media objects (such as Web
pages, images, audio clips, etc.) as the current Web does, but also to objects such as people,
places, organizations and events. Further, the Semantic Web will contain not just a single kind
of relation (the hyperlink) between resources, but many different kinds of relations between the
different types of resources mentioned above. More concretely, we assume that the data on the Semantic Web is modeled as per the RDF data model, i.e., as a directed labeled graph, wherein each node corresponds to a resource and each arc is labeled with a property type. Figures 8 and 9 show two small chunks of the Semantic Web, the first corresponding to the cellist Yo-Yo Ma and the second corresponding to the person Eric Miller. These two examples illustrate several salient aspects of the Semantic Web that we discuss below.
Documents versus Real World Objects: The Semantic Web is not a Web of documents, but a Web of relations between resources denoting real world objects, i.e., objects such as people, places and events. In the first example we have resources such as the city of Paris, the musician Yo-Yo Ma, an auction event, the music album Appalachian Journey, etc. In the second example, we have the person Eric Miller, the W3C Semantic Web Activity, the organization W3C, the city of Dublin, Ohio, etc.



Human versus Machine Readable Information: In Figure 9, we have a resource corresponding to Eric Miller. This is not the string ”Eric Miller”, but a resource denoting a person. There are many people with the name Eric Miller; this resource denotes only one particular person with that name. The salient point about the Semantic Web is that it contains rich machine readable information about these resources. Compare the data in Figure 9 with Eric Miller’s home page (http://www.w3.org/People/EM/). Eric’s home page contains more human readable information than Figure 9, but almost all the machine understandable parts of that Web page correspond to how it should be displayed by a browser. The data in Figure 9, on the other hand, is almost entirely machine understandable. It states, in a machine understandable language, that Eric is a person, who works for the W3C, etc.
Relation between the HTML & Semantic Webs: The Semantic Web is an extension of the current Web. As Figure 9 shows, there is a rich set of links from the nodes in the Semantic Web to HTML documents. These relations typically connect a concept in the Semantic Web with the pages that most pertain to it. It is also possible that some of the pages in the current Web contain semantic markup. However, the architecture and system described here do not use such markup. We assume that robots will gather such markup so that it is available on the Semantic Web.
Distributed Extensibility: Another important aspect of the Semantic Web is that different sites may contribute data about a particular resource. In the example shown in Figure 8, many different sources have data about Yo-Yo Ma and related resources. Amazon and CDNow have data about his albums, EBay has data about auctions related to these albums, TicketMaster has data about his concert schedule, AllMusic has data about where he was born (Paris), and so on. Each of these sites can publish data about Yo-Yo Ma without getting permission from any centralized authority, i.e., they can all extend the cumulative knowledge on the Semantic Web about any resource in a distributed fashion. This distributed extensibility is a very important aspect of the Semantic Web.




       Figure 8. A segment of the Semantic Web pertaining to Yo-Yo Ma



          Figure 9. A segment of the Semantic Web pertaining to Eric Miller




3.3. Query Interfaces for the Semantic Web


The Semantic Web is going to include a large number of servers that provide data about a wide
range of topics. Client programs, which could be on desktops or running on other servers, can
query these servers for this data. We need a standard way of querying for this data.
There has been much work in the area of query interfaces to graph structured data, in the fields
of Artificial Intelligence and Databases and more recently, in the XML and RDF communities. In
the next subsection, we briefly survey some of the relevant work.




      3.3.1. Related Work


Graph structured models of data, such as Frames and Semantic Networks, have been popular in Artificial Intelligence for a very long time. Various formalizations of these concepts, such as the Logic of Frames and Description Logics, have been studied.

Many systems such as KRL, KL-One, Classic and Loom, which have included query interfaces,
have been built. There has also been work in the database community, most prominently Lore,
which has focused on storing and querying graph oriented semi-structured data.
More recently, a number of query languages have been proposed and implemented for RDF and
DAML.
A common feature of all these query languages is that they attempt to be sufficiently expressive that a very large number of different kinds of queries, including those which might be computationally expensive to process, can be stated in the language. In the next subsection, we argue that such query languages are inappropriate for the Semantic Web.




       3.3.2. Deployability versus Expressiveness


The public Web, in contrast to intranets behind firewalls, is extremely unpredictable. We argue
that a query interface for the public Semantic Web, which every site on the Semantic Web must
provide, needs to satisfy the following requirements.
Simplicity: It should be straightforward for a site to offer its data via the query interface. By this, we are not referring to the effort required to initially develop the interface to the data, but rather to the computational expense of providing such an interface on an ongoing basis.
Predictability: The behavior of queries should be very predictable. This is important both from the perspective of the data provider and from the perspective of the client. It would be highly undesirable if the same query were to sometimes work but not work at other times.
Expressive query interfaces are inherently difficult to support. They are especially difficult to
support when the data is highly distributed, as it will be on the Semantic Web. It is very easy
for someone to (intentionally or unintentionally) construct queries that consume vast amounts of
resources. One solution is to limit the amount of computational resources that a query may
consume. However, predicting this in the general case is quite difficult. Further, the amount of
computational resources available is a function of the system load, the hardware being used, the
query processing engine being used, etc. It is not possible for a client to know what these
parameters are for a given site at a given time. Consequently, unless the query is extremely simple, a
client cannot know whether its query will be answered.
The Web provides a good case study. Many big websites use traditional relational databases to
store their data. Middleware scripts or servlets issue SQL queries against these databases to
generate pages. In contrast to the interface provided by the back end database to the
middleware, the interface provided by the site to the client, typically HTTP, is much simpler.
Though the URL argument passed via HTTP could require the execution of SQL queries, these
SQL queries are predetermined and known to be well behaved. No major website provides the
public with a SQL interface to their back end relational database. Even sites which make the
data available in a raw form (such as Yahoo! Quotes) do so via an HTTP interface and not
via SQL or some other general purpose data query language.
Similarly, we believe that expressive query systems will not be widely made available to the
general public by sites on the Semantic Web. Instead, we argue for an approach in which all
Semantic Web sites provide a lightweight query interface, one that is both easier to support
and exhibits predictable behavior.
A simple lightweight query interface would be complementary to more complete query
languages such as RQL or DQL. The former is targeted at querying on the open, uncontrolled
Web, where the site might not have much control over who is issuing queries, whereas the latter

is targeted at the comparatively better behaved and more predictable area behind the firewall.
The lightweight query also does not preclude particular sites from aggregating data from
multiple sites and providing rich query interfaces into these aggregations.
The next subsection describes a simple interface called GetData.




       3.3.3. GetData


GetData is intended to be a simple query interface to network accessible data presented as directed
labeled graphs. For reasons explained in the previous section, GetData is not intended to be a complete or
expressive query language a la SQL, RQL or DQL. It is intended to be very easy to build, support and use,
both from the perspective of data providers and data consumers.
GetData allows a client program to access the values of one or more properties (or their inverses) of a
resource from a graph. Each queryable graph has a URL associated with it. Each GetData query is a SOAP
message addressed to that URL. The message specifies two arguments: the resource whose properties
are being accessed and the properties that are being accessed. Optional arguments specify whether the
client wants the inverse of the properties, the number of answers desired, etc.
The answer returned for a GetData query is itself a graph which contains the resource (whose properties
are being queried) along with the properties specified in the query and their respective targets/sources.
Application programming interfaces hide the details of the SOAP messages and XML encoding from the
programmer. An application using the Semantic Web via GetData gets an API which (in an abstract
syntax) looks like:
GetData(<resource>, <property>) -> <value>
Below are some examples of GetData, in an abstract syntax, operating against the graphs shown in
figures 7 and 8.
GetData(<Yo-Yo Ma>, birthplace) -> <Paris,France>
GetData(<Paris, France>, temperature) -> 57 F
GetData(<Eric Miller>, livesIn) -> <Dublin, Ohio>
<Yo-Yo Ma>, <Paris, France>, <Eric Miller> and <Dublin, Ohio> are references to the resources
corresponding to Yo-Yo Ma, the city of Paris, France, the person Eric Miller and the city of Dublin, Ohio.
Typically, references to resources are via the URI for the resource. In this case, using the TAP KB, these
URIs are
http://tap.stanford.edu/data/MusicianMa, Yo-Yo
http://tap.stanford.edu/data/CityParis, France
http://tap.stanford.edu/data/W3CPersonMiller, Eric and
http://tap.stanford.edu/data/CityDublin, Ohio.
Each of the above GetData queries is a SOAP message whose end point is the URL corresponding to the
graph with the data.
GetData also allows reverse traversal of arcs, i.e., given a resource, a client can request for resources that
are the sources of arcs with a certain label that terminate at that resource. Also, a particular call to
GetData may specify a list of properties all of whose values are requested, e.g.:
GetData(<Eric Miller>, livesIn lastName firstName)
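The calls above can be mimicked by a small in-memory sketch over a graph of (source, property, target) triples. The graph contents and the function below are illustrative assumptions, not the TAP KB or its actual API:

```python
# A directed labeled graph as (source, property, target) triples.
# Illustrative data only; not the actual TAP knowledge base.
GRAPH = [
    ("Yo-Yo Ma", "birthplace", "Paris, France"),
    ("Paris, France", "temperature", "57 F"),
    ("Eric Miller", "livesIn", "Dublin, Ohio"),
    ("Eric Miller", "lastName", "Miller"),
    ("Eric Miller", "firstName", "Eric"),
]

def get_data(resource, properties, inverse=False, graph=GRAPH):
    """Return {property: [values]} for the requested properties of resource.

    With inverse=True, arcs are traversed backwards: we return the sources
    of arcs with the given labels that terminate at the resource.
    """
    answer = {p: [] for p in properties}
    for s, p, t in graph:
        if p in answer:
            if not inverse and s == resource:
                answer[p].append(t)
            elif inverse and t == resource:
                answer[p].append(s)
    return answer
```

In this sketch, GetData(<Yo-Yo Ma>, birthplace) corresponds to get_data("Yo-Yo Ma", ["birthplace"]), a multi-property call passes several property names at once, and the inverse flag covers the reverse traversal of arcs described above.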



In addition to the core GetData interface, there are two other interfaces which enable graph exploration.
These are:
Search: The search interface enables a client to identify resources which have a particular
string as a substring of their title, keyword or, more generally, one of their “title properties”.
Reflection: The reflection interfaces, which are similar to the reflection interfaces provided by
object oriented languages, return lists of arcs coming into and going out of a node. This is very
useful for exploring a graph in the vicinity of a node without any knowledge of what might be around.
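The reflection interface can be sketched over the same kind of triple graph; the graph contents and function name here are illustrative assumptions:

```python
def reflect(node, graph):
    """Sketch of the reflection interface: list the labels of arcs going
    out of and coming into a node, so a client can explore the graph in
    the vicinity of a node without prior knowledge of its schema."""
    outgoing = sorted({p for s, p, t in graph if s == node})
    incoming = sorted({p for s, p, t in graph if t == node})
    return outgoing, incoming

# Illustrative triples; not the actual TAP knowledge base.
EXAMPLE_GRAPH = [
    ("Yo-Yo Ma", "birthplace", "Paris, France"),
    ("Paris, France", "locatedIn", "France"),
    ("Paris, France", "temperature", "57 F"),
]
```

Calling reflect("Paris, France", EXAMPLE_GRAPH) reports that temperature and locatedIn arcs leave the node and a birthplace arc arrives at it, without the client knowing anything about the node in advance.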




3.4. Semantic Negotiation
        3.4.1. Motivation


Today, behind most big websites, there are one or more databases driving the website. Even though the
website may be accessible to everyone on the Internet, the databases are hidden behind firewalls and not
accessible to everyone. The Semantic Web will make the raw data in these databases accessible to
everyone. This will be a substantial improvement. However, this is not enough.
The world of relational databases is highly fragmented. We have islands of data, each within a company or
department, internally integrated and coherent, but difficult, if not impossible to integrate. For example,
Amazon.com, CDNow, AOL Shopping, the department of motor vehicles, etc. all have data about Yo-Yo
Ma, each in their own database. Integrating all this information into a coherent whole is extremely difficult.
What makes it difficult to integrate the data is not the process of extracting the data from the databases
involved, but the fact that each of them uses different schemas and different names to refer to Yo-Yo Ma.
Our goal is to create a schematically unified, coherent Semantic Web, so that the data is not in
many small silos, but can be viewed as one large virtual database. Ideally, we would like a scenario
wherein if two sites both have a certain piece of data (such as the birthplace of Yo-Yo Ma), a client can
issue the exact same query to either/both of them to retrieve the data, i.e., the formulation of the query
should be purely a function of what the client is trying to retrieve, not who it is trying to retrieve it from.
For this, we need a way of referring to resources that is both independent of the site and is not dependent
on a global name for that resource.
Figures 7 and 9 illustrate what we mean by this. In figure 9, we have a number of sites publishing data
about Yo-Yo Ma, each using a different name for him and the various concepts associated with him.
What we mean by ”Schematically unified Semantic Web” is that it should be possible for a client to get the
view of all this data as shown in figure 7.
Our goal is to create a unified view into all this data. In order to gain this unified view of the Semantic
Web, we need to solve the problem of different sites using different names for the same concept. We
believe that the mechanisms for overcoming these problems have to be built into the foundations of the
query system of the Semantic Web.




[Figure: several sites publish overlapping data about Yo-Yo Ma and Paris, France, each under its
own local identifiers such as USNC0491, NTNC and 328723677.]

                   Figure 10. Data about Yo-Yo Ma from different sources




     3.4.2. Problems with Global Names


In the examples of GetData in the last section, we used the URIs of resources (individuals and properties)
to refer to the resources and properties. URIs provide a simple and uniform way of referring to resources
which solves the name disparity problem. Global names (like URIs, FIPS codes, ISBN numbers, etc.)
provide a firm basis for sharing references. If two parties are exchanging data about a book, that book’s
ISBN number is a good way of referring to the book whereby both parties can be quite sure that they are
referring to the same thing.
This system is not without its flaws. There is an overhead in assigning every book a unique ISBN number
and in everyone who deals with books tracking the book’s ISBN number. These problems get magnified in
the context of other kinds of objects. In fact, ISBN numbers are probably one of the few successful
naming schemes out there. To date, we do not have uniformly used naming schemes for most other
kinds of things such as people, places or events. The complexities involved in creating and tracking
canonical names are quite severe.
Canonical naming schemes such as URIs and ISBN numbers have other shortcomings as well. For
example, Shakespeare’s Hamlet does not have an ISBN number. Particular editions from particular
publishers each have their own ISBN number, but the general book does not have one. If one wants to
refer to Hamlet in general, ISBN numbers are not of any help.



One instance where global names have succeeded very well is the use of URIs for media objects on the
web. Unfortunately, it will not be easy to extend this to other kinds of resources. URIs for real-world
objects (such as people and places) are different from URIs for web-only objects such as web pages and
images. With the latter, there is almost always a unique ”owner” or publisher of the object who has the
final say in specifying the URL for that object. Further, since the object is defined by its URL, it is not
possible (or at least relatively rare) for two parties to use different URLs for the same object. In the case of
people, places, etc., it is common for different sites to use different URIs for the same object. To take a
real world example, consider the musician Yo-Yo Ma. No one ”publishes” him. He exists independent of
any identifier for him. Information about him can be found at a number of different sites, including People
Magazine, Amazon and CDNow. Each of these sites uses a different internally unique identifier to identify
Yo-Yo Ma. As another example, consider the city of Paris, France. Different sites (such as weather.com,
Yahoo! and Lonely Planet) contain information about Paris and each uses a different identifier for it.
The phenomenon of different sites referring to the same object in different ways has especially nasty
effects when it comes to integrating data from these multiple sites. The actual query that is issued to each
site is a function of that site's identifiers for the objects in the query. Typically, this problem is solved by
the client maintaining a ”mapping table”, which stores the mapping between identifiers for the same
object between different sites. This approach is shown in figure 10.
This solution is not only inelegant, but also quite expensive. These mapping tables, which can get quite
large, are expensive to create and maintain. Because of this, it is no longer possible to have a simple,
light-weight client that can use the Semantic Web. Clearly, we need a better solution than mapping tables.




[Figure: a calling program integrates data from several sites by maintaining a mapping table
between their local identifiers, e.g., USNC0491 <-> NTNC and 328723677 <-> 0,9855,109971,09.]

                   Figure 11. The use of Mapping Tables for data integration




      3.4.3. Bootstrapping


 As we argued earlier, it is also quite improbable that we will get unique names for all the kinds of objects
 we would like to exchange data about. At the same time, we also do not believe that it will be possible for
 two agents to be mutually comprehensible and interchange data without any shared vocabulary. Hence,
 despite the argument made above, standardized global names, such as URIs, do have an important role
 to play. In order to define this role, it is important to draw a distinction between names for general
 concepts (e.g., classes such as People or City and properties such as hasAuthor or title) and individuals
(such as Yo-Yo Ma or Paris, Texas). There are far more of the latter than of the former. While it is feasible
 and indeed useful to get agreement on URIs for the former, the use of URIs or any other kind of global
 names for the latter is not practical.



We can expect two agents trying to exchange data to share some but not all of their vocabularies. The
goal of Semantic Negotiation is to enable two agents to bootstrap from a small shared vocabulary to a
larger shared vocabulary. They do this by exchanging and clarifying descriptions of the resources they
wish to create a shared vocabulary for. The descriptions they exchange are in a formal language and only
use vocabulary they already share.




      3.4.4. Descriptions


A description is a directed labeled graph, involving at least one unlabelled node (the node being
described), with all other nodes reachable from this node by traversing arcs either in their specified
direction or in the reverse direction. Formally, we recursively define a description D of an object O as a
conjunction of atomic formulae as follows:
1. p(O;A), where A is a constant or existentially quantified variable and p is a property type, is a
description of O. The description is said to refer to A.
2. p(A;O), where A is a constant or existentially quantified variable and p is a property type, is a
description of O. The description is said to refer to A.
3. if D1 is a description of O which refers to A, then D1 ^ p(A;B), where p is a property type and B is a
constant or existentially quantified variable, is a description of O.
4. if D1 is a description of O which refers to A, then D1 ^ p(B;A) where p is a property type and B is a
constant or existentially quantified variable is a description of O.
5. if D1 and D2 are descriptions of O, D1 ^ D2 is a description of O.
Expressed as Datalog, it is easy to see that descriptions are all regular path expressions (with
inverse links) on the graph. We are particularly interested in the class of descriptions that do
not involve inverse links, since they can be expressed via a single XML element in the RDF
serialization. Stated in logical terms, we restrict our attention to the case where the description
is of the form


p1(O, a) ∧ p2(O, b) ∧ … ∧ q1(O, y) ∧ q2(y, c) ∧ …                 (1)


where p1, p2, q1, q2, … are predicates/relations, a, b, c, … are constants, x, y, … are variables, and
O is the object being described. It is easy to see that this corresponds to the case where the description
can be expressed in RDF using a single XML element.
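Evaluating a description of form (1) against a graph can be sketched mechanically. The nested-dictionary encoding of descriptions below is an assumption made for this illustration:

```python
def satisfies(obj, description, graph):
    """Check whether obj satisfies a description of form (1): a conjunction
    of property constraints with no inverse links. The description maps
    each property either to a constant, or to a nested description that at
    least one target of that property must satisfy (the q1/q2 case)."""
    for prop, expected in description.items():
        targets = [t for s, p, t in graph if s == obj and p == prop]
        if isinstance(expected, dict):
            # q1(O, y) ∧ q2(y, c): some target y must satisfy the nested part.
            if not any(satisfies(t, expected, graph) for t in targets):
                return False
        elif expected not in targets:
            return False
    return True
```

For example, {"lastName": "Ma", "birthplace": {"locatedIn": "France"}} encodes the description lastName(O, "Ma") ∧ birthplace(O, y) ∧ locatedIn(y, "France").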




                3.4.4.1. Discriminant Descriptions


Each agent (involved in the negotiation) has some data about the resource, any subset of which
is a description of the resource. On most sites, there will be many descriptions of each resource.
Some of these descriptions are Discriminant in that they are adequate to distinguish it from all
other resources known to that site, i.e., δ is a discriminant description of the object O on the
site A iff

∀A x (δ(x)A → (x = O))                                        (2)

where ∀A is the universal quantifier over the domain of objects on site A and δ(x)A means that
x satisfies the description δ on site A.

A discriminant description may be very simple and involve a single property or be complex,
involving many different properties. Even at a given site, an object may have multiple
discriminant descriptions. In the degenerate case, the URI of an object is a simple discriminant
description.
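Formula 2 can be checked mechanically. In the sketch below, descriptions are simplified to flat sets of (property, value) pairs; this encoding is an assumption made for illustration:

```python
def matches(obj, description, graph):
    """True iff obj has every (property, value) pair in the description.
    Descriptions here are flat conjunctions, a simple special case."""
    facts = {(p, t) for s, p, t in graph if s == obj}
    return description <= facts

def is_discriminant(description, obj, objects, graph):
    """Formula 2: the description is discriminant for obj on a site iff
    obj is the only object on that site satisfying it."""
    return [x for x in objects if matches(x, description, graph)] == [obj]
```

With two composers named Bach on a site, the last name alone is not discriminant for either of them, while adding the first name makes the description discriminant.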




                3.4.4.2. Database Keys


Database keys are a special case of discriminant descriptions. If δ is a key for data on the site
A, then we have

∀A x, y ((δ(x)A ∧ δ(y)A) → (x = y))                          (3)

Notice that this is a much stronger statement about δ than formula 2. Formula 2 applies at the
instance level, for one particular object O. Formula 3 on the other hand, applies to all objects on
site A. The following example illustrates the difference between keys and the more general and
weaker concept of discriminant descriptions.
Example: Consider two sites, both of which have information about musicians. Each site uses
its own internal identifiers to identify each musician. Associated with each musician are several
pieces of information, such as his/her first name, last name, compositions, albums, etc. The
two sites are trying to map each other's local identifiers for musicians. Let us assume that they
have the same coverage of some particular kind of musician, say Western classical composers,
and the above-mentioned properties for all the composers. Since there are multiple distinct
composers with the same last name (e.g.,
Bach) we know that just the last name alone is not an adequate key. A key for musicians might
involve the last name, first name and date of birth and/or some other combination of
properties. However, for certain last names, such as Tchaikovsky, we know that there is only
one composer with that last name. So, lastName(O, “Tchaikovsky”) is an adequate shared
discriminant description for Tchaikovsky, even though lastName is not a key for composers.
So, even if the two sites don’t share a key, by using discriminant descriptions, they might still
be able to map a substantial fraction of the domain.
As shown in the above example, while shared keys provide a strong constraint, they also require
more shared information and are hence less widely applicable. On the other hand, it is often
possible to identify shared discriminant descriptions for subsets of the domain, even when no
shared key exists.
In rare cases, two different sites might use the same key for a particular kind of object. In such
cases, if we can assume that an object is in the domain of both sites, we can use the key to
map the object across the two sites. In the more frequently occurring case where there is no
shared key but some shared vocabulary, Semantic Negotiation tries to identify a shared
discriminant description to do the mapping.
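The difference between formulas 2 and 3 can be made concrete with a small sketch mirroring the Bach/Tchaikovsky example; the triple encoding is an assumption for illustration:

```python
def is_key(properties, objects, graph):
    """Formula 3: a set of properties is a key iff no two distinct objects
    on the site agree on the values of all of those properties."""
    seen = set()
    for x in objects:
        vals = tuple(sorted((p, t) for s, p, t in graph
                            if s == x and p in properties))
        if vals in seen:
            return False        # two objects share these property values
        seen.add(vals)
    return True
```

Here lastName alone fails as a key because two Bachs share it, while the pair (lastName, firstName) succeeds; yet, as the example notes, lastName(O, "Tchaikovsky") is still discriminant for the one composer with that name.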




              3.4.4.3. Using Discriminant Descriptions


Semantic Negotiation is the process of going back and forth to identify a common discriminant
description of the resource that the two parties are trying to establish a shared reference for. If
such a description can be established and the two assumptions given below can be made, it
follows that the description identifies the same resource on both sides. I.e., if x on site A and y
on site B both satisfy the shared discriminant description δ on A and B respectively, then they
must refer to the same object. Such a description δ, since it is part of both sites, will only use
vocabulary that is shared.
In order to use shared discriminant descriptions to map resources across sites, we need to
make the following two assumptions.
    The resource being mapped exists in the domain of both sites.
    Both sites have complete values for all the objects (on each site) for the properties used
     in the discriminant description.
We recognize that these two assumptions are rather strong. In many cases, we might not be
able to a priori ascertain that the object exists in the domains of both sites. Further, web data
sources, unlike traditional databases, often don’t have complete information about objects in
their domains.
Semantic Negotiation utilizes the classic Computer Science technique of introducing one more
level of indirection. If two parties do not share a name for an object, they use descriptions of
the object, based on names they do share, to come to an agreement on the name of the object.
We justify this added level of indirection based on the reduction in the number of terms for
which names need to be shared.




     3.4.5. GetData and Semantic Negotiation


Semantic Negotiation is a general purpose mechanism that can be used in many different
contexts for synchronizing names and schemas. GetData uses Semantic Negotiation to reduce
its dependency on the URIs of resources being shared between the caller and provider of
GetData. More specifically, GetData allows for the resource whose properties are being queried
to be referred to via descriptions. The provider of GetData is expected to interpret the
description, map it to one of its resources and return the answer. Semantic Negotiation, in the
context of GetData, proceeds as follows:
   1. The client sends the server a GetData query, using the description to refer to the object
      whose property is being accessed. The description is assumed to be discriminant on the
      client.
   2. If the server does not understand the description, i.e., it uses terms that server does not
      know about, the server responds with an error code indicating that the description was
      not understood. In this case, it also lists the particular terms not understood and if the
      description included the class of the object, and the class was understood, it might
      include some of the properties of that class it does know about. Based on this feedback,
      the client can try to provide a description that the server is more likely to understand.




   3. If the server understands the description but there are no objects matching that
      description, it returns an error code saying so. It can optionally also tell the client which
      fragment of the description was not satisfied by any of its objects.
   4. If the server understands the description but there are multiple objects matching the
      description, it returns an error code saying so. In this case, depending on how many
      different objects match, the server may return a list of these, along with descriptions
      that are discriminant on the server. The client may choose one amongst these and retry
      the query.
   5. If the server understands the description and there is a single object matching the
      description, it returns the values that were requested. In the case where the answer is a
      list of resources, the answer may include additional data about each resource, which the
      client may cache, in anticipation of future queries about these resources. This is just a
      form of proactive caching. Optionally, this additional data may also include the server’s
      names for these resources so as to reduce the need for negotiation for the names of
      these resources.
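The five cases can be sketched as server-side dispatch logic. The flat-dictionary descriptions and the return codes below are simplifying assumptions for this sketch, not part of the GetData specification:

```python
def answer_getdata(description, properties, site_vocab, site_objects, graph):
    """Handle a GetData query that refers to an object by description,
    following the five negotiation cases in the text."""
    triples = set(graph)
    unknown = sorted(p for p in description if p not in site_vocab)
    if unknown:
        # Case 2: the description uses terms the server does not know.
        return ("not-understood", unknown)
    matches = [x for x in site_objects
               if all((x, p, v) in triples for p, v in description.items())]
    if not matches:
        # Case 3: description understood, but no object satisfies it.
        return ("no-match", None)
    if len(matches) > 1:
        # Case 4: ambiguous; the client should refine the description and retry.
        return ("ambiguous", matches)
    # Case 5: a unique match; return the requested property values.
    obj = matches[0]
    return ("ok", {p: [t for s, q, t in triples if s == obj and q == p]
                   for p in properties})
```

A query describing the unique Tchaikovsky succeeds immediately, while one describing "the composer with last name Bach" triggers the ambiguous case and a retry with a richer description.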
In an experiment with data about 5000 randomly chosen musicians from the websites of
Amazon.com and CDNow, the TAP system was able to match all of the musicians on the two
sites using semantic negotiation and sharing of 9 general terms (the classes Composer and
MusicAlbum and the properties type, firstName, lastName, musicGenre, title, hasAuthor and
hasPublisher). This vast reduction (almost two orders of magnitude) in the number of terms that
need to be shared justifies the added level of indirection.




     3.4.6. Query Complexity


We argued for GetData based on the need for simplicity and predictability in query languages.
The incorporation of descriptions to identify objects goes against the spirit of simplicity which
motivated GetData. In order to recoup some, if not all, of the simplicity and predictability, we do
the following:
    Mappings between objects, once generated, can be cached. Descriptions used by
     GetData can include information about each site’s local identifiers. Similarly, answers to
     GetData also include local identifiers. This enables easy generation of mappings between
     identifiers, which can then be cached.
    Descriptions are limited to regular path expressions without inverses. Such queries are
     much easier to process than queries in more general query languages such as Datalog.
    Descriptions, as used for Semantic Negotiation in GetData have a particular purpose,
     i.e., to identify an object. In addition to the regular path expression constraint, this
     allows processors to further optimize the processing of descriptions.
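The first of these points can be made concrete. The class below is an illustrative assumption about how a client might cache negotiated identifier mappings; none of the names come from TAP:

```python
class MappingCache:
    """Once semantic negotiation resolves a description to a remote site's
    local identifier, cache the (our-id, site) -> their-id mapping so that
    later GetData calls can skip negotiation entirely."""

    def __init__(self):
        self._map = {}

    def record(self, our_id, site, their_id):
        self._map[(our_id, site)] = their_id

    def lookup(self, our_id, site):
        # Returns None on a cache miss, signalling that negotiation is needed.
        return self._map.get((our_id, site))
```

A miss falls back to full semantic negotiation; a hit lets the client query the site directly with the site's own identifier.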




     3.4.7. Failure Modes


Semantic Negotiation is a form of loose coupling. As with all forms of loose coupling, we can
have a failure to couple. Some of the ways in which this can happen include the following.
The two parties involved might not share enough vocabulary for there to be a shared
discriminant description. In the earlier example of Yo-Yo Ma, if one of them was a music store
and the other was a hospital, they might not have enough shared vocabulary items to share a
common discriminant description, in which case we have to fall back to using URIs to identify
Yo-Yo Ma. Alternately, they might both share common vocabularies with a third party who
might be able to broker the relationship.
In addition to shared vocabulary items (like ”Musician” and ”firstName”), semantic negotiation
also relies heavily on sharing literals. For example, it is quite possible that one site uses the first
name ”Yo-Yo” and the other uses ”YoYo”. Similar and worse problems could occur with dates,
phone numbers, etc. We are relying on the use of heuristic canonicalization schemes to reduce
the number of these errors.
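One such heuristic canonicalization scheme can be sketched as follows; this particular normalization is an assumption for illustration, not the scheme TAP actually uses:

```python
import re

def canonicalize(literal):
    """Heuristic canonicalization sketch: lower-case the literal and strip
    punctuation and whitespace, so that variants such as "Yo-Yo" and
    "YoYo" compare equal when matching shared literals."""
    return re.sub(r"[^a-z0-9]", "", literal.lower())
```

More elaborate schemes would be needed for dates or phone numbers, where the same value can be written in structurally different formats.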
One or both of the two assumptions described in the previous section might not hold, leading to
situations in which semantic negotiation causes the two sites to misidentify different objects as
being the same. For example, two sites, one of which only knew about cities in Texas and the
other only about cities in France, might agree to use a locally discriminant description (a city
named Paris) and thereby misidentify the two Parises.




     3.4.8. Scope of Semantic Negotiation


The examples in the previous section involved negotiating shared names for individuals such as
musicians. Though in practice we expect most of the use of Semantic Negotiation to be for such
purposes, in principle, there is no reason why this strategy cannot be used to bootstrap more
abstract/core vocabulary terms such as classes and properties.
Descriptions for classes and properties could be extensional, i.e., they list a number of uses of
the class or property. The goal is to find enough shared instances of usage of the term so as to
reduce the probability of accidental sharing to an acceptably low level. So, for example in the
case of classes, the two parties, in addition to identifying shared super classes, can try to
identify shared members. If both parties have a sufficiently large number of commonly shared
members and if there are no objects known to both parties which are a member of the class
only on one side, the parties may choose to conclude that their respective classes are the same.
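The extensional membership test described above can be sketched as follows. The overlap threshold is an invented parameter, since the text does not specify one:

```python
def classes_match(members_a, members_b, known_to_both, threshold=0.9):
    """Extensional class-matching sketch: conclude two classes are the
    same iff (a) no object known to both parties is a member on only one
    side, and (b) the shared membership overlap is large enough."""
    a, b = set(members_a), set(members_b)
    # (a) an object known to both, but a member on only one side, vetoes.
    if (a ^ b) & set(known_to_both):
        return False
    # (b) require a large fraction of shared members.
    return len(a & b) >= threshold * min(len(a), len(b))
```

The threshold controls how low the probability of accidental sharing must be before the parties accept the identification.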




     3.4.9. Related Work


There is a substantial body of work on the problem of integrating data from different sources.
One focus of research has been on integrating data from sources that use different mechanisms
for storing and retrieving their data. Systems such as TSIMMIS and InfoMaster use the concept
of mediators to bridge these differences. We assume that all data sources make their data


available as a directed labeled graph accessible via GetData. So in that sense, the GetData
interface is similar in spirit to the concept of mediators.
Another area of research has been on the problem of schema differences between databases.
Systems, such as InfoSleuth, Information Manifold and InfoMaster have focused on bridging
differences in the schema so as to provide the user with a unified view of all the data. The
typical approach has been to map the schemas into a “World View” or more general schema, in
a more expressive language and use this representation of the schemas to translate between
databases. Systems such as Clio provide mechanisms to semi-automatically generate mappings
between the schemas of different databases. All these systems assume that individuals (such as
individual people, places, etc.) have uniform names or keys.
Heuristic ontology and schema mappings have been studied in systems such as PROMPTDIFF.
These approaches are targeted at merging two ontologies which have “forked” from a common
origin, which is very different from the problem of integrating independently developed data
sources on the Semantic Web.
In all these systems the integration is a posteriori, i.e., the databases have been constructed
and exist and need to be integrated. Further, the integration setup is offline or batch mode.
This work differs from previous work in the following aspects. First, the focus is on creating an
infrastructure for publishing data where integration is not an a posteriori concern but a primary
aspect of the architecture. Second, the focus is on dynamic or on-line data integration as
opposed to previous systems in which the integration was either done in batch mode or
involved a batch mode schema integration step. Finally the interest is on instance-level
integration, not schema-level integration.
Recent work in the area of probabilistic representation languages is related to our work. It too
tries to solve the problem of identity, except it does not have a negotiation component. The INS
system, which uses intentional names to address network components such as printers, is
similar to our use of descriptions to refer to entities.




3.5. Registries


Different sites/graphs have different kinds of information (i.e., different properties) for different
types of resources. Each GetData request is targeted at a particular URL corresponding to a
graph which is assumed to have the data. Once we have a number of these graphs, keeping
track of which graph has what data can become too complex for a client to handle.
We use a simple registry, which is available as a separate server, to keep track of which URL
has values for which properties about which classes of resources. The registry can be abstracted
as a simple lookup table which given a class and a property returns a list of URLs which might
have values for that property for instances of that class. More complex registries, based on
descriptions of objects, are also available as part of TAP. With the registry in place, a client can
direct the query to the registry which then redirects the query to the appropriate site(s).
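The lookup-table abstraction can be sketched as follows. This is a simplified, self-contained illustration; the class, property and URL strings are hypothetical, and the actual TAP registry interface may differ.

```java
import java.util.*;

public class SimpleRegistry {
    // Maps "class|property" to the list of URLs that might have
    // values for that property for instances of that class.
    private final Map<String, List<String>> table = new HashMap<>();

    // Register that `url` may have values of `property` for instances of `cls`.
    public void register(String cls, String property, String url) {
        table.computeIfAbsent(cls + "|" + property, k -> new ArrayList<>()).add(url);
    }

    // Given a class and a property, return candidate GetData endpoints.
    public List<String> lookup(String cls, String property) {
        return table.getOrDefault(cls + "|" + property, Collections.emptyList());
    }
}
```

A client would consult `lookup` first and then issue its GetData request against each returned URL.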
Querying many different sites dynamically can result in a high latency for each query. In order
to provide reasonable performance, we cache the responses to GetData requests. This caching
is part of the registry’s functionality.
What we are trying to do here is in many ways similar to what the Domain Name Service (DNS)
does for data about internet hosts. DNS provides a unified view of data about internet hosts
sitting across millions of sites on the internet. Different kinds of servers store their host


information differently. But what DNS does is to provide a uniform view into all this data so that
a client can pretend that all of this data is sitting locally on their name server. Taking the
analogy further, one could regard GetData as an extension of GetHostByName, the core query
interface for DNS. Just as GetHostByName enables a client to query for one particular property
(the IP address) of one class of objects (Internet Hosts), GetData enables a client to query for
many different kinds of properties of many classes of objects.




3.6. Trust


An important aspect of the Semantic Web is that different sites may contribute data about a
particular resource. For example, many different sites have data about Yo-Yo Ma. Amazon and
CDNow have data about his albums, EBay has data about auctions related to these albums,
TicketMaster has data about his concert schedule, AllMusic has data about where he was born
(Paris), and so on. Each of these sites can publish data about Yo-Yo Ma without getting
permission from any centralized authority, i.e., they can all extend the cumulative knowledge
about any resource in a distributed fashion.
This distributed extensibility leads to problems of its own. In a system where anyone can
publish data about any resource, a lot of what gets published cannot be trusted. On the HTML
web, we, as humans, use our intelligence, invoking concepts of brand, who recommended what,
etc. to decide whether to believe what a web site says. Programs, on the other hand, being
relatively unintelligent, do not have recourse to all these facilities to decide whether to believe
the data from a new site.
We cannot expect programs to be able to make the kind of trust judgments (about sites) that
we as humans make. Consequently, at some level, we have to tell our registries which sites to
trust about which kinds of data. One approach is to rely on centralized registries which ascertain
the quality and trustworthiness of sites providing data. As our experience with the HTML web
and centralized registries such as Yahoo shows, such approaches don’t scale.
Another approach, which complements centralized registries, is to rely on a network of
registries, which share their entries through a Web of Trust between registries. In this model, in
addition to adding some entries on which sites should be queried for which kinds of data, a
programmer also tells her registry which other registries to trust.
When a query arrives from a client, the registry consults its local entries and if no match is
found, forwards the query to its trusted registries. As a result, the work done by any of those in
the programmer’s Web of Trust can be exploited by the program.
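The forwarding behavior can be sketched as below, assuming a simple in-memory registry keyed by a "class|property" string and an acyclic Web of Trust; the entry format and method names are illustrative assumptions.

```java
import java.util.*;

public class TrustedRegistry {
    private final Map<String, List<String>> local = new HashMap<>();
    private final List<TrustedRegistry> trusted = new ArrayList<>();

    public void addEntry(String clsAndProp, String url) {
        local.computeIfAbsent(clsAndProp, k -> new ArrayList<>()).add(url);
    }

    public void trust(TrustedRegistry r) { trusted.add(r); }

    // Consult local entries first; on a miss, forward the query to the
    // registries in this registry's Web of Trust (assumed acyclic here).
    public List<String> resolve(String clsAndProp) {
        List<String> hit = local.get(clsAndProp);
        if (hit != null) return hit;
        for (TrustedRegistry r : trusted) {
            List<String> fromPeer = r.resolve(clsAndProp);
            if (!fromPeer.isEmpty()) return fromPeer;
        }
        return Collections.emptyList();
    }
}
```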




3.7. The TAP System


TAP is an implementation of the querying and negotiation interfaces/protocols described above.
TAP provides a facility for publishing data, a library which implements an application
programming interface for consuming this data and a registry. A program consuming data from
the Semantic Web can either use the TAP client library, which provides a simple GetData API or
quite easily build one using one of the popular SOAP toolkits.

For publishing, we have built a system called TAPache, which functions as a module for the
Apache HTTP server, for providing the GetData interface. TAPache’s goal is to make it extremely
simple to publish data via GetData. TAPache is not intended to be a high end solution for sites
with large amounts of data and traffic. Flexibility and scalability are much more important than
ease of use for such sites.
With the Apache HTTP server, there is a directory (typically called html), in which one places
files (html, gif, jpg, etc.) or directories containing such files, as a result of which these files are
made available via http from that machine. Similarly, with TAPache, one creates a sibling
directory (typically called data) to the html directory and places RDF files in this directory. The
graphs encoded in these files are automatically made accessible via the GetData interface.
TAPache also provides a simple mechanism for aggregating the data in multiple RDF files.




3.8. Semantic Search


The growth of the Semantic Web will be driven by applications that use it. In this section, we
describe an application of the Semantic Web to the problem of search. Search is both one of the
most popular applications on the Web and an application with significant room for improvement.
Few, if any of today’s search engines are based on an understanding of the search term. Their
goal is to retrieve documents in which the search terms occur.
“Search” is a very general term which covers many different kinds of activities. In particular, we
need to distinguish between the following kinds of searches.
Navigational Searches: The user knows that a particular page, such as the home page for a
particular organization or the schedule for a conference, exists and is using the search engine
to navigate to it. In such cases, the user provides the search engine a combination of words
which s/he expects to find in the documents. The user is using the search engine as a
navigation tool to navigate to a particular intended document. Our focus is not on this
class of searches.
Research Searches: Sometimes, the user is researching for information about a particular
object. In this case, there is no particular page which might contain all the information
s/he is looking for. The user provides the search engine with a phrase which is intended to
denote the object being researched. This is the class of searches we are interested in.
A distinguishing characteristic of research searches, which enables us to identify them, is that
the search term denotes the object being researched. Therefore, with the aid of a lexicon, we
can identify such queries.
Example: A search query like “W3C track 2pm Panel” does not denote any concept. The user is
likely just trying to find the page containing all these words. On the other hand, search queries
like “Eric Miller” or “Dublin Ohio”, denote a person or a place. The user is likely doing a research
search on the person or place denoted by the query.
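The lexicon-based test can be sketched as follows. The toy lexicon entries below stand in for the TAP KB, which serves this role in the actual system; the class names are illustrative.

```java
import java.util.*;

public class ResearchSearchDetector {
    // Toy lexicon mapping phrases to the class of the concept they
    // denote; the real system uses the TAP KB for this purpose.
    private static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("eric miller", "Person");
        LEXICON.put("dublin ohio", "Place");
    }

    // A query is treated as a research search iff the lexicon knows a
    // denotation for it; otherwise it is assumed to be navigational.
    public static Optional<String> denotedClass(String query) {
        return Optional.ofNullable(LEXICON.get(query.toLowerCase(Locale.ROOT)));
    }
}
```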
Semantic search attempts to improve the results of research searches in two ways.
•   We augment the list of documents returned by traditional IR based search with relevant
    data pulled out from the Semantic Web. The Semantic Web based results are independent of
    and augment the results obtained via traditional IR techniques. So, for example, a
    search for Yo-Yo Ma might get augmented with his current concert schedule, his music
    albums, his image, etc.



•   We would like to exploit an understanding of the denotation of the search term to help
    better filter and sort the list of documents retrieved.
A long term goal is to develop a general framework for utilizing background knowledge to
improve search. Two Semantic Search systems have been built and they are described below.
ABS: This system is aimed at augmenting searches where the search term denotes a resource
from one of the following classes:
People: Popular people such as musicians, actors, athletes and authors
Places: Places such as countries, states, cities and tourist spots
Products: Products such as consumer electronics and automobiles
The TAP Knowledge Base, which contains about 65,000 instances of these classes, is used as a
lexicon to identify such searches. For each instance, the TAP KB specifies which class(es) the
concept is an instance of and the English phrases typically used to refer to the concept. These
65,000 terms cover about 17% of the searches that take place on an average day on the ODP
Website.
Since the Semantic Web does not yet contain much information, website wrappers for a number
of sites (Amazon.com, CDNow, Travelocity, Expedia, People Magazine, CarPoint, EBay, AOL
Shopping, Wal-Mart, etc.) have been written. The wrappers for each of these sites provide a
GetData interface to the site. Given the number and distributed nature of its data sources,
ABS makes extensive use of the registry and caching mechanisms provided by TAP. All these
data sources together yield a Semantic Web with many millions of triples.
W3C Semantic Search: This system is aimed at providing Semantic Search for a single
website. Working together with the W3C, a system has been created for augmenting searches
on the W3C website with data from a Semantic Web about the W3C. This limited Semantic Web
acts as both the lexicon for identifying research searches and as the source of data for
augmenting searches.
The W3C Semantic Web is built from the following five data sources.
People: This includes W3C staff and authors of various W3C documents. The data sources
     include each staff member's contact information, title and responsibilities.
W3C Activities: Each activity is related to the W3C people.
Working Groups and other Committees: Each of these is related to the activity and staff.
Documents: This includes recommendations, working drafts, W3C notes, etc. Each of
     these is related to the working groups and activities which produced them.
News: W3C Semantic Search incorporates data from a number of RSS news feeds about
     various newsworthy events happening at the W3C.
Early analysis of the service running at the W3C indicates that the local knowledge base covers
85% of the most common search queries encountered on the Website.
Given the search query, we first need to determine the intended denotation, if any, of the
query. Then we need to determine what relevant data needs to be pulled from the Semantic
Web. We now look at each of these steps.




     3.8.1. Choosing a Denotation


The first step is to map the search term to one or more nodes of the Semantic Web. This is
done by using the TAP search interface. More than one term on the Semantic Web may have the
search term (or a subset of the search term) as its rdfs:label or one of the other properties
indexed by the search interface. For example, in ABS, the search query “Paris” could either map
to the city Paris, France, the city Paris, Texas, the music group Paris or ... In this case, we
have to pick one of these as a preferred denotation. This choice is made based on a number of
different factors, including:
•   The popularity of the term, as measured by its frequency of occurrence in a text corpus
    or the availability of data on the Semantic Web. E.g., Paris, France is a preferred
    denotation for “Paris”, compared to Paris, Texas, the music group Paris, etc.
•   The user profile may guide the selection of the denotation. E.g., the phrase “Quark”, as
    used by a Star Trek fan, might be more likely to denote the Star Trek character than the
    subatomic particle.
•   The search context can also help select the denotation. If the user has been searching
    for information about musicians, the query “Pink” may more likely denote the musician
    than the color.
Denotations other than the preferred one are also offered to the user by the system as part of
the search user interface so that the user can select one of those if that is what s/he intended.
If the search term does not denote anything known to the Semantic Web, then we are not able
to contribute anything to the search results.
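The selection among candidate denotations can be sketched as a weighted combination of the factors above. This is a sketch under stated assumptions: the weights, the candidate fields and their scales are illustrative, not taken from the actual system.

```java
import java.util.*;

public class DenotationChooser {
    public static class Candidate {
        final String concept;
        final double corpusFrequency;   // popularity in a text corpus (0..1)
        final double profileAffinity;   // match against the user profile (0..1)
        final double contextAffinity;   // match against recent searches (0..1)
        public Candidate(String concept, double f, double p, double c) {
            this.concept = concept; corpusFrequency = f;
            profileAffinity = p; contextAffinity = c;
        }
    }

    // Combine the three factors with fixed (illustrative) weights and
    // return the highest-scoring candidate as the preferred denotation.
    public static String prefer(List<Candidate> candidates) {
        Candidate best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Candidate c : candidates) {
            double score = 0.5 * c.corpusFrequency
                         + 0.3 * c.profileAffinity
                         + 0.2 * c.contextAffinity;
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best == null ? null : best.concept;
    }
}
```

The non-preferred candidates would still be offered in the user interface, as described above.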




     3.8.2. Determining What to Show


Once we have the search term denotation, the next task is to determine what data from the
Semantic Web we should incorporate as part of the search results, and in what order. As with
traditional search, deciding what to show and in what order is one of the central problems of
Semantic Search. This problem can be visualized in terms of the Semantic Web graph as
follows. The node which is the selected denotation of the search term, provides a starting point.
We will refer to this node as the Anchor Node. We then have to choose a sub-graph around this
node which we want to show. In the case of the search term denoting a combination of terms,
we have two anchor nodes from which we have to choose the sub-graph. Having chosen the
sub-graph, we have to decide the order in which to serialize this sub-graph in the results
presented to the user.
We start with a simple syntactic approach which is widely applicable but has several limitations.
We then introduce various modifications and enhancements to arrive at the approach which
provides adequate results.
The simple approach for selecting the sub-graph, based purely on the structure of the graph,
treating all property types as equally relevant, would be to walk the graph in a breadth first
order, starting with the anchor node, collecting the first N triples, where N is some predefined
limit. This basic approach can be improved by incorporating different kinds of heuristics in how
the walk is done, so as to produce a more balanced sub-graph. Some heuristics include,



•   Include at most N triples with the same source and same arc label, where N is either preset
    or computed based on the average branching factor (i.e., the bushiness) of the graph around
    the anchor node.
•   Include at most M triples with the same source, where M is either preset or computed based
    on the bushiness of the graph around the anchor node. Further, M could be a function of the
    distance of the node from the anchor node. This heuristic results in the inclusion of more
    information about nodes closer to the anchor node.
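The basic breadth-first collection, combined with the per-arc cap from the first heuristic, can be sketched as follows; the graph representation and parameter names are illustrative assumptions.

```java
import java.util.*;

public class SubGraphCollector {
    public static class Triple {
        final String source, label, target;
        public Triple(String s, String l, String t) { source = s; label = l; target = t; }
    }

    // Breadth-first walk from the anchor node, collecting at most `limit`
    // triples overall and at most `perArc` triples with the same source
    // and arc label (the first heuristic above).
    public static List<Triple> collect(Map<String, List<Triple>> graph,
                                       String anchor, int limit, int perArc) {
        List<Triple> result = new ArrayList<>();
        Map<String, Integer> arcCount = new HashMap<>();
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(anchor);
        visited.add(anchor);
        while (!queue.isEmpty() && result.size() < limit) {
            String node = queue.poll();
            for (Triple t : graph.getOrDefault(node, Collections.emptyList())) {
                if (result.size() >= limit) break;
                String key = t.source + "|" + t.label;
                int seen = arcCount.getOrDefault(key, 0);
                if (seen >= perArc) continue;   // skip over-bushy arcs
                arcCount.put(key, seen + 1);
                result.add(t);
                if (visited.add(t.target)) queue.add(t.target);
            }
        }
        return result;
    }
}
```

The second heuristic (a per-source cap M, possibly a function of distance from the anchor) would add one more counter to the same loop.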
The graph collection procedure is modified for the case of pairs of nodes denoted by the search
term. In this case, in addition to collecting triples in the vicinity of each of the nodes, we also
locate one or more paths connecting the nodes and include the triples involved in these paths.
The intuition behind this approach is that proximity in the graph reflects mutual relevance
between nodes. This approach has the advantage of not requiring any hand-coding but has the
disadvantage of being very sensitive to the representational choices made by the source on the
Semantic Web. This approach also has the property of being able to incorporate new pieces of
information, about the anchor node and its neighbors, as they appear on the Semantic Web,
without changing anything in the Semantic Search engine. This property is both a feature and a
bug. It enables Semantic Search to provide richer results as the Semantic Web grows, but also
makes the system more susceptible to spam and irrelevant information.
Another problem with this approach is that it ignores the search context. For example, a search
for “Eric Miller” on the W3C Website should use different data from the Semantic Web compared
to the same search on the Miller family Web site. There is no way of getting this effect with this
approach.
A different approach is to manually specify, for each class of object we are interested in, the set
of properties that should be gathered. We do a breadth-first walk of the graph, collecting values
for these properties till we have gathered a certain number of triples. For example, we specify
that for instances of the class W3CStaff (call the instance O), we collect triples and values
whose target or source is O and whose property is one of title, imageURL, involvedInActivity,
hasAuthor and hasEmailAddress. So, for example, if the anchor node corresponds to Eric Miller,
this will give us a set of triples and also a new set of nodes (the nodes corresponding to the
Semantic Web Activity and the RDF Primer) which are the targets/sources of the triples with the
properties involvedInActivity and hasAuthor. For each of these nodes, based on the type of the
node we collect a new set of triples. This process continues till we either run out of new triples
or we have collected enough triples.
This approach has the advantage that specifying only certain properties provides a kind of filter,
thereby producing more dependable results. It is also easily customizable so as to be able to
factor the search context in determining which properties should be retrieved. However, it
requires more work to set up, and is also not able to incorporate new kinds of information as it
appears on the Semantic Web.
A hybrid approach, which has most of the benefits of both these approaches, is as follows. We
first apply the second approach to the extent that specified properties are available. If not
enough data has been collected, we do a more general graph walk, as per the first approach,
only looking at property types not examined in the first pass.
The triples thus collected are clustered and ordered, first by the source, based on the proximity
to the Anchor Node and then by the arc-label. Based on the user interface, the triples are
appropriately formatted and added to the search results. Figure 12 shows the results of search
augmentations for the search ’Eric Miller’ in W3C Semantic Search and for ’Yo-Yo Ma’ in ABS.




Figure 12. Text Search Results Augmented with Data from the Semantic Web. Left:
     Search for ’Eric Miller’ on the W3C Web site showing overall page and data
      augmentation. Right: Data augmentation alone for search on ’Yo-Yo Ma’




CHAPTER 4. INTEGRATING THE TAP SYSTEM
      WITH EDUTELLA.


4.1. The Edutella Wrapper


The process that any wrapper must perform is the following:
   1. Receives a QEL query as a string
   2. Understands the QEL query
   3. Converts the QEL to the local query language
   4. Sends the transformed query to the local repository
   5. Receives the results from the repository
   6. Transforms the results to a variable binding table
   7. Returns the results
While the Edutella classes cannot help the developer with the implementation of steps 3 to 5,
they can help with steps 2 and 7. The Edutella classes parse a QEL query into a Query object.
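The seven steps above can be sketched as a self-contained skeleton. The repository interface and the helper methods here are hypothetical stand-ins for the Edutella (net.jxta.edutella) and repository classes, not the actual wrapper code.

```java
import java.util.*;

public class EdutellaWrapperSketch {
    // Hypothetical stand-in for the local repository; the real wrapper
    // talks to whatever backend the peer exposes.
    public interface LocalRepository {
        List<Map<String, String>> run(String localQuery);
    }

    public static List<Map<String, String>> handle(String qel, LocalRepository repo) {
        String parsed = parseQel(qel);            // steps 1-2: receive and parse the QEL string
        String localQuery = translate(parsed);    // step 3: convert to the local query language
        List<Map<String, String>> rows = repo.run(localQuery); // steps 4-5: execute, get results
        return rows;                              // steps 6-7: return the variable binding table
    }

    // Trivial placeholders; the real steps use QELQueryParser and a
    // repository-specific translator.
    static String parseQel(String qel) { return qel.trim(); }
    static String translate(String parsed) { return "LOCAL:" + parsed; }
}
```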




       4.1.1. Parsing a QEL Query


I will describe only the three most important classes for parsing a QEL query. These classes
belong to packages net.jxta.edutella.eqm and net.jxta.edutella.eqm.io.

•   Class Query
This class represents an Edutella QEL query. Each Query object aggregates an arbitrary number
of Query Literals (QueryLiteral objects), Outer Join Query Literals and Rules. When executing
the query all Query Literals are evaluated using the necessary rules.
Iterator getLiterals() returns an iterator over all body literals of this query
Iterator getOuterJoinLiterals() returns an iterator over all outer join body literals of this
query
List getResultVariables() returns the list of result variables (variables for which bindings are
requested)
Iterator getRules() returns an iterator over all rules of this query
Iterator getVariables() returns an iterator over all variables of this query


•   Class QELQueryParser
This class represents a parser that converts a QEL query (in XML format) into a Query object.
Query parse (String src) converts a string containing the QEL query (in XML format) into a
Query object.

•   Class DatalogQueryParser
This class represents a parser that converts a QEL query (in Datalog format) into a Query
object.
Query parse (String src) converts a string containing the QEL query (in Datalog format) into
a Query object.
Example:
Here are two examples of how to read a QEL query from a String (in the variable
queryString) and transform it into a Query object. The first one reads an XML encoded query
and the second one reads a Datalog encoded query.
       QELQueryParser queryParser = new QELQueryParser();
       Query query = queryParser.parse(queryString);

       DatalogQueryParser dp = new DatalogQueryParser();
       String st = "@prefix qel: <http://www.edutella.org/qel#>.\n";
       st = st + "@prefix dc: <http://purl.org/dc/elements/1.1/>.\n";
       st = st + "?- qel:s(Resource,dc:title,Title).";
       Query q = dp.parse(st);




4.2. QEL – GetData Wrapper


Wrapping QEL to GetData consists of the following steps:


           Create the N-Trees
           For every N-Tree
              Traverse the tree to obtain its nodes in a certain order
              For each node in the ordered list
                    Create the GetData queries
                    Send the GetData queries to the TAP KB and collect
              the results
                    Apply the restrictions to the results
                    Intersect the results from multiple branches
                    Bind the results to the variables
           Take the union of the results for variables appearing in
           different N-Trees
           Present the results to the user




       4.2.1. Creating the N-Trees


In order to wrap QEL to GetData, I chose to map QEL queries to N-Trees. As I’ve already
mentioned in section 2.5., a QEL query contains a number of elements: variables, predicates,
query literals, rules, etc. With this approach, I create an N-Tree for every rule which appears in
the QEL query.
I implemented a class, named NTree, where an object of this type holds an N-Tree
corresponding to a QEL rule. The most significant attributes of an NTree object are:
•   the name of the node (private String nameOfNode)
•   the type of the node (private String typeOfNode), which can be “variable” or
    “resource”
•   the restrictions associated with a node (private String restrictions)
•   the degree of the tree (private int degree)
•   the number of children of a node (private int numChildren)
•   the list of children of a node (private NTree[] children)
•   a list of labels for the arcs to the children (private ArrayList labels)
•   the number of nodes in the tree (private static int numberNodes)
•   the number of variables in the tree (private static int numVariables)
There are a few steps that have to be covered in order to create the N-Trees. First of all, the
user has to specify the name of the file containing the query. The query can be expressed
either in RDF-QEL or in Datalog, but the type of the query has to be specified explicitly.
The input file is read, parsed and a Query object is created from it. The resulting Query
object is then passed to a TapProviderConnection object, which executes the query. Before
executing the query, it is first parsed and the corresponding N-Trees are created.
I will now present in detail how the N-Trees are created. As I have already mentioned, the user
has to specify the name of the file containing the query, and a Query object is created from it.
This Query object is then transmitted as a parameter to the “format” method of a
TapQueryFormat object, which in turn calls the “createTree” method, which takes as
arguments an iterator over the literals in the query, an iterator over the outer join literals
and the Query object.
First of all, we have to identify how many rules are contained in the query. We will have an N-
Tree for each rule. If the query does not contain any rules, we will have only one N-Tree.
We will take in turn every literal of the query. There will be two different cases, depending on
the type of the literal, whether it is a Statement Literal, or a Built-in Literal.
   a) If the current literal is an instance of a StatementLiteral object, we will extract the
      subject, the predicate and the object from it. These will be stored as RDFNode objects.
      For each of them, we will create a node in the N-Tree, where the names of the nodes will
      be the names of the subject, predicate and object respectively. Their types can only
      have values in the two-element set {“resource”, “variable”}. Note that the predicate
      of the statement can not be a variable, because otherwise the restriction of GetData
      queries would not be satisfied. When creating new nodes in the N-Tree, we will also
      check if nodes corresponding to them have already been created. As one can easily see,
      nodes have to be created only for subjects and objects, because predicates correspond
      to arcs in the tree. Once we have created nodes for the subject and object, or after we
      have identified the corresponding nodes in the N-Tree, we add an arc in the tree,
      pointing from the subject to the object.
   b) If the current literal is an instance of a BuiltinLiteral, we will simply add it to the list of
      conditions.
When we have finished with all the literals in the query, we have built the structure of the
N-Tree. What we still have to do is to add the conditions corresponding to every node. Not
every node in the tree necessarily has conditions associated with it. To accomplish this, we
will take in turn each element from the list of conditions. The built-in predicates specify
restrictions for some of the variables that appear in the QEL query. So, we will have to add
the restrictions to the nodes in the tree which correspond to variables. The first argument of
the built-in predicate stands for the subject of an RDF statement, which means that we will
have to extract the first argument in order to find out to which node we will apply this
constraint.
This was the last step that had to be accomplished in order to create the N-Tree. The algorithm
is repeated for every rule in the QEL query.
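The construction described above can be sketched in simplified form. The Node type below is an illustrative stand-in for the actual NTree class, statement and built-in literals are passed as plain string arrays, and variables are marked with a leading “?” purely for the sketch.

```java
import java.util.*;

public class NTreeSketch {
    // Minimal stand-in for the NTree class described above.
    public static class Node {
        final String name, type;               // type: "variable" or "resource"
        final List<Node> children = new ArrayList<>();
        final List<String> labels = new ArrayList<>();
        final List<String> restrictions = new ArrayList<>();
        Node(String name, String type) { this.name = name; this.type = type; }
    }

    // Build the tree for one rule: each statement literal (s, p, o)
    // contributes nodes for s and o and an arc labelled p; built-in
    // literals become restrictions on variable nodes.
    public static Map<String, Node> build(String[][] statements, String[][] builtins) {
        Map<String, Node> nodes = new LinkedHashMap<>();
        for (String[] st : statements) {
            Node subj = nodes.computeIfAbsent(st[0],
                n -> new Node(n, n.startsWith("?") ? "variable" : "resource"));
            Node obj = nodes.computeIfAbsent(st[2],
                n -> new Node(n, n.startsWith("?") ? "variable" : "resource"));
            subj.children.add(obj);
            subj.labels.add(st[1]);            // the predicate may not be a variable
        }
        for (String[] b : builtins) {          // b = {variable, restriction}
            Node target = nodes.get(b[0]);
            if (target != null && target.type.equals("variable"))
                target.restrictions.add(b[1]);
        }
        return nodes;
    }
}
```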




     4.2.2. Traversing the N-Trees


Once we have created the N-Trees corresponding to a QEL query, we have to traverse these
trees in order to find the order in which the GetData queries have to be sent to the repository.
There are two possible ways to traverse the trees, namely top-down and bottom-up. These
two possibilities correspond to a direct search and an inverse search respectively. If the root of
a tree is a variable, then we will have to traverse the tree in a bottom-up manner, which
corresponds to an inverse order. Similarly, if the root of the tree is a resource, we will traverse
the tree top-down, in direct order.




              4.2.2.1. Top – Down Traversal


I will first present the top-down traversal. In this case, considering an RDF statement, we
know the value of the subject and of the predicate and we are looking for possible values for
the object. So, in order to find the value of the object, we will do the following:
For each node in the ordered list of nodes, we will return all the children and the corresponding
labels of the arcs leading to the children. Now, we have two sub-cases:
   a) If the current node (the parent) is a variable, first we will have to bind all the results for
      this node from the previous iterations. Then, we will have to apply the restrictions, if
      any.
   b) If the current node (the parent) is not a variable, we will only have to apply the
      restrictions for it, if any.
For all the children of the current node, we will create a GetData query, which can be seen as
an RDF statement, having as subject the parent of the child, as predicate the label leading to
the current child, and as object the current child.


              4.2.2.2. Bottom – Up Traversal


For the bottom-up traversal, considering the same model of an RDF statement as a triple,
we now know the values for the predicate and for the object, and want to find the value of the
subject. Here is what we have to do:
For every node in the ordered list of nodes, we get the corresponding parent. We have to check
if the object is also a variable, so we again have two sub-cases:
   a) If the child of the current node is a variable, we will have to bind all the previous results
      we got for that node and apply the restrictions, if there are any. Then, for each value of
      the object, we will create a GetData query.
   b) If the child of the current node is not a variable, there will be only one GetData query,
      having as subject the parent of the current node, as predicate the label of the arc and as
      object the node itself.




     4.2.3. Sending the GetData Queries, Applying the
      Restrictions and Binding the Results


The GetData queries are sent sequentially to the repository and the results are bound to the
variables.
As I’ve already mentioned, in the case that a certain node is of type “variable”, once we have
bound all the results to the corresponding variable, we also have to check if that node also has
associated some restrictions with it. If it has, we will have to apply the restrictions, and only the
values of the results that satisfy all the restrictions will be again bound to that variable.
We also have to take into account that a node might have more than one child. In this case, for
every branch that emerges from this node, we might get some results, retrieved from the TAP
knowledge base. A node of this type corresponds to a logical AND between all the branches that
have that node as a source and the destination one of the children of the node. This means that
the values we obtained for this node must be those one that appear as results for all the
GetData queries corresponding to the children. That is to say that we will have to intersect the
results we got for every branch. The values that arise from this intersection will be bound to
that node.
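The logical-AND step above amounts to a set intersection, which can be sketched as follows (a minimal illustration; result values are represented as plain strings here rather than the real Resource objects):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of the logical-AND step: a node with several outgoing branches
 * keeps only the values returned by every branch, i.e. the intersection
 * of the per-branch result sets.
 */
class BranchIntersection {
    static Set<String> intersect(List<Set<String>> perBranchResults) {
        Set<String> bound = new LinkedHashSet<>(perBranchResults.get(0));
        for (Set<String> branch : perBranchResults.subList(1, perBranchResults.size())) {
            bound.retainAll(branch);   // keep only values this branch also returned
        }
        return bound;
    }
}
```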




     4.2.4. Unifying the Results


When there is more than one N-Tree, a variable might appear in several of them. This means
that for each tree in which a given variable appears, we get a set of results. We consider these
sets of results complete after all GetData queries have been sent to the repository and the
answers have been filtered according to the associated restrictions and then intersected over
all the branches, as described above.
We have to identify which nodes in the trees refer to the same variable. This can easily be
done by simply inspecting the names of the nodes. Once we have identified all the nodes from all
the N-Trees that refer to the same variable, together with their corresponding sets of results,
we unify the results, and the final results are bound to the variable.
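A sketch of this unification step follows. It assumes that "unifying" keeps only the values common to all trees in which the variable appears (the same conjunctive reading as for branches); the types and method names are illustrative stand-ins.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of the unification step: nodes from different N-Trees carrying the
 * same variable name are grouped by name, and the final binding is the set
 * of values common to all of the variable's per-tree result sets.
 */
class VariableUnifier {
    /** perTreeResults: for each tree, the map variableName -> result set. */
    static Map<String, Set<String>> unify(List<Map<String, Set<String>>> perTreeResults) {
        Map<String, Set<String>> unified = new LinkedHashMap<>();
        for (Map<String, Set<String>> tree : perTreeResults) {
            for (Map.Entry<String, Set<String>> e : tree.entrySet()) {
                unified.merge(e.getKey(), new LinkedHashSet<>(e.getValue()),
                              (acc, next) -> { acc.retainAll(next); return acc; });
            }
        }
        return unified;
    }
}
```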




     4.2.5. Presenting the Results to the User


The results that are bound to variables and are to be presented to the user are stored as
objects of type Resource. In order to display them, we have to convert them to an easily
readable format, which is done with the aid of methods implemented in the TAP client. The
query results are displayed as RDF graph answers.




4.3. An Example


To get a better understanding of the above algorithm, let’s consider the following example.
Suppose that our QEL query looks like this:


?- qel:s(X,<http://localhost/data/tap.rdf/teamMember>,Y),
qel:s(Y,<http://localhost/data/tap.rdf/hasResearchArea>,
<http://localhost/data/tap.rdf/Artificial_Intelligence>).


In natural language, this query is equivalent to “Identify all the departments X which have a
team member Y whose research area is Artificial Intelligence”.
As one can easily see, there will be only one tree associated with this query, because it does
not contain any rules.
We will take in turn all the literals that build up this query.
The first literal, qel:s(X,<http://localhost/data/tap.rdf/teamMember>,Y), is a statement
literal whose subject is the variable X, whose predicate is
“http://localhost/data/tap.rdf/teamMember", and whose object is the variable Y. Currently
the N-Tree is empty, so we do not have to check whether there are already nodes that
correspond to this subject or object. We will build a node for the subject, having the name “X”
and the type “variable”, and a node for the object, with the name “Y” and also the type
“variable”. Then, we will add an arc with the label
“http://localhost/data/tap.rdf/teamMember", pointing from the subject to the object.
The second literal,
qel:s(Y,<http://localhost/data/tap.rdf/hasResearchArea>,
<http://localhost/data/tap.rdf/Artificial_Intelligence>), is also a statement literal.
In this case, the subject of the statement is the variable Y, the predicate is
“http://localhost/data/tap.rdf/hasResearchArea”, and the object is the resource
“http://localhost/data/tap.rdf/Artificial_Intelligence”. We again have to create a
node for the subject and one for the object, but this time the N-Tree already contains some
nodes. After inspecting the tree, we find that a node corresponding to the subject of this
statement was already created in the previous step. We still have to create one for the
object. This will be a node with the name
“http://localhost/data/tap.rdf/Artificial_Intelligence” and the type “resource”.
There are no more literals in the query, so the constructed N-Tree is complete. This is a
very simple tree, with only three nodes and no restrictions associated with them. Here is how it
looks (note that the labels of the arcs and the name of the node corresponding to the
resource are displayed in short form) (Figure 13):




          [ Name: X  |  Type: Variable  |  Restrictions: null ]
                                |
                          <teamMember>
                                v
          [ Name: Y  |  Type: Variable  |  Restrictions: null ]
                                |
                        <hasResearchArea>
                                v
          [ Name: Artificial_Intelligence  |  Type: Resource  |  Restrictions: null ]


              Figure 13. The N-Tree corresponding to the QEL query




For the constructed tree, we have to determine what kind of traversal to perform. The root
of this tree is a node of type “variable”, which means that we will do a bottom – up traversal,
i.e. an inverse search.
Traversing the tree starting from the bottom results in an ordered list of nodes:
Artificial_Intelligence, Y and X.
We take the first node from this list, Artificial_Intelligence, and search the tree for its
parent and the corresponding label of the arc pointing from the parent to it. In this case, the
parent node is Y, and the label is hasResearchArea. With these resources, we create a
GetData query, which looks like:
Y <- GetData (<Artificial_Intelligence>,<hasResearchArea>,"inverse=yes")
The query is sent to the repository, and the results are bound to Y.
There are no constraints associated with the node Y, which means that we do not have to filter
the results corresponding to Y.
The node Y has only one child, so there is only one source of results for Y. No intersection
of results has to be made.
The next node in the ordered list of nodes is Y. We determine that its parent node is X and
that the label of the arc pointing from X to Y is teamMember. We must note that the node Y is a
variable, so we cannot create correct GetData queries in this form. But we already have some
results bound to the variable Y. For each of these results, we build a GetData query which is
sent to the repository, and the results are bound to X. In other words, we will have:



          for each binding i of Y
             add_to_bindings_of_X <-
             GetData(<binding_Y>i,<teamMember>,“inverse=yes”)




From this set of results we eliminate the duplicates, and the remaining results are bound to
the variable X.
The next node in the ordered list resulting from the tree traversal is X. This node is the
root of the tree, so there is no other node that is its parent. No more GetData queries can be
issued.
The last step of our algorithm consists of presenting the results to the user. The results are
displayed as RDF graph answers. Since our query contains only two matching predicates, we
present the results that have been bound to the variables X and Y, the subjects of these
statement literals.




CHAPTER 5. ADDING PERSONALIZATION TO
  SEMANTIC WEB SEARCH.


Personalized support for learners becomes very important when eLearning takes place in open
and dynamic learning and information networks. Personalized learning using distributed
information in dynamic and heterogeneous learning environments is still an unsolved problem in
eLearning research. In the next sections I will present how the TAP approach can be used in an
eLearning scenario.




5.1. How the TAP Approach Can Be Used in an eLearning
     Scenario


Let’s consider the following scenario, which motivates our approach: a professor who teaches
“Web Technologies” presents the content of his course on a site available online. Figure 14
shows an example of a lecture:




                             Figure 14. Example of a Lecture




As we can easily see, a lecture has a title (“XML: Einführung (Fortsetzung), DTDs
und XML Schema“, i.e. “XML: Introduction (continued), DTDs and XML Schema”), the date
when the lecture will be held („Vorlesung vom 26.4.2004“, i.e. “Lecture of 26.4.2004”),
some keywords (“XML, DTD, XML Schema”), a link to the slides corresponding to the lecture
(“Slides (pdf)”) and some recommendations for further reading (“FAQ to XML (provides
good overview): The XML FAQ”, “XML entry-page at W3C: Extensible Markup
Language (XML)”, “Online XML Tutorial in the Web: Google Search: XML + online +
tutorials”, “XML Schema at W3C”, “The ultimate XML Schema document:
http://www/w3.org/2001/10/XMLSchema”, “XTML at W3C”).
A student who visits this page will see the description of the lectures and, in addition, on the
right side of the page, some further recommendations for each lecture, which are related to
W3C. These recommendations on the right side of the site are strictly related to the lectures
(the title and the keywords of the lectures) and to W3C.




5.2. Implementation of the Scenario


To accomplish this goal, we will make use of the knowledge base from TAP, because it contains
many resources that refer to W3C (W3C as a consortium, W3C activities, W3C working
drafts, W3C specifications, W3C notes, persons involved in W3C activities, mailing lists,
etc.). This knowledge base from TAP is described in RDF.
Besides, we have a list of lectures, located at:
http://www.kbs.uni-hannover.de/~henze/semweb04/skript/inhalt.xml
The file containing the description of the lectures is in XML format (the whole file can be seen
in Appendix C). So, in order to ensure uniformity and conformity with the knowledge base of TAP,
we have to convert this XML file into an RDF file. We created an XSL file which is then
included in the XML file, so that the output will be in RDF (the XSL file is also available in
Appendix C). The resulting RDF file was validated with the validator from W3C
(http://www.w3.org/RDF/Validator/).
The implementation of the scenario was written in Java. I used Jena, the semantic web
framework for Java (http://jena.sourceforge.net/), to parse the RDF files corresponding to the
lectures and to the knowledge base.
As I have already mentioned, we search the knowledge base only for those resources that are
related to W3C and to the titles and keywords of the lectures. This means that we have to
extract from the RDF file the title and the keywords corresponding to each lecture and store
them in some objects. From the TAP knowledge base we are interested only in the entries about
W3C. We browsed the knowledge base and found out that the most relevant subsets of TAP
are stored under the following nodes:


CMUPerson
    CMUFaculty
    CMUGraduateStudent
    CMU_RAD

Specification
    DocumentFormatSpecification
        ProtocolSpecification
            W3CNote
        StandardSpecification
            W3CSpecification
            ASMESpecification
            IEEESpecification
            IETFSpecification

WorldWideWebConsortium
    W3CDeptAdministrativeSupport
    W3CDeptArchitecture
    W3CDeptCommunication
    W3CDocumentFormats
    W3CDeptInteraction
    W3CDeptManagement
    W3CDeptSystems
    W3CDeptTechnologyandSociety
    W3CDeptWebAccessibilityInitiative

W3CActivity
    W3CAccessibilityActivity
    W3CAmayaActivity
    W3CDOMActivity
    W3CDeviceIndependenceActivity
    W3CGraphicsActivity
    W3CHTMLActivity
    W3CInternationalizationActivity
    W3CJigsawActivity
    W3CMathActivity
    W3CPatentActivity
    W3CPrivacyActivity
    W3CSemanticWebActivity
    W3CStyleActivity
    W3CSynchronizedMultimediaActivity
    W3CURIActivity
    W3CVoiceBrowserActivity
    W3CWebServicesActivity
    W3CXMLActivity
    W3CXMLEncryptionActivity
    W3CXMLKeyActivity
    W3CXMLSignatureActivity

W3CPerson

MailingList

W3CWorkingDraft

These nodes are parsed with Jena, and the IDs of the resources and their labels are stored in
some objects that will be used later.
Now that we have the titles and the keywords of the lectures and the resources from the above
nodes, we have to find those resources from TAP that match the titles and the keywords. The
matching is done in several ways.
For one of the approaches, I had to create a glossary of semantic web terms. In the next
section I will present the glossary in more detail.




     5.2.1. The Glossary of Semantic Web Terms


In order to create this glossary, I had to go through several steps, as follows. First of all, I
created a text file containing approximately 1700 semantic web terms. At first this seemed
sufficient, but after further consideration we decided that a hierarchy of semantic web terms
would be more helpful. The best option for creating this hierarchy was to transform the text file
into an RDFS file, which I did with the aid of a Java program. For each term in the text
file, I created a line that looks like:
<rdfs:Class rdf:about="http://protege.stanford.edu/kb#Semantic_Term"
      rdfs:label="Semantic_Term"/>
This means that every semantic term in the glossary is a class. The resulting RDFS file was then
imported into Protégé 2.0, and with the aid of the graphical interface of this tool I rearranged
the terms into a hierarchy. In the end, the RDFS file contained classes, which can be on the
highest level of the hierarchy or can be subclasses of other classes.
In order to be able to use these terms, I had to use Jena to parse the file and get each entry
together with the information about its place in the hierarchy.
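The text-to-RDFS conversion above can be sketched as a one-line-per-term formatter. This is a minimal illustration (the class and method names are mine); only the output format and namespace follow the example line shown above.

```java
/**
 * Sketch of the text-to-RDFS conversion: every term from the plain-text
 * glossary becomes one rdfs:Class line in the Protégé namespace.
 */
class GlossaryConverter {
    static final String NS = "http://protege.stanford.edu/kb#";

    static String toRdfsClass(String term) {
        // terms with spaces become single identifiers, as in the example line
        String id = term.trim().replace(' ', '_');
        return "<rdfs:Class rdf:about=\"" + NS + id + "\" rdfs:label=\"" + id + "\"/>";
    }
}
```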




     5.2.2. Finding Resources That Match the Keywords


For finding the resources from the TAP knowledge base that match the keywords, I used three
approaches.




              5.2.2.1. Approach 1 – Single Words


This is the simplest approach and it works as follows:


          For each lecture
             Get the keywords
             For each keyword
                   For each word in the keyword
                         If it is a stop-word then continue
                         Test the word with the KB from TAP
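The pseudo code above can be sketched in Java as follows. The stop-word list and the helper names are illustrative stand-ins, not the thesis implementation; the sketch only collects the words that would be tested against the KB.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of Approach 1: split each keyword into words, skip the stop-words
 * and collect every remaining word for testing against the TAP KB.
 */
class SingleWordMatcher {
    // illustrative stop-word list, not the one used in the thesis
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("to", "the", "of", "a", "an", "and", "in"));

    static List<String> wordsToTest(List<String> keywords) {
        List<String> toTest = new ArrayList<>();
        for (String keyword : keywords) {
            for (String word : keyword.split("\\s+")) {
                if (STOP_WORDS.contains(word.toLowerCase())) continue; // skip stop-words
                toTest.add(word);   // this word would be tested with the KB from TAP
            }
        }
        return toTest;
    }
}
```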




              5.2.2.2. Approach 2 – Single Words Plus Stemmer


This is an extension of the previous approach: each word from the keywords is first reduced
to its root form before being tested with the KB from TAP. The pseudo code for this
approach looks like this:


          For each lecture
             Get the keywords
             For each keyword
                    For each word in the keyword
                          If it is a stop-word then continue
                         Get the root of the word with Stemmer
                         Test the root of the word with KB from TAP


Extraction of the root of a word is done by importing the Stemmer class into the project.




             5.2.2.3. Approach 3 – Meaningful Expressions


This approach consists of creating meaningful expressions from the words in the keywords. I
create all the possible combinations of successive words from the keywords, with the restriction
that these expressions must contain at least two words that are not stop-words. The longest
expressions are those that include the whole keyword. The resulting expressions are then
sorted, ranging from those which contain the most words down to those with only two words or
the keywords with only one word. This sorting is done in order to compute a score for the
matched resources, and it is done with a quicksort algorithm. The score of a matching resource
is computed with the formula:
                       score = 100 * nr_of_words_in_searched_string
This means that the more words from the keywords a resource contains in its label, the better
its score. By sorting the expressions, we obtain the results already sorted. The pseudo code
for this approach looks as follows:


          For each lecture
             Get the keywords
             For each keyword
                   Create all possible expressions having at least
                   two non-stop adjacent words
             Sort all the created expressions, ranging from those with
             the most words down to those with only two non-stop words
             or the keywords with only one word
             Test the ordered expressions with the KB from TAP
             Compute the score for each matched resource from TAP




To have a better understanding of this algorithm, let’s consider an example. Suppose that we
have a lecture, which has the following keywords:
Keywords: Semantic Web, Semantic           Web   Tower,    W3C,   Semantic     Web   Architecture,
Introduction to Markup Languages
All the possible meaningful expressions that can be created from these keywords are (already
sorted):
   1. Introduction to Markup Languages
   2. Semantic Web Tower
   3. Semantic Web Architecture
   4. Introduction to Markup
   5. to Markup Languages
   6. Semantic Web
   7. Web Tower
   8. Web Architecture
   9. Markup Languages
   10. W3C


Note that the last keyword contains the stop-word “to”, so the two-word expressions
“Introduction to” and “to Markup” do not satisfy the restriction that all expressions must
contain at least two non-stop words.
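The expression-building step can be sketched in Java as follows. The stop-word list is an illustrative stand-in, and a stable library sort is used in place of the quicksort mentioned above; with the example keywords, the sketch reproduces the sorted list of ten expressions shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of Approach 3: build every run of adjacent words from a keyword,
 * keep only runs with at least two non-stop words (single-word keywords are
 * kept as-is), drop duplicates, and sort by word count in descending order.
 */
class ExpressionBuilder {
    // illustrative stop-word list, not the one used in the thesis
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("to", "the", "of", "a", "an"));

    static List<String> build(List<String> keywords) {
        // LinkedHashSet drops duplicates while keeping first-seen order
        Set<String> expressions = new LinkedHashSet<>();
        for (String keyword : keywords) {
            String[] words = keyword.split("\\s+");
            if (words.length == 1) {           // single-word keywords are kept as-is
                expressions.add(keyword);
                continue;
            }
            for (int start = 0; start < words.length - 1; start++) {
                for (int end = start + 1; end < words.length; end++) {
                    int nonStop = 0;
                    StringBuilder expr = new StringBuilder();
                    for (int i = start; i <= end; i++) {
                        if (!STOP_WORDS.contains(words[i].toLowerCase())) nonStop++;
                        if (i > start) expr.append(' ');
                        expr.append(words[i]);
                    }
                    if (nonStop >= 2) expressions.add(expr.toString());
                }
            }
        }
        List<String> sorted = new ArrayList<>(expressions);
        // stable sort: longest expressions first, ties keep insertion order
        sorted.sort(Comparator.comparingInt((String e) -> -e.split("\\s+").length));
        return sorted;
    }

    /** Score from the thesis: 100 * number of words in the matched expression. */
    static int score(String expression) {
        return 100 * expression.split("\\s+").length;
    }
}
```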




     5.2.3. Finding Resources That Match the Title


For finding the resources from the TAP knowledge base that match the titles of the lectures, I
used four approaches.




              5.2.3.1. Approach 1 – Single Words


This approach is similar to the first one used for finding the resources that match the
keywords. The pseudo code is the following:


          For each lecture
             Get the title
             For each word in the title
                   If it is a stop-word then continue
                   Test the word with the KB from TAP




              5.2.3.2. Approach 2 – Single Words Plus Stemmer


Just as with the keywords, the words from the title are first reduced to their root and then
tested with the KB from TAP.


          For each lecture
             Get the title
             For each word in the title
                   If it is a stop-word then continue
                   Get the root of the word with Stemmer
                   Test the root of the word with KB from TAP


Extraction of the root of a word is done by importing the Stemmer class into the project.



              5.2.3.3. Approach 3 – Meaningful Expressions


This approach is also similar to the one used for the keywords. I create all the possible
substrings with at least two non-stop adjacent words, then I sort the resulting expressions and
test them with the KB from TAP. The matched resources also have a score associated with
them, computed with the formula:
                        score = 100 * nr_of_words_in_searched_string
The algorithm for this approach is the following:


           For each lecture
              Get the title
              Create all possible expressions having at least
              two non-stop adjacent words
              Sort all the created expressions, ranging from those with
              the most words down to those with only two non-stop words
              or the title having only one word
              Test the ordered expressions with the KB from TAP
              Compute the score for each matched resource from TAP




              5.2.3.4. Approach 4 – Title Reduced to a List of
                 Semantic Web Terms


For this approach we will make use of the glossary of semantic web terms. For each lecture, we
extract the title and then, for every word in the title that is not a stop-word, we search the
glossary to find the related semantic web terms. We return all the semantic web terms that
match our word, together with their parents in the hierarchy. All the terms that appear on the
same level of the hierarchy as the matched terms are also returned; this rule is not applied to
the terms that appear on the highest level of the hierarchy. After creating this list of
semantic web terms corresponding to the title, we sort them, just as we did for the
meaningful expressions. The ordered list is then tested with the knowledge base from TAP,
and the matched resources have a score associated with them, computed with the formula:
           score = 100 * nr_of_words_in_the_corresponding_semantic_web_term
Here is the algorithm for this approach:


           For each lecture
              Get the title
              For each word in the title
                    If it is a stop-word then continue
                    Create a list of semantic web terms
                          Return all the matched terms
                          Return all the terms that appear on the same
                          level as the matched terms - if they are not
                          on the highest level
                          Return the parents of the matched terms
              Sort the computed list of terms
              Test the sorted list with the KB from TAP
              Compute the score for each matched resource from TAP




     5.2.4. Evaluating the Results of the Program


In order to evaluate the results of my program, I used two standard indicators, precision (P)
and recall (R), which are defined below:


           P = No. of relevant documents presented to the user /
               Total number of documents presented to the user

           R = No. of relevant documents presented to the user /
               Total number of relevant documents in the collection


I created a questionnaire which had to be answered by a number of experts; it was implemented
in PHP, with the resources stored in a MySQL database, and it was completed by twelve experts.
There were five lectures, from which two were randomly chosen. For each lecture, I first
presented the keywords and thirty resources randomly chosen from the TAP knowledge base,
from which the experts had to select all those they considered somehow related to the
keywords. Then, for the same lecture, I presented the title and another thirty randomly chosen
resources, from which the experts had to select those they considered related to the title.
These two steps were then repeated for the second lecture.
The resources presented to the experts were randomly selected from some subsets of the TAP
knowledge base, namely W3CActivity, W3CNote, W3CWorkingDraft and W3CSpecification.
Together, these nodes of the knowledge base contain 360 entries, of which only 30 were
randomly presented to each expert. This means that many resources had to be ignored when
computing the indicators precision and recall. So, the formulas used in this case are:




           P = No. of good results found by TAP /
               (Total no. of results from TAP - No. of resources not sent to experts)

           R = No. of good results found by TAP /
               Total no. of good results found by experts
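The adjusted indicators can be sketched directly from the formulas; the counts in the usage below are hypothetical, only the formulas themselves follow the definitions above.

```java
/**
 * Sketch of the adjusted precision/recall indicators. The method and
 * parameter names are illustrative; only the formulas come from the text.
 */
class Indicators {
    /** P = good results / (all TAP results - resources never shown to the experts) */
    static double precision(int goodFromTap, int totalFromTap, int notSentToExperts) {
        return (double) goodFromTap / (totalFromTap - notSentToExperts);
    }

    /** R = good results found by TAP / good results found by the experts */
    static double recall(int goodFromTap, int goodFromExperts) {
        return (double) goodFromTap / goodFromExperts;
    }
}
```

For example, with 6 good TAP results out of 10 total, of which 4 were never shown, the adjusted precision is 6 / (10 - 4) = 1.0.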


These two formulas were applied for all the approaches developed in the program. The
results are presented in the tables below:




                 Keywords             Keywords             Keywords
                 [Single Words]       [Stemmer]            [Semantic Web Terms]
     Agreement   Precision  Recall    Precision  Recall    Precision  Recall

Lecture 1  33%       1      0.3636        1      0.5455        1      0.2727
           50%       1      0.3333        1      0.5238        1      0.2857
           66%       1      0.1667        1      0.4167        1      0.3333
          100%       1      0.1667        1      0.4167        1      0.3333

Lecture 2  33%       1      0.6667        1      0.6667        1      0.6667
           50%       1      0.6875        1      0.6875        1      0.6875
           66%       1      0.8077        1      0.8077        1      0.8077
          100%       1      0.8077        1      0.8077        1      0.8077

Lecture 3  33%       1      0.4737        1      0.4737        0      0
           50%       1      0.4737        1      0.4737        0      0
           66%       1      0.6           1      0.6           0      0
          100%       1      0.6429        1      0.6429        0      0

Lecture 4  33%       1      0.3333        1      0.3333        1      0.3333
           50%       1      0.3333        1      0.3333        1      0.3333
           66%       1      0.3333        1      0.3333        1      0.3333
          100%       1      0.3333        1      0.3333        1      0.3333

Lecture 5  33%       1      0.1875        1      0.8125        1      0.8125
           50%       1      0.2           1      0.8           1      0.8
           66%       1      0.25          1      0.8333        1      0.8333
          100%       1      0.2727        1      0.85          1      0.85

                  Table 1 – Precision and Recall for Keywords
                 Title              Title              Title              Title
                 [Single Words]     [Stemmer]          [Expressions]      [Semantic Web Terms]
     Agreement   Precision Recall   Precision Recall   Precision Recall   Precision Recall

Lecture 1  33%       1     0.0385       1     0.0385       0     0            1     0.0385
           50%       1     0.0385       1     0.0385       0     0            1     0.0385
           66%       1     0.0556       1     0.0556       0     0            1     0.0556
          100%       1     0.0588       1     0.0588       0     0            1     0.0588

Lecture 2  33%       1     0.6538       1     0.6538       1     0.0385       1     0.6538
           50%       1     0.68         1     0.68         1     0.04         1     0.68
           66%       1     0.8095       1     0.8095       1     0.0476       1     0.8095
          100%       1     0.8          1     0.8          1     0.045        1     0.8

Lecture 3  33%       1     0.5667       1     0.5667       0     0            1     0.5667
           50%       1     0.6971       1     0.6071       0     0            1     0.6071
           66%       1     0.6667       1     0.6667       0     0            1     0.6667
          100%       1     0.6522       1     0.6522       0     0            1     0.6522

Lecture 4  33%       1     0.7727       1     0.7727       0     0            1     0.7727
           50%       1     0.7727       1     0.7727       0     0            1     0.7727
           66%       1     0.8947       1     0.8947       0     0            1     0.8947
          100%       1     0.8947       1     0.8947       0     0            1     0.8947

Lecture 5  33%       1     0.2          1     0.2          0     0            1     0.2
           50%       1     0.25         1     0.25         0     0            1     0.25
           66%       1     0.3333       1     0.3333       0     0            1     0.3333
          100%       1     0.3333       1     0.3333       0     0            1     0.3333

                    Table 2 – Precision and Recall for Titles
As one can easily see in Table 1 and Table 2, our program is very precise: it returns only results for which the agreement is 100%. The agreement for each resource presented to the experts is computed as a fraction whose denominator is the number of times that resource has been shown and whose numerator is the number of times it has been selected. The fraction therefore takes values between 0 and 1, where 0 means total disagreement and 1 means total agreement. For our experiment we considered four agreement thresholds, namely 33%, 50%, 66% and 100%.
Recall, on the other hand, is not as good, which means that the experts chose more diverse resources as good results. Our program searches the TAP knowledge base mostly for those resources which contain the words appearing in the title or in the keywords of the lectures.
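The metrics discussed above can be sketched in a few lines of code. The following is a minimal, self-contained illustration; all names in it are hypothetical and do not come from the actual evaluation code. Agreement is the selected/shown fraction, precision is the fraction of returned results that the experts marked as good, and recall is the fraction of expert-chosen results that the program returned.

```java
import java.util.*;

// Minimal sketch of the evaluation metrics described above.
// All names here are hypothetical, not taken from the actual evaluation code.
public class EvalSketch {

    // Agreement = times selected / times shown, a value in [0, 1].
    static double agreement(int timesSelected, int timesShown) {
        return (double) timesSelected / timesShown;
    }

    // Precision: fraction of the returned results that are relevant.
    static double precision(Set<String> returned, Set<String> relevant) {
        if (returned.isEmpty()) return 0.0;
        long hits = returned.stream().filter(relevant::contains).count();
        return (double) hits / returned.size();
    }

    // Recall: fraction of the relevant results that were returned.
    static double recall(Set<String> returned, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        long hits = relevant.stream().filter(returned::contains).count();
        return (double) hits / relevant.size();
    }

    public static void main(String[] args) {
        // A resource shown 4 times and selected 3 times passes the 66%
        // agreement threshold but not the 100% one.
        double a = agreement(3, 4);
        System.out.println(a >= 0.66); // true
        System.out.println(a >= 1.0);  // false

        // "Relevant" resources: those whose agreement meets the threshold.
        Set<String> relevant = new HashSet<>(Arrays.asList("r1", "r2", "r3"));
        Set<String> returned = new HashSet<>(Arrays.asList("r1", "r2"));
        System.out.println(precision(returned, relevant)); // 1.0
        System.out.println(recall(returned, relevant));
    }
}
```

Note how a precision of 1 with a recall below 1, as in the tables above, corresponds exactly to this situation: everything returned is relevant, but not everything relevant is returned.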




CHAPTER 6. CONCLUSIONS AND FUTURE WORK.


In this thesis I presented the following contributions:
      - A QEL-to-GetData wrapper for Edutella
      - Personalization of Semantic Web search
Several analyses were also carried out, ending with an evaluation of the program that adds personalization to Semantic Web search.
Future work involves modifying the algorithm so that recall improves. Precision will probably decrease in this case, but the results will be more diverse. Another possible development would be to combine the two algorithms in order to obtain personalized semantic search in Edutella: queries would be issued in QEL, translated to GetData with the aid of the wrapper, sent against the repositories, and the results then presented to the users.
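As a rough, heavily simplified sketch of the combined pipeline just outlined, the flow could look like the code below. The classes and methods here are hypothetical stand-ins, not the real Edutella or TAP API; the actual wrapper builds NTree structures via TapQueryFormat, as shown in the appendix.

```java
import java.util.*;

// Toy sketch of the envisioned pipeline: QEL query -> GetData requests ->
// repository -> results. Hypothetical stand-ins, not the real Edutella/TAP API.
public class PipelineSketch {

    // Stand-in for the wrapper's QEL-to-GetData translation step. Here we
    // simply emit one GetData-style request per triple pattern, whereas the
    // real wrapper builds trees from the QEL literals.
    static List<String> translateToGetData(String qelQuery) {
        List<String> requests = new ArrayList<>();
        for (String pattern : qelQuery.split(";")) {
            requests.add("GetData(" + pattern.trim() + ")");
        }
        return requests;
    }

    // Stand-in for sending the translated requests against a repository.
    static List<String> execute(List<String> requests) {
        List<String> results = new ArrayList<>();
        for (String request : requests) {
            results.add("result-of:" + request);
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> requests = translateToGetData("s1 p1 o1; s2 p2 o2");
        for (String result : execute(requests)) {
            System.out.println(result);
        }
    }
}
```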




CHAPTER 7. BIBLIOGRAPHY.


 1. “EDUTELLA: P2P Networking for the Semantic Web” by Wolfgang Nejdl, Boris Wolf,
    Wolf Siberski, Changtao Qu, Stefan Decker, Michael Sintek, Ambjörn Naeve, Mikael
    Nilsson, Matthias Palmer, Tore Risch. [Technical Report, 2003].
 2. “Edutella: A P2P Networking Infrastructure Based on RDF” by Wolfgang Nejdl,
    Boris Wolf, Changtao Qu, Stefan Decker, Michael Sintek, Ambjörn Naeve, Mikael Nilsson,
    Matthias Palmer, Tore Risch. [11th International World Wide Web Conference
    (WWW2002), Hawaii, USA, May 2002].
 3. “Interacting Edutella/JXTA Peer-to-Peer Network with Web Services” by
    Changtao Qu and Wolfgang Nejdl. [In Proc. of the International Symposium on
    Applications and the Internet (SAINT 2004), IEEE Computer Society Press, Jan.
    2004, Tokyo, Japan].
 4. “Edutella: Searching and Annotating Resources within an RDF-based P2P
    Network” by W. Nejdl, Boris Wolf, Steffen Staab, Julien Tane. [Semantic Web
    Workshop, at the 11th International World Wide Web Conference (WWW2002),
    Hawaii, USA, May 2002].
 5. “Project JXTA: An Open, Innovative Collaboration” by Sun Microsystems, Inc.
    [April 25, 2001].
 6. “Using Edutella Classes to Build a Wrapper” by Daniel Olmedilla, Wolf Siberski
 7. “Efficient Search in Peer-to-Peer Networks” by Beverly Yang, Hector Garcia-Molina.
    [In Proceedings of the 21st International Conference on Distributed Computing
    Systems (ICDCS 2001)].
 8. “A Framework for Semantic Gossiping” by Karl Aberer, Philippe Cudré-Mauroux,
    Manfred Hauswirth. [SIGMOD Record, 31(4), 2002].
 9. “Trust Management for the Semantic Web” by Matthew Richardson, Rakesh
    Agrawal, Pedro Domingos. [Proceedings of the Second International Semantic
    Web Conference, Sanibel Island, Florida, 2003.].
 10. “Personalization in Distributed e-Learning Environments” by Peter Dolog, Nicola
    Henze, Wolfgang Nejdl, Michael Sintek. [In Proceedings of WWW2004, May 2004].
 11. “Designing a Super-Peer Network” by Beverly Yang, Hector Garcia-Molina. [19th
    International Conference on Data Engineering, March 05-08, 2003, Bangalore,
    India].
 12. “TAP: A Semantic Web Platform” by R. Guha, Rob McCool
 13. “TAP: Towards the Semantic Search” – PowerPoint slides for a talk at the World
    Wide Web 2002 Conference.
 14. “Semantic Search” by R.V. Guha, Rob McCool and Eric Miller. [Proceedings of
    WWW 2003].
 15. “Contexts for the Semantic Web” by R. Guha, R. McCool, R. Fikes.



16. “A System for Integrating Web Services into a Global Knowledge Base” by
   R. Guha, Rob McCool
17. “Personalized Web Search – Diploma Thesis in Computer Science” by Paul-Alexandru
    Chirita.
18. “S-Match: an Algorithm and an Implementation of Semantic Matching” by F.
   Giunchiglia, P. Shvaiko and M. Yatskevich. [First European Semantic Web
   Symposium, ESWS 2004, Heraklion, Crete, Greece, May 2004 – Proceedings].
19. The apache http server. http://www.apache.org
20. RDF Primer. http://www.w3.org/TR/REC-rdf-syntax/
21. Semantic Web. http://www.w3.org/2001/sw/
22. Homepage of Edutella project. http://edutella.jxta.org
23. RDF Query Exchange Language (QEL). http://edutella.jxta.org/spec/qel.html
24. RDF Validator. http://www.w3.org/RDF/Validator/
25. PHP: Hypertext Preprocessor. http://www.php.net/
26. MySQL. http://www.mysql.com/
27. The Protégé Ontology Editor and Knowledge Acquisition System.
    http://protege.stanford.edu/
28. Jena 2 – A Semantic Web Framework.
   http://www.hpl.hp.com/semweb/jena.htm
29. Xerces Java Parser 1.4.4. http://xml.apache.org/xerces-j/
30. Xalan Java Version 2.6.0. http://xml.apache.org/xalan-j/
31. L3S Research Center. http://www.l3s.de/




CHAPTER 8. APPENDIX A.

SOME CODE FROM THE QEL-GETDATA WRAPPER.



/*
 * TapQueryFormat.java
 *
 * Created on Apr 30, 2004
 *
 * To change the template for this generated file go to
 * Window>Preferences>Java>Code Generation>Code and Comments
 */
package net.jxta.edutella.provider.getdata;

import   net.jxta.edutella.eqm.Query;
import   net.jxta.edutella.eqm.io.QueryFormat;
import   net.jxta.edutella.eqm.QueryLiteral;
import   net.jxta.edutella.eqm.Rule;
import   net.jxta.edutella.eqm.StatementLiteral;
import   net.jxta.edutella.eqm.BuiltinLiteral;

import com.hp.hpl.jena.rdf.model.RDFNode;

import org.apache.log4j.Logger;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/**
 * @author Paiu Raluca
 *
 * To change the template for this generated type comment go to
 * Window>Preferences>Java>Code Generation>Code and Comments
 */

/*
 * Responsible for translating Datalog into GetData for the TAP System
 */
public class TapQueryFormat implements QueryFormat {

      Logger log = Logger.getLogger(TapQueryFormat.class);
      private static final String DEFAULT_STMT_VIEW_NAME = "EDUTELLA_STATEMENTS";

      // lists of variables, constants, etc.
      private ArrayList variables = new ArrayList();
      private ArrayList constants = new ArrayList();
      private ArrayList joins = new ArrayList();
      private ArrayList conditions = new ArrayList();
      private ArrayList literals = new ArrayList();
      private ArrayList outerJoinLiterals = new ArrayList();

      // the query that has to be resolved
      private Query eduquery;

      // number of statements in the query
      private int stmtcount = 0;

      private String stmtViewName;
      //private TapQueryFormat subFormat = null;

      // holds a list of variables for each tree
      public static ArrayList[] variableName;

      // vector of restrictions
      public static ArrayList[] restr;

      public static int[] numVars;

      // it will contain the trees corresponding to the QEL query
      private static NTree[] qelTree;

      // holds the rules in the QEL query
      private String[] regs;

      // number of trees constructed for a specified QEL query
      private static int numberOfTrees;

      public TapQueryFormat() {
             this.stmtViewName = DEFAULT_STMT_VIEW_NAME;
      }

public TapQueryFormat(String stmt) {
       this.stmtViewName = stmt;
}

// return the value of a private attribute "numberOfTrees"
public int getNumberOfTrees() {
       return TapQueryFormat.numberOfTrees;
}

// return the NTrees corresponding to a QEL query
public NTree[] getTrees() {
       return TapQueryFormat.qelTree;
}

/* (non-Javadoc)
  * @see net.jxta.edutella.eqm.io.QueryFormat#format(net.jxta.edutella.eqm.Query)
  */
public String format(Query src) {
        return format(src.getLiterals(), src.getOuterJoinLiterals(), src);
}

// create the tree(s) that correspond to the query
public void generateTree(Query src) {
       this.createTree(src.getLiterals(), src.getOuterJoinLiterals(), src);
}

private String format(
       Iterator literalIter,
       Iterator outerJoinIter,
       Query src) {

      this.generateTree(src);
      return "";
}

// create the tree(s) corresponding to the QEL query
private void createTree(
       Iterator literalIter,
       Iterator outerJoinIter,
       Query src) {


eduquery = src;
variables.clear();
constants.clear();
joins.clear();
conditions.clear();
literals.clear();
outerJoinLiterals.clear();
stmtcount = 0;

NTree temp;
NTree search;

int contor = 0;

while (literalIter.hasNext()) {
       this.literals.add(literalIter.next());
}

while (outerJoinIter.hasNext()) {
       this.outerJoinLiterals.add(outerJoinIter.next());
}

// how many rules there are in the QEL query
// fill in the vector "regs" with these rules
Iterator itRules = src.getRules();
int numberRules = 0;
while (itRules.hasNext()) {
       itRules.next();
       numberRules++;
}
this.regs = new String[numberRules];
itRules = src.getRules();
contor = 0;
// parse the string to get the body of the rule
while (itRules.hasNext()) {
       String rule = itRules.next().toString();
       int start = rule.indexOf(":");
       this.regs[contor] =
              new String(rule.substring(start + 3, rule.length()));
       contor++;
}


// for each rule we will have a distinct tree
TapQueryFormat.qelTree = new NTree[numberRules];
for (int i = 0; i < numberRules; i++)
       TapQueryFormat.qelTree[i] = new NTree();

int nrtemp = numberRules;
if (nrtemp == 0)
       nrtemp = 1;

TapQueryFormat.variableName = new ArrayList[nrtemp];
TapQueryFormat.numVars = new int[nrtemp];
TapQueryFormat.restr = new ArrayList[nrtemp];
for (int i = 0; i < nrtemp; i++) {
       TapQueryFormat.variableName[i] = new ArrayList();
       TapQueryFormat.restr[i] = new ArrayList();
       TapQueryFormat.numVars[i] = 0;
}

// setting the value of the attribute "numberOfTrees"
TapQueryFormat.numberOfTrees = numberRules;

contor = 0;

// if there are no rules we have a single tree
if (numberRules == 0) {
       TapQueryFormat.numberOfTrees = 1;
       TapQueryFormat.qelTree = new NTree[1];
       TapQueryFormat.qelTree[contor] = new NTree();

      for (Iterator i = this.literals.iterator(); i.hasNext();) {
             QueryLiteral literal = (QueryLiteral) i.next();

              search = null;
              temp = null;
              NTree tempNode = null;

              /*
               * ***********************************************************
               * Statement Literal
               * ***********************************************************
               */
              if (literal instanceof StatementLiteral) {

StatementLiteral stmt = (StatementLiteral) literal;

if (stmt.toString().indexOf("qel:member") != -1) {
       System.out.println(
              "\n\nqel:member is not " + "supported.\n\n");
       System.exit(0);
}

RDFNode subject = stmt.getSubject();
RDFNode predicate = stmt.getPredicate();
RDFNode object = stmt.getObject();

String subjName, predName, objName;

// check to see if there is already a corresponding node
boolean flag = false;
// check to see if it is a variable
boolean var = false;

subjName = new String();
predName = new String();
objName = new String();

/*   the predicate of the statement */
//   check to see if the predicate is a variable -
//   this is impossible for GetData
if   (eduquery.isVariable(predicate)) {
         System.out.println(
                "The predicate can NOT be a variable "
                       + "in case of the translation from QEL to GetData");
         System.exit(0);
}
predName = predicate.toString();

/* the subject of the statement */
// if it is a variable
if (eduquery.isVariable(subject)) {
       int start = subject.toString().indexOf("#");
       subjName =
              subject.toString().substring(
                     start + 1,
                     subject.toString().length());

       var = true;
}
// otherwise
else
       subjName = subject.toString();

// we check to see if there are nodes in the tree
int nr =
       TapQueryFormat.qelTree[contor].getNumberNodes(
              TapQueryFormat.qelTree[contor]);

// if there is no node in the tree
if (nr == 1) {
       TapQueryFormat.qelTree[contor] =
              new NTree(subjName, 10);

       // set the type of the new node
       if (var)
              TapQueryFormat.qelTree[contor].setType("variable");
       else
              TapQueryFormat.qelTree[contor].setType("resource");
       var = false;
} else {
       search =
              TapQueryFormat.qelTree[contor].findNode(subjName);
       if (search != null)
              flag = true;
       else {
              tempNode = new NTree(subjName, 10);
              if (var == true)
                     tempNode.setType("variable");
              else
                     tempNode.setType("resource");
       }
       var = false;
}

/* the object of the statement */
// if the object is a variable
if (eduquery.isVariable(object)) {
       int start = object.toString().indexOf("#");
       objName =

                object.toString().substring(
                       start + 1,
                       object.toString().length());
         var = true;
} else
         objName = object.toString();

// we check to see if there is already a node
// corresponding to the object of the statement
temp = TapQueryFormat.qelTree[contor].findNode(objName);
NTree insertNode;

// there is no node, so we create a new one
if (temp == null) {
       // set the name of the new node
       insertNode = new NTree(objName, 10);

       // set the type of the new node
       if (var)
              insertNode.setType("variable");
       else
              insertNode.setType("resource");
       var = false;
} else {
       insertNode = new NTree(temp);
}

// create an arc for this statement
if (flag == false)
       if (tempNode == null)
              TapQueryFormat.qelTree[contor].insert(
                     TapQueryFormat.qelTree[contor],
                     insertNode,
                     predName);
       else {
              tempNode.insert(tempNode, insertNode, predName);
              TapQueryFormat.qelTree[contor] =
                     new NTree(tempNode);
       }

else {
         search.insert(search, insertNode, predName);

             }
      }
      /* ******************************************************* */

      /*
        * **********************************************************
        * Builtin Literal
        * **********************************************************
        */
      else if (literal instanceof BuiltinLiteral) {
              this.conditions.add((BuiltinLiteral) literal);
      }

      /* ******************************************************* */
}

// add restrictions to the nodes in the tree
for (int k = 0; k < this.conditions.size(); k++) {
       BuiltinLiteral condition = (BuiltinLiteral) conditions.get(k);

      String predName =
             new String(condition.getPredicateName().toString());
      // the name of the builtin predicate
      int start = predName.toString().indexOf("#");
      predName = predName.substring(start + 1, predName.length());

      // the   first argument of the builtin predicate
      String   first = new String();
      // the   second argument of the builtin predicate
      String   second = new String();

      boolean f = false; // test if the first argument is a variable
      boolean s = false; // test if the second argument is a variable

      first = condition.getArg(0).toString();
      second = condition.getArg(1).toString();

      if (eduquery.isVariable(condition.getArg(0)))
             f = true;
      if (eduquery.isVariable(condition.getArg(1)))
             s = true;


if (first.indexOf('#') != -1) {
       // we need the name of the variable, without the '#' sign
       start = first.indexOf("#");
       first = first.substring(start + 1, first.length());
       f = true;
}
if (second.indexOf('#') != -1) {
       s = true;
}

// if the second argument is a variable
if (s) {
       start = second.indexOf("#");
       second = second.substring(start + 1, second.length());
}

//   if the first argument is a variable
//   the restriction will be added to the node
//   corresponding to this variable
if   (f) {
         int l = 0;

        for (l = 0;
               l < TapQueryFormat.variableName[contor].size();
               l++)
               if (TapQueryFormat
                      .variableName[contor]
                      .get(l)
                      .toString()
                      .compareTo(first)
                      == 0)
                      break;

        if (l < TapQueryFormat.variableName[contor].size()) {
               String restrictions = new String();
               restrictions += TapQueryFormat.restr[contor].get(l)
                      + "#"
                      + predName
                      + "("
                      + first
                      + ","
                      + second

                                 + ")";
                          TapQueryFormat.restr[contor].add(l, restrictions);
                   } else {
                          String restrictions = new String();
                          restrictions += predName
                                 + "("
                                 + first
                                 + ","
                                 + second
                                 + ")";
                          TapQueryFormat.restr[contor].add(l, restrictions);
                          TapQueryFormat.variableName[contor].add(l, first);
                   }
             }
      }
}

// the QEL query is made of rules
else {
       contor = 0;

      for (Iterator i = this.literals.iterator(); i.hasNext();) {
             QueryLiteral literal = (QueryLiteral) i.next();

             List rule = src.getMatchingRules(literal);

             for (int k = 0; k < rule.size(); k++, contor++) {
                    Rule kRule = (Rule) rule.get(k);

                   this.conditions.clear();
                   Iterator litRule = kRule.getLiterals();
                   while (litRule.hasNext()) {
                          QueryLiteral lit = (QueryLiteral) litRule.next();

                          /*
                           * ************************************************
                           * Statement Literal
                           * ************************************************
                           */
                          if (lit instanceof StatementLiteral) {
                                 StatementLiteral stmt = (StatementLiteral) lit;


search = null;
temp = null;

RDFNode subject = stmt.getSubject();
RDFNode predicate = stmt.getPredicate();
RDFNode object = stmt.getObject();

String subjName, predName, objName;

// check to see if there is already a
// corresponding node
boolean flag = false;
// check to see if it is a variable
boolean var = false;

subjName = new String();
predName = new String();
objName = new String();

/*   the predicate of the statement */
//   check to see if the predicate is a variable -
//   this is impossible for GetData
if   (eduquery.isVariable(predicate)) {
         System.out.println(
                "The predicate can NOT be a variable in "
                       + "case of the translation from QEL to "
                       + "GetData");
         System.exit(0);
}
predName = predicate.toString();

/* the subject of the statement */
// if it is a variable
if (eduquery.isVariable(subject)) {
       int start = subject.toString().indexOf("#");
       subjName =
              subject.toString().substring(
                     start + 1,
                     subject.toString().length());
       var = true;
}
// otherwise

else
       subjName = subject.toString();

// we check to see if there are nodes in the tree
int nr =
       TapQueryFormat.qelTree[contor].getNumberNodes(
              TapQueryFormat.qelTree[contor]);

// if there is no node in the tree
if (nr == 1) {
       TapQueryFormat.qelTree[contor] =
              new NTree(subjName, 10);

       // set the type of the new node
       if (var)
              TapQueryFormat.qelTree[contor].setType(
                     "variable");
       else
              TapQueryFormat.qelTree[contor].setType(
                     "resource");

       var = false;
} else {
       search =
              TapQueryFormat.qelTree[contor].findNode(
                     subjName);
       if (search != null)
              flag = true;
       var = false;
}

/* the object of the statement */
// if the object is a variable
if (eduquery.isVariable(object)) {
       int start = object.toString().indexOf("#");
       objName =
              object.toString().substring(
                     start + 1,
                     object.toString().length());
       var = true;
} else
       objName = object.toString();

      // we check to see if there is already a node
      // corresponding to the object of the statement
      temp =
             TapQueryFormat.qelTree[contor].findNode(
                    objName);
      NTree insertNode;

      // there is no node, so we create a new one
      if (temp == null) {
             // set the name of the new node
             insertNode = new NTree(objName, 10);

               // set the type of the new node
               if (var)
                      insertNode.setType("variable");
               else
                      insertNode.setType("resource");
               var = false;
      } else
               insertNode = new NTree(temp);

      // create an arc for this statement
      if (flag == false)
             TapQueryFormat.qelTree[contor].insert(
                    TapQueryFormat.qelTree[contor],
                    insertNode,
                    predName);
      else
             search.insert(search, insertNode, predName);
}
/* ************************************************ */

/*
  * ***************************************************
  * Builtin Literal
  * ***************************************************
  */
else if (lit instanceof BuiltinLiteral) {
        this.conditions.add((BuiltinLiteral) lit);
}


      /* ************************************************ */
}

// add restrictions to the nodes in the tree
for (int x = 0; x < this.conditions.size(); x++) {
       BuiltinLiteral condition =
              (BuiltinLiteral) conditions.get(x);

      String predName =
             new String(condition.getPredicateName().toString());
      // the name of the builtin predicate
      int start = predName.toString().indexOf("#");
      predName =
             predName.substring(start + 1, predName.length());

      String first = new String();
      // the first argument of the builtin predicate
      String second = new String();
      // the second argument of the builtin predicate
      boolean f = false;
      // test if the first argument is a variable
      boolean s = false;
      // test if the second argument is a variable

      first = condition.getArg(0).toString();
      second = condition.getArg(1).toString();

      if (eduquery.isVariable(condition.getArg(0)))
             f = true;
      if (eduquery.isVariable(condition.getArg(1)))
             s = true;

      if (first.indexOf('#') != -1) {
             // we need the name of the variable, without the '#'
             start = first.indexOf("#");
             first = first.substring(start + 1, first.length());
             f = true;
      }
      if (second.indexOf('#') != -1) {
             s = true;
      }


// if the second argument is a variable
if (s) {
       start = second.indexOf("#");
       second =
              second.substring(start + 1, second.length());
}

//   if the first argument is a variable
//   the restriction will be added to the node
//   corresponding to this variable
if   (f) {
         int l = 0;

        for (l = 0;
               l < TapQueryFormat.variableName[contor].size();
               l++)
               if (TapQueryFormat
                      .variableName[contor]
                      .get(l)
                      .toString()
                      .compareTo(first)
                      == 0)
                      break;

        if (l
               < TapQueryFormat.variableName[contor].size()) {
               String restrictions = new String();
               restrictions
                      += TapQueryFormat.restr[contor].get(l)
                      + "#"
                      + predName
                      + "("
                      + first
                      + ","
                      + second
                      + ")";
               TapQueryFormat.restr[contor].add(
                      l,
                      restrictions);
        } else {
               String restrictions = new String();
               restrictions += predName

                                                        + "("
                                                        + first
                                                        + ","
                                                        + second
                                                        + ")";
                                                 TapQueryFormat.restr[contor].add(
                                                        l,
                                                        restrictions);
                                                 TapQueryFormat.variableName[contor].add(
                                                        l,
                                                        first);
                                           }
                                     }
                               }
                         }
                   }
             }
     }
}

/*
 * TapProviderConnection.java
 *
 * Created on Apr 30, 2004
 *
 * To change the template for this generated file go to
 * Window>Preferences>Java>Code Generation>Code and Comments
 */
package net.jxta.edutella.provider.getdata;

import   net.jxta.edutella.eqm.Query;
import   net.jxta.edutella.eqm.ResultSet;
import   net.jxta.edutella.provider.AbstractProviderConnection;
import   net.jxta.edutella.provider.ProviderException;
import   net.jxta.edutella.util.Option;

import   net.jxta.edutella.provider.edu.stanford.TAP.Abbrev;
import   net.jxta.edutella.provider.edu.stanford.TAP.Client;
import   net.jxta.edutella.provider.edu.stanford.TAP.Cursor;
import   net.jxta.edutella.provider.edu.stanford.TAP.Description;
import   net.jxta.edutella.provider.edu.stanford.TAP.Description_XML;
import   net.jxta.edutella.provider.edu.stanford.TAP.KB;
import   net.jxta.edutella.provider.edu.stanford.TAP.Resource;
import   net.jxta.edutella.provider.edu.stanford.TAP.XML_Branch;

import org.apache.log4j.Logger;

import   java.net.*;
import   java.util.ArrayList;
import   java.util.StringTokenizer;
import   java.util.Vector;
import   java.io.*;

/**
 * @author Paiu Raluca
 *
 * To change the template for this generated type comment go to
 * Window>Preferences>Java>Code Generation>Code and Comments
 */
public class TapProviderConnection extends AbstractProviderConnection {
      private static Logger log = Logger.getLogger(TapProviderConnection.class);

     private   String stmtViewName;
     private   Client tap = null;
     private   String url = new String("http://localhost/data/tap.rdf");
     private   Resource[] answers;

     private URL u = null;
     private URLConnection conn = null;
     private TapQueryFormat queryFormat;

     // for each tree we hold an ordered list in order to execute
     // the branches in the tree as a GetData query
     private ArrayList[] listNodes;

     private boolean[] isDirect;

     // for each variable in the tree we hold the bindings which
// result from executing GetData queries
private PartialResult[][] bindRes;

private Vector finalResults = new Vector();
private Vector nameResults = new Vector();

private Bindings[][] bindings;

public void init() {
      try {
            u = new URL(url);
            conn = u.openConnection();
            tap = new Client(url);
            queryFormat = new TapQueryFormat(stmtViewName);
      } catch (IOException e) {
            log.error("Failed to connect to " + url, e);
      }
}

/* (non-Javadoc)
  * @see net.jxta.edutella.component.InspectableComponent#getDescription()
  */
public String getDescription() {
       return null;
}

/* (non-Javadoc)
  * @see net.jxta.edutella.provider.PooledConnection#close()
  */
public void close() {
       conn = null;
}

/* (non-Javadoc)
 * @see net.jxta.edutella.provider.PooledConnection#validate()
 */
public boolean validate() {
      try {
            return conn != null;
      } catch (Exception e) {
            return false;
      }
}

/* (non-Javadoc)
  * @see net.jxta.edutella.util.Configurable#getOptions()
  */
public Option[] getOptions() {
       return null;
}

/* (non-Javadoc)
  * @see net.jxta.edutella.util.Configurable#getPropertyPrefix()
  */
public String getPropertyPrefix() {
       return "provider";
}

public String getStmtViewName() {
      return stmtViewName;
}

/**
 * @return the result...
 */
public synchronized ResultSet executeQuery(Query eduquery) {
      try {
            this.executeTree();
            ResultSet results = null;

            System.out.println("\n\nRESULTS:\n\n");
            for (int i = 0; i < this.finalResults.size(); i++)
                  results = this.importTupleResults(eduquery, i);
            return null;
      } catch (ProviderException e) {
            throw e;
     } catch (Exception e) {
            log.error("Error processing a GetData query", e);
           throw new ProviderException(e);
     }
}

/**
 * Import the results we obtained after executing the query.
 *
 * @param eduquery the original Edutella query
 * @param i the index in the list of final results
 * @return the imported result set
 */
private ResultSet importTupleResults(Query eduquery, int i) {
      Cursor c = new Cursor(this.tap.GetTarget());
      Vector nodes = new Vector();
      Resource[] res = (Resource[]) this.finalResults.get(i);
      String varName = this.nameResults.get(i).toString();

     System.out.println(varName);
     for (int j = 0; j < res.length; j++) {
           nodes.addElement(res[j].value);
           //System.out.println(nodes.get(j).toString());
     }

     c.AddResults(nodes);

     String url;
     KB kb = c.GetKnowledgeBase();
     url = kb.GetURL();
     String result = new String("");

     if (url == null) {
           System.out.println("Connection to database failed");
           System.exit(0);
     }


     int k = 0;
     for (String n = c.Next(); n != null; n = c.Next()) {
           result += this.importOneResult(kb, n, k++);
     }
     c.Release();

     System.out.println(
           "===============================================================");
     if (result.length() == 0)
            System.out.println("There were no results found.");
     else
           System.out.println(result);
     System.out.println(
           "===============================================================");
     return null;
}

/**
 * Serialize one result node from the knowledge base as RDF/XML.
 *
 * @param kb the knowledge base the node belongs to
 * @param n the name of the node to serialize
 * @param k the index of the result, used to rename anonymous nodes
 * @return the RDF/XML serialization of the node
 */
private String importOneResult(KB kb, String n, int k) {
      String usource = n;
      XML_Branch top;
      Description desc = new Description(null);

     if (Abbrev.NeedsNamespace(n))
           usource = kb.AddNamespace(n);

     desc.AddArcTarget("oid", usource);

     Cursor ac = kb.GetArcs(n);
     top = new XML_Branch("RDF");

     for (String arc = ac.Next(); arc != null; arc = ac.Next()) {


           Cursor tc = kb.GetTargets(n, arc);
           String uarc = arc;
           if (Abbrev.NeedsNamespace(arc)) {
                 uarc = kb.AddNamespace(arc);
           }

           for (String target = tc.Next();
                 target != null;
                 target = tc.Next()) {
                 if (!arc.equals("oid")) {
                       String utarget = target;

                       if (!kb.IsLexiconArc(arc)
                             && (kb.HasNode(target)
                                   || arc.equals("type")
                                   || arc.equals("subClassOf"))) {
                             if (Abbrev.NeedsNamespace(target))
                                   utarget = kb.AddNamespace(target);
                       }
                       desc.AddArcTarget(uarc, utarget);
                 }
           }
           tc.Release();
     }
     ac.Release();

     XML_Branch child = Description_XML.RDF_ToBranch(desc);
     top.AddChild(child);
     String str = child.ToString(2, false, new StringBuffer()).toString();
     String repl = new String("n" + k);
     String old = new String("[n]\\d+");
     str = str.replaceAll(old, repl);
     return str;
}

/*
 * execute sequentially the GetData queries corresponding
 * to the branches in the trees
 */
public void executeTree() {
      this.sendGetData();

     // if we have more trees we have to gather all the results
     this.union();
}

/*
 * send a GetData query for each branch of the trees
 * and collect the resulting bindings
 */
private void sendGetData() {
      int num = this.queryFormat.getNumberOfTrees();
      if (num == 0)
            num = 1;

     this.listNodes = new ArrayList[num];
     this.answers = new Resource[num];
     this.isDirect = new boolean[num];

     for (int i = 0; i < num; i++)
           this.isDirect[i] = false;

     NTree[] qelTrees = this.queryFormat.getTrees();
     int nrAnt = 0;

     this.bindRes = new PartialResult[num][5];
     for (int m = 0; m < num; m++)
           for (int n = 0; n < 5; n++)
                 this.bindRes[m][n] = new PartialResult();

     this.bindings = new Bindings[num][5];
     for (int m = 0; m < num; m++)
           for (int n = 0; n < 5; n++)
                 this.bindings[m][n] = new Bindings();

     // if the connection to the KB could not be established we abort
if (this.conn.getContentType() == null) {
      System.out.println("Connection to KB could not be established");
      System.exit(0);
}

// for every tree
for (int i = 0; i < num; i++) {
      if (i != 0)
            nrAnt = qelTrees[i - 1].getNumberNodes(qelTrees[i - 1]);
      // the number of nodes in the current tree
      int numNodes = qelTrees[i].getNumberNodes(qelTrees[i]);

     this.listNodes[i] = new ArrayList(numNodes - nrAnt);

     // if the root of the tree is a variable, we will have inverse order
     if (qelTrees[i].getType().compareTo("variable") == 0) {
           getOrderNodes(i, qelTrees[i]);
     }
     // otherwise we will have direct order
     else {
           getDirectOrderNodes(i, qelTrees[i]);
           this.isDirect[i] = true;
     }

      // trees with more than one level which have to be traversed
     // in direct order, will be specially treated
     if (this.isDirect[i] == true) {
           for (int k = 0; k < this.listNodes[i].size(); k++) {
                 String name = this.listNodes[i].get(k).toString();

                 // the node in the tree corresponding to the current node
                 NTree parent = qelTrees[i].findNode(qelTrees[i], name);

                 //the number of children for this node
                 int numChildren = parent.getNumChildren();

                 for (int l = 0; l < numChildren; l++) {
                       // the l-th child of the current node
NTree child = parent.getChildren()[l];
// the label of the arc leading from the parent
// to the child
String label = qelTrees[i].findLabel(parent, child);

// if the parent is a variable, we will make a query
// for all the bindings corresponding to this variable
if (parent.getType().compareTo("variable") == 0) {
      int nr = this.findBindRes(i, parent.getName());
      int dim = this.bindRes[i][nr].getResQ().length;
      Resource pred = new Resource(label);
      Resource iter = new Resource();
      Resource[] vecRes = this.bindRes[i][nr].getResQ();

     for (int m = 0; m < dim; m++) {
           iter = vecRes[m];
            // skip bindings invalidated by earlier restrictions
            if (iter.value == null)
                  continue;
           Resource ans_m = this.tap.GetData(iter, pred);
           this.makeBindings(ans_m, child.getName(), i);
     }

     // now we apply the restrictions for the results
     // we obtained for the child
     if (TapQueryFormat.variableName[i].size() > 0) {
           String restrictions = null;
           int ll = 0;

           for (ll = 0;
                 ll < TapQueryFormat.variableName[i].size();
                 ll++)
                 if (TapQueryFormat
                       .variableName[i]
                       .get(ll)
                       .toString()
                       .compareTo(child.getName())
                       == 0)
                       break;


           if (ll
                 < TapQueryFormat.variableName[i].size()) {
                 restrictions =
                       TapQueryFormat
                             .restr[i]
                             .get(ll)
                             .toString();
                 this.applyRestrictions(
                       restrictions,
                       child.getName(),
                       i);
           }
     }
}

// the parent is not a variable
else {
      Resource pred = new Resource(label);
      Resource subj = new Resource(parent.getName());
      answers[i] = this.tap.GetData(subj, pred);
      this.makeBindings(answers[i], child.getName(), i);

     // apply the restrictions

     if (TapQueryFormat.variableName[i].size() > 0) {
           String restrictions = null;
           int ll = 0;

           for (ll = 0;
                 ll < TapQueryFormat.variableName[i].size();
                 ll++)
                 if (TapQueryFormat
                       .variableName[i]
                       .get(ll)
                       .toString()
                       .compareTo(child.getName())
                       == 0)
                       break;
                             if (ll
                                   < TapQueryFormat.variableName[i].size()) {
                                   restrictions =
                                         TapQueryFormat
                                               .restr[i]
                                               .get(ll)
                                               .toString();
                                   this.applyRestrictions(
                                         restrictions,
                                         child.getName(),
                                         i);
                             }
                       }
                 }
           }
     }
}

/*
 * for each node in the above list
 * we must construct a GetData query
 * the parent of the node - the SUBJECT
 * the node itself - the OBJECT
 * the label of the arc - the PREDICATE
 */
else {
      for (int k = 0; k < this.listNodes[i].size() - 1; k++) {
            String name = this.listNodes[i].get(k).toString();

           // the node in the tree corresponding to the current node
           NTree child = qelTrees[i].findNode(qelTrees[i], name);
           // the parent of the current node
           NTree parent = qelTrees[i].findParent(qelTrees[i], child);
           // the label of the arc leading from the parent to the child
           String label = qelTrees[i].findLabel(parent, child);

            // the subject and the object of the GetData cannot be
// both variables at the same time
if (child.getNumChildren() == 0
      && child.getType().compareTo("variable") == 0
      && parent.getType().compareTo("variable") == 0) {
      System.out.println(
            "Subject and object cannot both be "
                  + "variables at the same time");
      System.exit(0);
}

/*
 * INVERSE ORDER
 * we know the object and the label of the arc
 * that points to this object
 * we want to find the value of the subject
 */
if (parent.getType().compareTo("variable") == 0
      && this.isDirect[i] == false) {
      Resource obj = new Resource(child.getName());

      // we have to check if the object is also a variable;
      // in this case this variable will be sequentially bound
      // to the results of the corresponding previous query
      if (child.getType().compareTo("variable") == 0) {

            // first we have to apply the restrictions for this node
            if (TapQueryFormat.variableName[i].size() > 0) {
                  String restrictions = null;
                  int l = 0;

                  for (l = 0;
                        l < TapQueryFormat.variableName[i].size();
                        l++)
                        if (TapQueryFormat
                              .variableName[i]
                              .get(l)
                              .toString()
                              .compareTo(child.getName())
                        == 0)
                        break;

           if (l
                   < TapQueryFormat.variableName[i].size()) {
                   restrictions =
                         TapQueryFormat
                               .restr[i]
                               .get(l)
                               .toString();
                   this.applyRestrictions(
                         restrictions,
                         child.getName(),
                         i);
           }
     }

     int nr = this.findBindRes(i, child.getName());
     int dim = this.bindRes[i][nr].getResQ().length;
     Resource pred = new Resource(label);
     Resource iter = new Resource();
     Resource[] vecRes = this.bindRes[i][nr].getResQ();

     // for each value from the bindings vector we make
     // a query
     // the results of each query are stored
     for (int m = 0; m < dim; m++) {
           iter = vecRes[m];
            // skip bindings invalidated by earlier restrictions
            if (iter.value == null)
                  continue;
           Resource ans_m =
                 this.tap.GetData(iter, pred, "inverse=yes");
           this.makeBindings(ans_m, parent.getName(), i);
     }
}
// the object is a resource or a constant
else {
      Resource pred = new Resource(label);
      answers[i] =
                 this.tap.GetData(obj, pred, "inverse=yes");
           this.makeBindings(answers[i], parent.getName(), i);
      }
}
/* ******************************************************** */

/*
 * DIRECT ORDER
 * we know the subject and the predicate and
 * we want to find the value of the object
 */
if (child.getType().compareTo("variable") == 0) {
      Resource subj = new Resource(parent.getName());
      Resource pred = new Resource(label);

     answers[i] = this.tap.GetData(subj, pred);

     this.isDirect[i] = true;

     //for (int m=0; m<answers[i].count(); i++)
     //    System.out.println (answers[i].item(m).value);

     // we make the bindings for this variable
     this.makeBindings(answers[i], child.getName(), i);

     // we have to apply the restrictions for this node
     if (TapQueryFormat.variableName[i].size() > 0) {
           String restrictions = null;
           int l = 0;

           for (l = 0;
                 l < TapQueryFormat.variableName[i].size();
                 l++)
                 if (TapQueryFormat
                       .variableName[i]
                       .get(l)
                       .toString()
                       .compareTo(child.getName())
                                   == 0)
                                   break;

                       if (l < TapQueryFormat.variableName[i].size()) {
                             restrictions =
                                   TapQueryFormat.restr[i].get(l).toString();
                             this.applyRestrictions(
                                   restrictions,
                                   child.getName(),
                                   i);
                       }
                 }

           }
           /* ******************************************************** */
     }
}

// apply the restrictions for the root of the tree
// it wasn't included in the above for
String name =
      this.listNodes[i].get(this.listNodes[i].size() - 1).toString();
// the node in the tree corresponding to the current node
// in the list
NTree child = qelTrees[i].findNode(qelTrees[i], name);

if (child.getType().compareTo("variable") == 0) {
      // first we have to apply the restrictions for this node
      if (TapQueryFormat.variableName[i].size() > 0) {
            String restrictions = null;
            int l = 0;

           for (l = 0; l < TapQueryFormat.variableName[i].size(); l++)
                 if (TapQueryFormat
                       .variableName[i]
                       .get(l)
                       .toString()
                       .compareTo(child.getName())
                                   == 0)
                                   break;

                       if (l < TapQueryFormat.variableName[i].size()) {
                             restrictions =
                                   TapQueryFormat.restr[i].get(l).toString();
                             this.applyRestrictions(
                                   restrictions,
                                   child.getName(),
                                   i);
                       }
                 }
           }

           /* ************************************************************ */

           /*
            * INTERSECTION OF RESULTS FOR EACH TREE
            */
           this.intersection(i);
     }
}

// get the maximum value from the vector that indicates which resource
// satisfies the restrictions
private int getMaxValue(int arbNum, int num) {
      if (num < 0)
            return 0;
      int length = this.bindRes[arbNum][num].getResQ().length;
      int maxim = 0;

     for (int i = 0; i < length; i++)
           if (this.bindRes[arbNum][num].getValid()[i] > maxim)
                 maxim = this.bindRes[arbNum][num].getValid()[i];

     return maxim;
}


// apply the restrictions of the node to the results obtained
private void applyRestrictions(String restr, String var, int arbNum) {
      StringTokenizer tok = new StringTokenizer(restr, "#");
      int maxHistory = 0;
      int dimension = this.numberVariables();

     while (tok.hasMoreTokens()) {
           String token = tok.nextToken();

           // get the name of the builtin literal
           StringTokenizer nameTok = new StringTokenizer(token, "(");
           String name = nameTok.nextToken();

           // getting the first argument
           int start = token.indexOf('(');
           int stop = token.indexOf(',');
           String first = token.substring(start + 1, stop);

           // getting the second argument
           start = stop;
           stop = token.length() - 1;
           String second = token.substring(start + 1, stop);

            // depending on the name of the builtin predicate,
            // we apply a specific restriction
            // LIKE
            if (name.compareTo("like") == 0) {
                   this.applyLike(first, second, var, arbNum);
           }

           // EQUALS
           else if (name.compareTo("equals") == 0) {
                 this.applyEquals(first, second, var, arbNum);
           }

           // LESSTHAN
           else if (name.compareTo("lessThan") == 0) {
                 this.applyLessThan(first, second, var, arbNum);
}

// GREATERTHAN
else if (name.compareTo("greaterThan") == 0) {
      this.applyGreaterThan(first, second, var, arbNum);
}

// LANGUAGE
else if (name.compareTo("language") == 0) {
      this.applyLanguage(first, second, var, arbNum);
}

// NODETYPE
else if (name.compareTo("nodeType") == 0) {
      this.applyNodeType(first, second, var, arbNum);
}

// STRINGVALUE
else if (name.compareTo("stringValue") == 0) {
      this.applyStringValue(first, second, var, arbNum);
}

// DATATYPE
else if (name.compareTo("dataType") == 0) {
      this.applyDataType(first, second, var, arbNum);
}

int num = this.findBindRes(arbNum, var);
int maxim = this.getMaxValue(arbNum, num);
int length = this.bindRes[arbNum][num].getResQ().length;

// if the maxim is 0, it means that no result satisfied this restriction
if (maxim == 0) {
      maxHistory = 1;
      for (int i = 0; i < length; i++)
            this.bindRes[arbNum][num].setResQ(null, i);
      continue;
     }

     Resource[] temp = new Resource[length];

     int contor = 0;
     for (int i = 0; i < length; i++) {
           if (this.bindRes[arbNum][num].getValid()[i] == maxim) {
                 temp[contor] = this.bindRes[arbNum][num].getResQ()[i];
                 contor++;
           }
     }

     this.bindRes[arbNum][num] = new PartialResult(var, contor);

     if (contor == 0)
           for (int m = 0; m < contor; m++)
                 this.bindRes[arbNum][num].setResQ(null, m);
     else
           for (int m = 0; m < contor; m++) {
                 this.bindRes[arbNum][num].setResQ(temp[m], m);
           }
}

if (maxHistory == 1) {
      // get the bindings corresponding to the variable "var",
      // in the arbNum-th tree
      int num = this.findBindRes(arbNum, var);
      int maxim = this.getMaxValue(arbNum, num);
      int length = this.bindRes[arbNum][num].getResQ().length;
      if (maxim == 0) {
            for (int i = 0; i < length; i++)
                  this.bindRes[arbNum][num].setResQ(new Resource(), i);
            return;
      }

     Resource[] res = new Resource[length];
     res = this.bindRes[arbNum][num].getResQ();
     Resource[] temp = new Resource[length];
           for (int i = 0; i < length; i++) {
                 if (this.bindRes[arbNum][num].getValid()[i] == maxim)
                       temp[i] = this.bindRes[arbNum][num].getResQ()[i];
           }

           int i;
           for (i = 0; i < 5; i++)
                 if (this.bindings[arbNum][i].varName == null)
                       break;
           this.bindings[arbNum][i].setVarName(var);
           this.bindings[arbNum][i].setValidRes(temp);
     }
}

// apply the "LIKE" restriction
private void applyLike(String arg1, String arg2, String var, int arbNum) {
      Cursor c = this.applyGen(arg1, arg2, var, arbNum);
      if (c == null)
            return;
      String result = new String("");
      result = this.serializeCursor(c, arg2, "like", var, arbNum);
}

// apply the "EQUALS" restriction
private void applyEquals(
      String arg1,
      String arg2,
      String var,
      int arbNum) {
      Cursor c = this.applyGen(arg1, arg2, var, arbNum);
      String result = new String("");
      result = this.serializeCursor(c, arg2, "equals", var, arbNum);
}

// apply the "LESSTHAN" restriction
private void applyLessThan(
      String arg1,
     String arg2,
     String var,
     int arbNum) {
     Cursor c = this.applyGen(arg1, arg2, var, arbNum);
     String result = new String("");
     result = this.serializeCursor(c, arg2, "lessThan", var, arbNum);
}

// apply the "GREATERTHAN" restriction
private void applyGreaterThan(
      String arg1,
      String arg2,
      String var,
      int arbNum) {
      Cursor c = this.applyGen(arg1, arg2, var, arbNum);
      String result = new String("");
      result = this.serializeCursor(c, arg2, "greaterThan", var, arbNum);
}

// apply the "DATATYPE" restriction
private void applyDataType(
      String arg1,
      String arg2,
      String var,
      int arbNum) {
      Cursor c = this.applyGen(arg1, arg2, var, arbNum);
      String result = new String("");
      result = this.serializeCursor(c, arg2, "dataType", var, arbNum);
}

// apply the "STRINGVALUE" restriction
private void applyStringValue(
      String arg1,
      String arg2,
      String var,
      int arbNum) {
      Cursor c = this.applyGen(arg1, arg2, var, arbNum);
      String result = new String("");
     result = this.serializeCursor(c, arg2, "stringValue", var, arbNum);
}

// apply the "NODETYPE" restriction
private void applyNodeType(
      String arg1,
      String arg2,
      String var,
      int arbNum) {
      // get the bindings corresponding to the variable "var", in the arbNum-th tree
      int num = this.findBindRes(arbNum, var);
      int length = this.bindRes[arbNum][num].getResQ().length;

     if (arg2.compareTo("qel:Literal") != 0
           && arg2.compareTo("qel:Resource") != 0
           && arg2.compareTo("qel:AnonymousResource") != 0
           && arg2.compareTo("qel:NonAnonymousResource") != 0) {

           for (int i = 0; i < length; i++)
                 this.bindRes[arbNum][num].setResQ(null, i);

           return;
     }

     Cursor c = this.applyGen(arg1, arg2, var, arbNum);

     String result = new String("");
     result = this.serializeCursor(c, arg2, "nodeType", var, arbNum);
}

// apply the "LANGUAGE" restriction
private void applyLanguage(
      String arg1,
      String arg2,
      String var,
      int arbNum) {
      Cursor c = this.applyGen(arg1, arg2, var, arbNum);
      if (c == null)
           return;
     String result = new String("");
     result = this.serializeCursor(c, arg2, "language", var, arbNum);
}

private Cursor applyGen(String arg1, String arg2, String var, int arbNum) {
      // get the bindings corresponding to the variable "var", in the arbNum-th tree
      int num = this.findBindRes(arbNum, var);

     if (num < 0)
           return null;

     int length = this.bindRes[arbNum][num].getResQ().length;
     Resource[] res = new Resource[length];
     res = this.bindRes[arbNum][num].getResQ();

     Cursor c = new Cursor(this.tap.GetTarget());
     Vector nodes = new Vector();

     for (int i = 0; i < res.length; i++) {
           nodes.addElement(res[i].value);
     }

     c.AddResults(nodes);
     return c;
}

// get the nodes corresponding to the obtained resources
// we use a cursor to extract the properties for each node from the KB
private String serializeCursor(
      Cursor c,
      String arg,
      String type,
      String var,
      int arbNum) {
      String url;
      KB kb = c.GetKnowledgeBase();
      url = kb.GetURL();
     String result = new String("");

     if (url == null) {
           System.out.println("Connection to database failed");
           System.exit(0);
     }

     for (String n = c.Next(); n != null; n = c.Next())
           result += this.serializeOne(kb, n, arg, type, var, arbNum);

     c.Release();
     return result;
}

// extract the properties for one node from the KB
private String serializeOne(
      KB kb,
      String n,
      String arg,
      String type,
      String var,
      int arbNum) {
      String usource = n;
      XML_Branch top;
      Description desc = new Description(null);

     if (Abbrev.NeedsNamespace(n))
           usource = kb.AddNamespace(n);

     desc.AddArcTarget("oid", usource);

     Cursor ac = kb.GetArcs(n);
     top = new XML_Branch("RDF");

     for (String arc = ac.Next(); arc != null; arc = ac.Next()) {

           Cursor tc = kb.GetTargets(n, arc);
           String uarc = arc;
     if (Abbrev.NeedsNamespace(arc)) {
           uarc = kb.AddNamespace(arc);
     }

     for (String target = tc.Next();
           target != null;
           target = tc.Next()) {
           if (!arc.equals("oid")) {
                 String utarget = target;

                 if (!kb.IsLexiconArc(arc)
                       && (kb.HasNode(target)
                             || arc.equals("type")
                             || arc.equals("subClassOf"))) {
                       if (Abbrev.NeedsNamespace(target))
                             utarget = kb.AddNamespace(target);
                 }
                 desc.AddArcTarget(uarc, utarget);
           }
     }
     tc.Release();
}
ac.Release();

XML_Branch child = Description_XML.RDF_ToBranch(desc);
top.AddChild(child);
String str = child.ToString(2, false, new StringBuffer()).toString();

// according to the restriction type, we search something different
// in the returned string
String restr = new String();
// LANGUAGE
if (type.compareTo("language") == 0)
      restr = new String("xml:lang=\"" + arg + "\"");
// NODETYPE
else if (type.compareTo("nodeType") == 0)
      restr = new String("<rdf:type>" + arg);
// STRINGVALUE, DATATYPE, LIKE
else if (
      type.compareTo("stringValue") == 0
            || type.compareTo("dataType") == 0
            || type.compareTo("like") == 0)
      restr = new String(arg);
// EQUALS
else if (type.compareTo("equals") == 0) {
      // if the second argument is a literal,
      // it means that we have to match the string values
      int start = arg.indexOf("\"");
      int stop = arg.lastIndexOf("\"");
      if (start != -1)
            restr =
                  new String(
                        "\">"
                              + arg.substring(start + 1, stop)
                              + "</rdfs:label>");
      // otherwise the 2 RDF nodes must be the same resource
      else {
            System.out.println(
                  "\n\nGetData does not support this kind of"
                        + " \"qel:equals(X, Y)\" predicate. Y must be a literal");
            System.exit(0);
      }
}
// LESSTHAN, GREATERTHAN
else if (
      type.compareTo("lessThan") == 0
            || type.compareTo("greaterThan") == 0) {

      // we have to extract the numerical value of "<rdf:value>"
      int start = str.indexOf("<rdf:value>") + "<rdf:value>".length();
      int stop = str.indexOf("</rdf:value>");
      String match1 = new String(str.substring(start, stop));

     start = arg.indexOf("\"");
     stop = arg.lastIndexOf("\"");
     String match2 = new String(arg.substring(start + 1, stop));
           Integer val1 = new Integer(match1);
           Integer val2 = new Integer(match2);

           // LESSTHAN
           if (type.compareTo("lessThan") == 0) {
                 if (val1.compareTo(val2) < 0)
                       return str;
                 return "";
           }
           // GREATERTHAN
           else {
                 if (val1.compareTo(val2) > 0)
                       return str;
                 return "";
           }
     }

     if (str.indexOf(restr) != -1) {
           this.modifyBindings(str, var, arbNum);
           return str;
     } else
           return "";
}

// modify the bindings corresponding to a variable,
// according to the restrictions
private void modifyBindings(String result, String var, int arbNum) {
      // get the bindings corresponding to the variable "var",
      // in the arbNum-th tree
      int num = this.findBindRes(arbNum, var);
      int length = this.bindRes[arbNum][num].getResQ().length;
      Resource[] res = new Resource[length];
      res = this.bindRes[arbNum][num].getResQ();

     for (int i = 0; i < length; i++) {
           String val = res[i].value;


           if (result.indexOf(val) != -1)
                 this.bindRes[arbNum][num].setValid(i);
     }
}

// traverse a tree bottom-up (post-order) to collect the node names
private void getOrderNodes(int k, NTree arb) {
      int nr = arb.getNumChildren();
      int i;

     for (i = 0; i < nr; i++)
           if (arb.getChildren()[i].getVisited() == false)
                 break;

     if (i == arb.getNumChildren() || arb.getNumChildren() == 0)
           arb.setVisited(true);

     for (i = 0; i < nr; i++)
           getOrderNodes(k, arb.getChildren()[i]);

     // we fill the name of the nodes in the k-th arraylist
     this.listNodes[k].add(arb.getName());
}

// traverse a tree top-down, from the root to the children
private void getDirectOrderNodes(int k, NTree arb) {
      int nr = arb.getNumChildren();
      int i;

     if (arb.getNumChildren() > 0)
           this.listNodes[k].add(arb.getName());

     for (i = 0; i < nr; i++)
           getDirectOrderNodes(k, arb.getChildren()[i]);
}

// test if the height of a tree is greater than 1
boolean testHeight(NTree arb) {
     int i = 1;
     int nr = arb.getNumChildren();

     for (int j = 0; j < nr; j++)
           if (arb.getChildren()[j].getNumChildren() >= 1)
                 i++;
     if (i != 1)
           return true;
     return false;
}

private int numberVariables() {
      int nr = this.queryFormat.getNumberOfTrees();
      if (nr == 0)
            nr = 1;
      int dimension = 0;

     NTree[] qelTrees = this.queryFormat.getTrees();
     for (int i = 0; i < nr; i++)
           if (dimension < qelTrees[i].getNumVariables(qelTrees[i]))
                 dimension = qelTrees[i].getNumVariables(qelTrees[i]);

     return dimension;
}

// find the index of the binding result for the variable named "vName"
private int findBindRes(int k, String vName) {
      int dimension = this.numberVariables();

     int num = this.queryFormat.getNumberOfTrees() * 5;

      for (int i = 0; i < 5; i++) {
           if (this.bindRes[k][i].varName == null)
                 continue;
           if (this.bindRes[k][i].varName.compareTo(vName) == 0)
                 return i;
     }
     // it wasn't found
     return -1;
}

// return the index of the next free PartialResult in the vector of bindings
private int getNextBindRes(int k) {
      int dimension = this.numberVariables();

     for (int i = 0; i < 5; i++)
           if (this.bindRes[k][i].varName == null)
                 return i;

     // the vector of bindings is already full
     return -2;
}

// create the bindings for the specified variable
private void makeBindings(Resource answ, String var, int arbNum) {
      // if there already is a binding for that variable, findBindRes
      // returns its index in the vector of bindings
      int index = findBindRes(arbNum, var);
      int length = 0;
      PartialResult temp = new PartialResult();
      int i;
      int dimension = this.numberVariables();

     // some bindings for this variable were found
     if (index != -1) {
           // we have to extend the vector of bindings
           length = this.bindRes[arbNum][index].getResQ().length;

            Resource[] fillRes = this.bindRes[arbNum][index].getResQ();

           temp = new PartialResult(var, length);

           for (i = 0; i < length; i++)
                 temp.setResQ(fillRes[i], i);
     }

     // there is no binding for this variable, so we find a free place in the vector
     else if (index == -1)
           index = getNextBindRes(arbNum);

     if (index != -2) {
            Resource[] fillRes = new Resource[1];

            if (length != 0)
                  fillRes = temp.getResQ();

            // find out the number of results in the resource "answ" that must be bound
           int size = answ.count();

           this.bindRes[arbNum][index] = new PartialResult(var, size + length);

           if (length != 0)
                 for (i = 0; i < length; i++)
                       this.bindRes[arbNum][index].setResQ(fillRes[i], i);

            // append the newly obtained answers to the bindings
            for (i = 0; i < size; i++) {
                  this.bindRes[arbNum][index].setResQ(answ.item(i), i + length);
            }
     }
}

// intersect the bindings
// for those nodes which correspond to variables and which have more than one child
// (more branches emerge from them)
private void intersection(int j) {
      int dimension = this.numberVariables();

     for (int i = 0; i < 5; i++) {
            String varName = this.bindRes[j][i].varName;
            if (varName == null)
                  break;

            // find the number of children of the current node in the
            // original tree corresponding to the query
            NTree node =
                  this.queryFormat.getTrees()[j].findNode(
                        this.queryFormat.getTrees()[j], varName);
            int numChildren = node.getNumChildren();

            int contor = 0;

            int length = this.bindRes[j][i].getResQ().length;
            Resource resJ = new Resource();

            Resource[] temp = new Resource[length];
            for (int k = 0; k < length; k++)
                  temp[k] = new Resource();

            PartialResult iterRes = this.bindRes[j][i];
            for (int m = 0; m < length; m++) {
                  resJ = iterRes.getResQ()[m];
                  if (resJ.value == null)
                        continue;

                  int k = m + 1;

                  int numAppearance = 1;

                  // check to see if the resource appears more than once in the vector of bindings
                  // the number of appearances must be equal to the number of children of the node
                  for (k = m + 1; k < length; k++) {
                        if (resJ
                              .value
                              .compareTo(this.bindRes[j][i].getResQ()[k].value)
                              == 0) {
                             numAppearance++;
                             if (numAppearance == numChildren)
                                   break;
                       }
                 }

                 if (k < length || numChildren == 1 || this.isDirect[j] == true)
                       if (this.containsRes(temp, resJ) == false)
                             temp[contor++] = resJ;
           }

           nameResults.add(varName);

           this.bindRes[j][i] = new PartialResult(varName, contor);
           for (int m = 0; m < contor; m++) {
                 this.bindRes[j][i].setResQ(temp[m], m);
           }

           temp = this.bindRes[j][i].getResQ();
           finalResults.add(temp);
     }
}

// check if a resource is already contained in an array of resources
private boolean containsRes(Resource[] pr, Resource res) {
      for (int i = 0; i < pr.length; i++) {
            if (pr[i].value == null)
                  return false;
            if (pr[i].value.compareTo(res.value) == 0)
                  return true;
      }

     return false;
}

private boolean containsRes(Vector v, String x) {
      for (int i = 0; i < v.size(); i++)
            if (((String) v.get(i)).compareTo(x) == 0)
                 return true;

     return false;
}

private boolean containsRes(Vector v, Resource x) {
      for (int i = 0; i < v.size(); i++)
            if (((Resource) v.get(i)).value.compareTo(x.value) == 0)
                  return true;

     return false;
}

// make the union of the results corresponding to a variable which appears in more than one tree
private void union() {
      // we will operate on the vectors "nameResults" and "finalResults"
      Vector names = new Vector();
      Vector results = new Vector();
      for (int i = 0; i < this.nameResults.size(); i++) {
            String val = (String) this.nameResults.get(i);
            Vector temp = new Vector();
            boolean flag = false;

           if (this.containsRes(names, val) == false) {
                 flag = true;
                 names.add(val);

                 Resource[] t = (Resource[]) this.finalResults.get(i);

                 for (int j = 0; j < t.length; j++) temp.add(t[j]);

                 // if there are some other results for the same variable,
                 // we merge them
                 for (int j = i + 1; j < this.nameResults.size(); j++) {
                       if (val.compareTo((String) this.nameResults.get(j)) == 0) {
                             Resource[] r = (Resource[]) this.finalResults.get(j);

                             for (int k = 0; k < r.length; k++)
                                  if (this.containsRes(temp, r[k]) == false)
                                        temp.add(r[k]);
                       }
                 }
           }

           if (flag) {
                  Resource[] addVec = new Resource[temp.size()];
                  for (int j = 0; j < temp.size(); j++)
                        addVec[j] = (Resource) temp.get(j);
                 results.add(addVec);
           }
     }

      // modify the attributes "finalResults" and "nameResults"
     this.nameResults = new Vector();
     this.finalResults = new Vector();

     for (int i = 0; i < names.size(); i++) {
           this.nameResults.add(names.get(i));

            Resource[] t = (Resource[]) results.get(i);

           this.finalResults.add(t);
     }

}

class Bindings {
      String varName;
      Resource[] validRes;


        public Bindings() {
              varName = null;
        }

        public Bindings(int nr) {
              this.validRes = new Resource[nr];
              for (int i = 0; i < nr; i++) this.validRes[i] = new Resource();
        }

        public void setValidRes(Resource[] res) {
              this.validRes = res;
        }

        public void setVarName(String name) {
              varName = name;
        }
    }
}
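
The two traversal methods above, getOrderNodes and getDirectOrderNodes, visit the query tree in post-order (children before their parent) and pre-order (root first), respectively. The following is a standalone sketch of that difference; it is not part of the thesis sources, SimpleNode is a hypothetical stand-in for NTree, and it ignores the visited flags and the leaf filtering that the real methods perform.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified n-ary tree, a stand-in for NTree.
public class SimpleNode {
    String name;
    List<SimpleNode> children = new ArrayList<>();

    SimpleNode(String name) { this.name = name; }

    // Post-order: recurse into the children first, then record the node,
    // so leaves appear before their parents (as in getOrderNodes).
    static void postOrder(SimpleNode n, List<String> out) {
        for (SimpleNode c : n.children)
            postOrder(c, out);
        out.add(n.name);
    }

    // Pre-order: record the node first, then recurse into the children
    // (as in getDirectOrderNodes).
    static void preOrder(SimpleNode n, List<String> out) {
        out.add(n.name);
        for (SimpleNode c : n.children)
            preOrder(c, out);
    }

    public static void main(String[] args) {
        SimpleNode root = new SimpleNode("X");
        SimpleNode a = new SimpleNode("A");
        SimpleNode b = new SimpleNode("B");
        root.children.add(a);
        root.children.add(b);
        a.children.add(new SimpleNode("A1"));

        List<String> post = new ArrayList<>();
        postOrder(root, post);
        List<String> pre = new ArrayList<>();
        preOrder(root, pre);

        System.out.println(post); // [A1, A, B, X]
        System.out.println(pre);  // [X, A, A1, B]
    }
}
```

The post-order variant is what allows intersection() to be applied bottom-up: a node's bindings are only intersected after all of its children have been processed.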




CHAPTER 9. APPENDIX B.

SOME CODE FROM THE PERSONALIZED SEMANTIC WEB SEARCH.
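
The ELearningTap listing below repeatedly applies the same pre-processing step before querying the TAP knowledge base: split a comma-separated keyword string into tokens and discard stop words. A minimal standalone sketch of that step follows; it is not part of the thesis code, uses a small hypothetical stop list in place of the full stopWords array, and omits the stemming that testKeywordsStemmer() additionally performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

public class KeywordFilter {
    // Hypothetical subset of the stopWords array used in ELearningTap.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "the", "of", "and", "to"));

    // Split a comma-separated keyword string into single words and drop
    // the stop words, as the methods below do before consulting TAP.
    static List<String> filter(String keywords) {
        List<String> out = new ArrayList<>();
        StringTokenizer phrases = new StringTokenizer(keywords, ",");
        while (phrases.hasMoreTokens()) {
            StringTokenizer words =
                    new StringTokenizer(phrases.nextToken().trim(), " ");
            while (words.hasMoreTokens()) {
                String w = words.nextToken().toLowerCase();
                if (!STOP_WORDS.contains(w))
                    out.add(w);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(filter("the semantic web, search of resources"));
        // prints [semantic, web, search, resources]
    }
}
```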



/*
 * ELearningTap.java
 *
 * Created on Jun 30, 2004
 *
 * To change the template for this generated file go to
 * Window>Preferences>Java>Code Generation>Code and Comments
 */
package test;

/**
 * @author Paiu Raluca
 *
 * To change the template for this generated type comment go to
 * Window>Preferences>Java>Code Generation>Code and Comments
 */

import java.io.File;
import java.io.FileNotFoundException;

import java.util.StringTokenizer;
import java.util.Vector;

import com.hp.hpl.jena.rdf.model.*;

import edu.stanford.TAP.Resource;

public class ELearningTap extends Object {
      static final String fileName = "http://l3s.de/~paiu/rdf/result.rdf";

final String[] stopWords =
      {
            "a", "ii", "about", "above", "according", "across", "39", "actually", "ad", "adj", "ae", "af",
            "after", "afterwards", "ag", "again", "against", "ai", "al", "all", "almost", "alone", "along",
            "already", "also", "although", "always", "am", "among", "amongst", "an", "and", "another", "any",
            "anyhow", "anyone", "anything", "anywhere", "ao", "aq", "ar", "are", "aren", "aren't", "around",
            "arpa", "as", "at", "au", "aw", "az", "b", "ba", "bb", "bd", "be", "became", "because", "become",
            "becomes", "becoming", "been", "before", "beforehand", "begin", "beginning", "behind", "being",
            "below", "beside", "besides", "between", "beyond", "bf", "bg", "bh", "bi", "billion", "bj", "bm",
            "bn", "bo", "both", "br", "bs", "bt", "but", "buy", "bv", "bw", "by", "bz", "c", "ca", "can",
            "can't", "cannot", "caption", "cc", "cd", "cf", "cg", "ch", "ci", "ck", "cl", "click", "cm", "cn",
            "co", "co.", "com", "copy", "could", "couldn", "couldn't", "cr", "cs", "cu", "cv", "cx", "cy", "cz",
            "d", "de", "did", "didn", "didn't", "dj", "dk", "dm", "do", "does", "doesn", "doesn't", "don",
            "don't", "down", "during", "dz", "e", "each", "ec", "edu", "ee", "g", "eh", "eight", "eighty",
            "either", "else", "elsewhere", "end", "ending", "enough", "er", "es", "et", "etc", "even", "ever",
            "every", "everyone", "everything", "everywhere", "except", "f", "few", "fi", "fifty", "find",
            "first", "five", "fj", "fk", "fm", "fo", "for", "former", "formerly", "forty", "found", "four",
            "fr", "free", "from", "further", "fx", "g", "ga", "gb", "gd", "ge", "get", "gf", "gg", "gh", "gi",
            "gl", "gm", "gmt", "gn", "go", "gov", "gp", "gq", "gr", "gs", "gt", "gu", "gw", "gy", "h", "had",
            "has", "hasn", "hasn't", "have", "haven", "haven't", "he", "he'd", "he'll", "he's", "help", "hence",
            "her", "here", "here's", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him",
            "himself", "his", "hk", "hm", "hn", "home", "homepage", "how", "however", "hr", "ht", "htm", "html",
            "http", "hu", "hundred", "i", "i'd", "i'll", "i'm", "i've", "i.e.", "id", "ie", "if", "il", "im",
            "in", "inc", "inc.", "indeed", "information", "instead", "int", "into", "io", "iq", "ir", "is",
            "isn", "isn't", "it", "it's", "its", "itself", "j", "je", "jm", "jo", "join", "jp", "k", "ke", "kg",
            "kh", "ki", "km", "kn", "kp", "kr", "kw", "ky", "kz", "l", "la", "last", "later", "latter", "lb",
            "lc", "least", "less", "let", "let's", "li", "like", "likely", "lk", "ll", "lr", "ls", "lt", "ltd",
            "lu", "lv", "ly", "m", "ma", "made", "make", "makes", "many", "maybe", "mc", "md", "me", "meantime",
            "meanwhile", "mg", "mh", "microsoft", "might", "mil", "million", "miss", "mk", "ml", "mm", "mn",
            "mo", "more", "moreover", "most", "mostly", "mp", "mq", "mr", "mrs", "ms", "msie", "mt", "mu",
            "much", "must", "mv", "mw", "mx", "my", "myself", "mz", "n", "na", "namely", "nc", "ne", "neither",
            "net", "netscape", "never", "nevertheless", "new", "next", "nf", "ng", "ni", "nine", "ninety", "nl",
            "no", "nobody", "none", "nonetheless", "noone", "nor", "not", "nothing", "now", "nowhere", "np",
            "nr", "nu", "nz", "o", "of", "off", "often", "om", "on", "once", "one", "one's", "only", "onto",
            "or", "org", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "overall",
            "own", "p", "pa", "page", "pe", "per", "perhaps", "pf", "pg", "ph", "pk", "pl", "pm", "pn", "pr",
            "pt", "pw", "py", "q", "qa", "r", "rather", "re", "recent", "recently", "reserved", "ring", "ro",
            "ru", "rw", "s", "sa", "same", "sb", "sc", "sd", "se", "seem", "seemed", "seeming", "seems",
            "seven", "seventy", "several", "sg", "sh", "she", "she'd", "she'll", "she's", "should", "shouldn",
            "shouldn't", "si", "since", "site", "six", "sixty", "sj", "sk", "sl", "sm", "sn", "so", "some",
            "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "sr", "st", "still", "stop",
            "su", "such", "sv", "sy", "sz", "t", "taking", "tc", "td", "ten", "text", "tf", "tg", "test", "th",
            "than", "that", "that'll", "that's", "the", "their", "them", "themselves", "then", "thence", "there",
            "there'll", "there's", "thereafter", "thereby", "therefore", "therein", "thereupon", "these",
            "they", "they'd", "they'll", "they're", "they've", "thirty", "this", "those", "though", "thousand",
            "three", "through", "throughout", "thru", "thus", "tj", "tk", "tm", "tn", "to", "together", "too",
            "toward", "towards", "tp", "tr", "trillion", "tt", "tv", "tw", "twenty", "two", "tz", "u", "ua",
            "ug", "uk", "um", "under", "unless", "unlike", "unlikely", "until", "up", "upon", "us", "use",
            "used", "using", "uy", "uz", "v", "va", "vc", "ve", "very", "vg", "vi", "via", "vn", "vu", "w",
            "was", "wasn", "wasn't", "we", "we'd", "we'll", "we're", "we've", "web", "webpage", "website",
            "welcome", "well", "were", "weren", "weren't", "wf", "what", "what'll", "what's", "whatever", "when",
            "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon",
            "wherever", "whether", "which", "while", "whither", "who", "who'd", "who'll", "who's", "whoever",
            "NULL", "whole", "whom", "whomever", "whose", "why", "will", "with", "within", "without", "won",
            "won't", "would", "wouldn", "wouldn't", "ws", "www", "x", "y", "ye", "yes", "yet", "you", "you'd",
            "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves", "yt", "yu", "z", "za", "zm",
            "zr", "10", "z", "-", "introduction" };

        // the name of the input file
        private String inputFile;
        // the name of the output file - for keywords
        private String keywordsOutputFile;
        // the name of the output file - for title
        private String titleOutputFile;
        // the name of the output file - for title + SW terms
        private String titleSWOutputFile;
        // the name of the output file - for title + expressions (more than one term)
        private String titleExprOutputFile;
        // a model
        private Model model;
        // a statement iterator
private StmtIterator iter;

// vector of scores of the results
private Vector scores;

private UtilTap uTap;

/*
 * constructors
 */
public ELearningTap() {
      inputFile = new String("in.xml");
      keywordsOutputFile = new String("keywords_out.rdf");
      titleOutputFile = new String("title_out.rdf");
      titleSWOutputFile = new String("titleSW_out.rdf");
      titleExprOutputFile = new String("titleEXPR_out.rdf");
      scores = new Vector();

     uTap = new UtilTap();

      // initialize the attributes of the uTap object with the KB from TAP
     uTap.initResources();
}

public ELearningTap(String in, String out) {
      inputFile = new String(in);
      keywordsOutputFile = new String("keywords_" + out);
      titleOutputFile = new String("title_" + out);
      titleSWOutputFile = new String("titleSW_" + out);
      titleExprOutputFile = new String("titleEXPR_" + out);
      scores = new Vector();

     uTap = new UtilTap();

      // initialize the attributes of the uTap object with the KB from TAP
     uTap.initResources();
}
/* ********************************************************************** */
/*
 * Get the keywords and test them with KB from TAP
 */
private File testKeywords() {
      File out = new File(keywordsOutputFile);
      Vector results = new Vector();
      Vector substrKey = new Vector();

     // create an empty model
     this.createModel();

     // iterate over the statements in the input file
     while (iter.hasNext()) {
           // get next statement
           Statement stmt = iter.nextStatement();
           // get the subject
           com.hp.hpl.jena.rdf.model.Resource subject = stmt.getSubject();
           // get the predicate
           Property predicate = stmt.getPredicate();
           // get the object
           RDFNode object = stmt.getObject();

           // test if the predicate of the statement is "ontopub:keywords"
           if (predicate.getLocalName().compareTo("keywords") != 0)
                 continue;

           String keywords = object.toString();
           // vector to store the keywords for the current lecture
           Vector keyw = new Vector();
           StringTokenizer tokenizer = new StringTokenizer(keywords, ",");
           // put the keywords in the vector of keywords
           while (tokenizer.hasMoreTokens()) {
                 String currentKeyW = tokenizer.nextToken(",");
                 keyw.add(currentKeyW);
           }

           for (int i = 0; i < keyw.size(); i++) {
                   Vector partialSubstr =
                         this.getAllSubstrings(keyw.get(i).toString());

                   // add these substrings to the global vector of substrings
                   for (int j = 0; j < partialSubstr.size(); j++)
                         if (!substrKey.contains(partialSubstr.get(j)))
                               substrKey.add(partialSubstr.get(j));
             }

     }

      // sort the vector of substrings in decreasing order of the number of tokens
     substrKey = this.sortSubstrings(substrKey);

     // search in the TAP KB
     results = uTap.search(substrKey);

     System.out.println("Number of results for keywords: " + results.size());
     for (int i = 0; i < results.size(); i++)
           System.out.println(
                 ((Resource) results.get(i)).value
                       + "-->"
                       + uTap.getScores().get(i));

     try {
           uTap.createFile(out, results);
     } catch (FileNotFoundException e) {
           e.printStackTrace();
     }

     return out;
}

/*
 * Get the KEYWORDS for each lecture
 * Transform them so that they only contain the root of the word
 * Test the resulting words with the KB from TAP
 */
private File testKeywordsStemmer() {
      File out = new File(keywordsOutputFile);
      Vector results = new Vector();
      Vector keyw = new Vector();
      Stemmer s = new Stemmer();

     // create an empty model
     this.createModel();

     // iterate over the statements in the input file
     while (iter.hasNext()) {
           // get next statement
           Statement stmt = iter.nextStatement();
           // get the subject
           com.hp.hpl.jena.rdf.model.Resource subject = stmt.getSubject();
           // get the predicate
           Property predicate = stmt.getPredicate();
           // get the object
           RDFNode object = stmt.getObject();

           // test if the predicate of the statement is "ontopub:keywords"
           if (predicate.getLocalName().compareTo("keywords") != 0)
                 continue;

           String keywords = object.toString();

           StringTokenizer tokenizer = new StringTokenizer(keywords, ",");
           // put the keywords in the vector of keywords
           while (tokenizer.hasMoreTokens()) {
                 String currentKeyW = tokenizer.nextToken(",");
                 StringTokenizer token = new StringTokenizer(currentKeyW, " ");

                 while (token.hasMoreTokens()) {
                       String tok = token.nextToken(" ");
                       // delete special characters from the token
                       tok = this.deleteChars(tok);

                       // if this is not a stop word, stem it
                       // and add it to the vector
                       if (!isStopWord(tok)) {
                             s = s.initStemmer(tok);
                             s.stem();
                             keyw.add(s.toString());
                       }
                   }
             }
     }

     // search in the TAP KB
     results = uTap.search(keyw);

     System.out.println("Number of results for keywords: " + results.size());
     for (int i = 0; i < results.size(); i++)
           System.out.println(
                 ((Resource) results.get(i)).value
                       + "-->"
                       + uTap.getScores().get(i));

     try {
           uTap.createFile(out, results);
     } catch (FileNotFoundException e) {
           e.printStackTrace();
     }

     return out;
}

/*
 * Sort the vector given as argument in decreasing order of the number of
 * tokens in its substrings
 */
private Vector sortSubstrings(Vector input) {
      Vector output = new Vector();
      String[] substrings = new String[input.size()];
      int[] poz = new int[input.size()];


     for (int i = 0; i < input.size(); i++) {
           substrings[i] = input.get(i).toString();
           // find the number of tokens of the current element in the vector
           StringTokenizer tokenizer = new StringTokenizer(substrings[i], " ");
           poz[i] = tokenizer.countTokens();
     }

     this.quickSort(poz, substrings, 0, poz.length - 1);
     for (int i = 0; i < substrings.length; i++)
           output.add(substrings[i]);

     return output;
}

/*
 * Quick Sort the vector of substrings
 */
private void quickSort(int[] poz, String[] str, int lo0, int hi0) {
      int lo = lo0;
      int hi = hi0;

     if (lo >= hi)
           return;
      else if (lo == hi - 1) {
            // sort a two-element vector by swapping if necessary
            if (poz[lo] < poz[hi]) {
                  int T = poz[lo];
                  poz[lo] = poz[hi];
                  poz[hi] = T;

                 String temp = str[lo];
                 str[lo] = str[hi];
                 str[hi] = temp;
           }
           return;
     }

     // pick a pivot and move it out of the way
int pivot = poz[(lo + hi) / 2];
poz[(lo + hi) / 2] = poz[hi];
poz[hi] = pivot;
String strPivot = str[(lo + hi) / 2];
str[(lo + hi) / 2] = str[hi];
str[hi] = strPivot;

while (lo < hi) {
      // search forward from poz[lo] until an element is found that
      // is less than the pivot or lo >= hi
      while (poz[lo] >= pivot && lo < hi)
            lo++;

     // search backward from poz[hi] until an element is found that
     // is greater than the pivot, or lo >= hi
     while (pivot >= poz[hi] && lo < hi)
           hi--;

     // swap elements poz[lo] and poz[hi] and the corresponding
     // elements from the vector of substrings
     if (lo < hi) {
           int T = poz[lo];
           poz[lo] = poz[hi];
           poz[hi] = T;

           String temp = str[lo];
           str[lo] = str[hi];
           str[hi] = temp;
     }
}

// put the pivot into its final position
poz[hi0] = poz[hi];
poz[hi] = pivot;

str[hi0] = str[hi];
str[hi] = strPivot;


     // recursive calls;
     this.quickSort(poz, str, lo0, lo - 1);
     this.quickSort(poz, str, hi + 1, hi0);
}
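
The sortSubstrings()/quickSort() pair above orders the substrings so that the one with the most tokens comes first. For reference only (this is a sketch, not part of the thesis code), the same descending-by-token-count ordering can be expressed with a comparator from the collections API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class SortByTokens {
    public static void main(String[] args) {
        List<String> subs = new ArrayList<>(Arrays.asList(
                "semantic web search", "web", "semantic web"));

        // Descending by the number of whitespace-separated tokens,
        // matching what sortSubstrings()/quickSort() compute.
        subs.sort((a, b) ->
                new StringTokenizer(b, " ").countTokens()
                        - new StringTokenizer(a, " ").countTokens());

        System.out.println(subs);
        // prints [semantic web search, semantic web, web]
    }
}
```

Sorting the longer phrases first matters because the TAP lookup tries the most specific (multi-word) substrings before falling back to single words.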

/*
 * Get the title, reduce it to a list of semantic web terms and
 * test this computed list with the KB of TAP
 */
private File testTitle() {
      File out = new File(titleSWOutputFile);
      Vector results = new Vector();
      Vector words = new Vector();
      Glossar g = new Glossar();

     // create an empty model
     this.createModel();

     // create an empty model for the Glossary
     g.createModel();
     g.initializeModel();

     // iterate over the statements in the input file
     while (iter.hasNext()) {
            Statement stmt = iter.nextStatement(); // get next statement
            com.hp.hpl.jena.rdf.model.Resource subject = stmt.getSubject(); // get the subject
            Property predicate = stmt.getPredicate(); // get the predicate
            RDFNode object = stmt.getObject(); // get the object

           // test if the predicate of the statement is "dc:title"
           if (predicate.getLocalName().compareTo("title") != 0)
                 continue;

           // get the title of each lecture
           String title = object.toString();

           Vector temp = g.createListOfTerms(title, true);
             for (int i = 0; i < temp.size(); i++)
                   if (!words.contains(temp.get(i)))
                         words.add(temp.get(i));
     }

     // sort the substrings - the first substring has the most tokens
     words = this.sortSubstrings(words);

     // search in TAP KB
     results = uTap.search(words);

     System.out.println("Number of results for keywords: " + results.size());
     for (int i = 0; i < results.size(); i++)
           System.out.println(
                 ((Resource) results.get(i)).value
                       + "-->"
                       + uTap.getScores().get(i));

     try {
           uTap.createFile(out, results);
     } catch (FileNotFoundException e) {
           e.printStackTrace();
     }

     return out;
}

private File applyStemmer_SWT() {
      File out = new File(titleSWOutputFile);
      Vector results = new Vector();
      Vector words = new Vector();
      Glossar g = new Glossar();
      Stemmer s = new Stemmer();

     // create an empty model
     this.createModel();

     // create an empty model for the Glossary
g.createModel();
g.initializeModel();

// iterate over the statements in the input file
while (iter.hasNext()) {
      Statement stmt = iter.nextStatement(); // get next statement
      com.hp.hpl.jena.rdf.model.Resource subject = stmt.getSubject(); // get the subject
      Property predicate = stmt.getPredicate(); // get the predicate
      RDFNode object = stmt.getObject(); // get the object

     // test if the predicate of the statement is "ontopub:keywords"
     if (predicate.getLocalName().compareTo("keywords") != 0)
           continue;

     // get the keywords of each lecture
     String keywords = object.toString();

     Vector temp = g.createListOfTerms(keywords, false);
     for (int i = 0; i < temp.size(); i++)
           if (!words.contains(temp.get(i)))
                 words.add(temp.get(i));

     // add the keywords to the list of semantic web terms
     StringTokenizer tokenizer = new StringTokenizer(keywords, ",");
     while (tokenizer.hasMoreTokens()) {
           String part = tokenizer.nextToken(",");

           if (!words.contains(part))
                 words.add(part);

           StringTokenizer token = new StringTokenizer(part, " ");

           while (token.hasMoreTokens()) {
                 String item = token.nextToken(" ");

                 // eliminate special characters from the current token
                 item = this.deleteChars(item);
                 boolean isStop = false;
                 for (int cont = 0; cont < this.stopWords.length; cont++)
                       if (this.stopWords[cont].compareToIgnoreCase(item)
                             == 0)
                             isStop = true;

                 if (isStop)
                       continue;

                 s = s.initStemmer(item);
                 s.stem();

                  String stemmed = s.toString();
                  if (!words.contains(stemmed))
                        words.add(stemmed);
            }
        }
}

words = this.sortSubstrings(words);

for (int i = 0; i < words.size(); i++)
      System.out.println(words.get(i).toString());
System.out.println("\n\n\n");

// search in TAP KB
results = uTap.search(words);

System.out.println("Number of results for keywords: " + results.size());
for (int i = 0; i < results.size(); i++)
      System.out.println(
            ((Resource) results.get(i)).value
                  + "-->"
                  + uTap.getScores().get(i));

try {
      uTap.createFile(out, results);
} catch (FileNotFoundException e) {
           e.printStackTrace();
     }

     return out;
}

/*
 * Get the Keywords for each lecture, starting from them create a list of
 * semantic web terms, then test the created list with the KB from TAP
 */
private File testKeywords_SW(boolean plus) {
      File out = new File(titleSWOutputFile);
      Vector results = new Vector();
      Vector words = new Vector();
      Glossar g = new Glossar();

     // create an empty model
     this.createModel();

     // create an empty model for the Glossary
     g.createModel();
     g.initializeModel();

     // iterate over the statements in the input file
     while (iter.hasNext()) {
            Statement stmt = iter.nextStatement(); // get next statement
            com.hp.hpl.jena.rdf.model.Resource subject = stmt.getSubject(); // get the subject
            Property predicate = stmt.getPredicate(); // get the predicate
            RDFNode object = stmt.getObject(); // get the object

           // test if the predicate of the statement is "ontopub:keywords"
           if (predicate.getLocalName().compareTo("keywords") != 0)
                 continue;

           // get the keywords of each lecture
           String keywords = object.toString();


            Vector temp = g.createListOfTerms(keywords, false);
            for (int i = 0; i < temp.size(); i++)
                  if (!words.contains(temp.get(i)))
                        words.add(temp.get(i));

            // add the keywords to the list of semantic web terms
            if (plus) {
                  StringTokenizer tokenizer = new StringTokenizer(keywords, ",");
                  while (tokenizer.hasMoreTokens()) {
                        String part = tokenizer.nextToken(",");
                        StringTokenizer token = new StringTokenizer(part, " ");

                        while (token.hasMoreTokens()) {
                              String item = token.nextToken(" ");

                              // eliminate special characters from the current token
                              item = this.deleteChars(item);

                              boolean isStop = false;
                              for (int cont = 0;
                                    cont < this.stopWords.length;
                                    cont++)
                                    if (this.stopWords[cont].compareToIgnoreCase(item)
                                          == 0)
                                          isStop = true;

                              if (isStop)
                                    continue;
                              if (!words.contains(item))
                                    words.add(item);
                        }
                  }
            }
            //System.out.println ("keywords = " + keywords);
      }

//System.out.println ("\n\n\n");


     // sort the substrings - the first substring has the most tokens
     words = this.sortSubstrings(words);

     /*
     for (int i = 0; i < words.size(); i++)
           System.out.println (words.get(i).toString());
     System.out.println ("\n\n\n");
     */

     // search in TAP KB
     results = uTap.search(words);

     System.out.println("Number of results for keywords: " + results.size());
     for (int i = 0; i < results.size(); i++)
           System.out.println(
                 ((Resource) results.get(i)).value
                       + "-->"
                       + uTap.getScores().get(i));

     try {
           uTap.createFile(out, results);
     } catch (FileNotFoundException e) {
           e.printStackTrace();
     }

     return out;
}

/*
 * Use each word from the title that is not a stop-word to
 * get information from the TAP KB
 */
private File testWordsFromTitle() {
      File out = new File(titleOutputFile);
      Vector results = new Vector();
      Vector words = new Vector();

     // create an empty model
      this.createModel();

      // iterate over the statements in the input file
      while (iter.hasNext()) {
            Statement stmt = iter.nextStatement(); // get next statement
            com.hp.hpl.jena.rdf.model.Resource subject = stmt.getSubject();
            // get the subject
            Property predicate = stmt.getPredicate(); // get the predicate
            RDFNode object = stmt.getObject(); // get the object

            // test if the predicate of the statement is "dc:title"
            if (predicate.getLocalName().compareTo("title") != 0)
                  continue;

            // get the title of each lecture
            String title = object.toString();

            // create a vector of tokens from the title
            Vector temp = this.createInputVector(title, ",", " ");
            for (int i = 0; i < temp.size(); i++)
                  if (!words.contains(temp.get(i)))
                        words.add(temp.get(i));
      }

      // search in TAP KB
      results = uTap.search(words);

      System.out.println(
            "Number of results - each word from the title: " + results.size());
      for (int i = 0; i < results.size(); i++)
            System.out.println(
                  ((Resource) results.get(i)).value
                        + "-->"
                        + uTap.getScores().get(i));

      try {
            uTap.createFile(out, results);
      } catch (FileNotFoundException e) {
           e.printStackTrace();
     }

     return out;
}

/*
 * Get the title for each lecture
 * Stem every word that is not a stop-word
 * Test the resulting words with KB from TAP
 */
private File testWordsFromTitleStemmer() {
      File out = new File(titleOutputFile);
      Vector results = new Vector();
      Vector words = new Vector();
      Stemmer s = new Stemmer();

     // create an empty model
     this.createModel();

     // iterate over the statements in the input file
     while (iter.hasNext()) {
           Statement stmt = iter.nextStatement(); // get next statement
           com.hp.hpl.jena.rdf.model.Resource subject = stmt.getSubject();
           // get the subject
           Property predicate = stmt.getPredicate(); // get the predicate
           RDFNode object = stmt.getObject(); // get the object

           // test if the predicate of the statement is "dc:title"
           if (predicate.getLocalName().compareTo("title") != 0)
                 continue;

           // get the title of each lecture
           String title = object.toString();

           // create a vector of tokens from the title
           Vector temp = this.createInputVector(title, ",", " ");
           for (int i = 0; i < temp.size(); i++)
                   if (!words.contains(temp.get(i))) {
                         s = s.initStemmer(temp.get(i).toString());
                         s.stem();
                         words.add(s.toString());
                   }
     }

     // search in TAP KB
     results = uTap.search(words);

     System.out.println(
           "Number of results - each word from the title: " + results.size());
     for (int i = 0; i < results.size(); i++)
           System.out.println(
                 ((Resource) results.get(i)).value
                       + "-->"
                       + uTap.getScores().get(i));

     try {
           uTap.createFile(out, results);
     } catch (FileNotFoundException e) {
           e.printStackTrace();
     }

     return out;
}

private Vector createInputVector(
      String str,
      String delim1,
      String delim2) {
      Vector vec = new Vector();
      StringTokenizer tokenizer = new StringTokenizer(str, delim1);
      while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken(delim1);

             StringTokenizer tokenizerIntern =
                   new StringTokenizer(token, delim2);
           while (tokenizerIntern.hasMoreTokens()) {
                 String item = tokenizerIntern.nextToken(delim2);
                 item = this.deleteChars(item);
                 item = this.getSingForm(item);
                 if (!vec.contains(item) && this.isStopWord(item) == false)
                       vec.add(item);
           }
     }
     return vec;
}

/*
 * Use the title of each lecture to create expressions, then test these
 * expressions with KB from TAP
 */
private File testTitleExpressions() {
      File out = new File(titleExprOutputFile);
      Vector results = new Vector();
      Vector subStrings = new Vector();
      Vector words = new Vector();

     // create an empty model
     this.createModel();

     // iterate over the statements in the input file
     while (iter.hasNext()) {
           Statement stmt = iter.nextStatement(); // get next statement
           com.hp.hpl.jena.rdf.model.Resource subject = stmt.getSubject();
           // get the subject
           Property predicate = stmt.getPredicate(); // get the predicate
           RDFNode object = stmt.getObject(); // get the object

           // test if the predicate of the statement is "dc:title"
           if (predicate.getLocalName().compareTo("title") != 0)
                 continue;

           // get the title of each lecture
           String title = object.toString();
             // create all possible substrings from this title
             Vector substrings = this.getAllSubstrings(title);
             for (int i = 0; i < substrings.size(); i++)
                   if (!words.contains(substrings.get(i)))
                         words.add(substrings.get(i));
     }

     // sort the substrings - the first substring has the most tokens
     words = this.sortSubstrings(words);

     // search in TAP KB
     results = uTap.search(words);

     System.out.println(
           "Number of results - expressions from title: " + results.size());
     for (int i = 0; i < results.size(); i++)
           System.out.println(
                 ((Resource) results.get(i)).value
                       + "-->"
                       + uTap.getScores().get(i));

     try {
           uTap.createFile(out, results);
     } catch (FileNotFoundException e) {
           e.printStackTrace();
     }

     return out;
}

/*
 * Eliminate special characters from the string given as parameter
 * Return the resulting string
 */
private String deleteChars(String str) {
      String newStr = new String();
      for (int i = 0; i < str.length(); i++) {
           if (str.charAt(i) != ':'
                 && str.charAt(i) != '('
                 && str.charAt(i) != ')'
                 && str.charAt(i) != '\n'
                 && str.charAt(i) != ' ')
                 newStr += str.charAt(i);
     }

     return newStr;
}

/*
 * Return the singular form of a noun
 */
private String getSingForm(String str) {
      String newStr;
      int len = str.length();

      // tokens shorter than 3 characters are returned unchanged, so the
      // index checks below cannot run past the start of the string
      if (len < 3)
            return str;

     // for the case that we have something like: Tim's
     if (str.charAt(len - 1) == 's' && str.charAt(len - 2) == '\'')
           newStr = new String(str.substring(0, len - 2));
     // if it is a special case of plural (e.g. ontology - ontologies)
     else if (
           str.charAt(len - 1) == 's'
                 && str.charAt(len - 2) == 'e'
                 && str.charAt(len - 3) == 'i') {
           newStr = new String(str.substring(0, len - 3));
           newStr += "y";
     }
      // regular plural: strip the trailing 's', but leave nouns ending
      // in double 's' (like "class") unchanged
      else if (str.charAt(len - 1) == 's' && str.charAt(len - 2) != 's')
            newStr = new String(str.substring(0, len - 1));
      // the noun is already in singular form
      else
            newStr = new String(str);

     return newStr;
}
/*
 * Test if the word supplied as parameter is a stop-word or not
 */
private boolean isStopWord(String word) {
      for (int i = 0; i < this.stopWords.length; i++)
            if (this.stopWords[i].compareToIgnoreCase(word) == 0)
                  return true;

     return false;
}

/*
 * Given a string containing several whitespace-separated tokens,
 * create substrings of a given length from adjacent tokens
 */
private Vector createSubstrings(String str, int N) {
      Vector substrVec = new Vector();
      Vector strVec = new Vector();
      int nrTokens = 0;

     StringTokenizer tokenizer = new StringTokenizer(str, " ");
     while (tokenizer.hasMoreTokens()) {
           String token = tokenizer.nextToken(" ");
           token = this.deleteChars(token);
           nrTokens++;
           // add the token to the vector
           strVec.add(token);
     }

     // if the requested length of the substrings is greater than the
     // number of tokens, we will return NULL
     if (N > nrTokens)
           return null;

      for (int i = 0; i <= nrTokens - N; i++) {
           String tempStr = new String();
           int nrNonStop = 0;
            for (int j = i; j < i + N; j++) {
                  // if the current token is not a stop-word
                  // increment the number of non-stop words
                  if (!this.isStopWord(strVec.get(j).toString()))
                        nrNonStop++;
                 tempStr += strVec.get(j).toString();
                 if (j != i + N - 1)
                       tempStr += " ";
           }
           if (nrNonStop >= 2)
                 substrVec.add(tempStr);
     }

     return substrVec;
}

/*
 * The same as above, but taking a vector of tokens instead of a string
 */
private Vector createSubstrings(Vector strVec, int N) {
      Vector substrVec = new Vector();

     for (int i = 0; i <= strVec.size() - N; i++) {
           String tempStr = new String();

           // the number of non stop words in the current substring
           int nrNonStop = 0;
           for (int j = i; j < i + N; j++) {
                 // if the current token is not a stop-word
                 // increment the number of non-stop words
                 if (!this.isStopWord(strVec.get(j).toString()))
                       nrNonStop++;
                 tempStr += strVec.get(j).toString();
                 if (j != i + N - 1)
                       tempStr += " ";
           }
           if (nrNonStop >= 2)
                 substrVec.add(tempStr);
     }

     return substrVec;
}

/*
 * Create all possible substrings of different lengths from the string
 * supplied as argument.
 * The minimum length is 2, the maximum is the number of tokens contained
 * in the string "str".
 * Substrings are formed from adjacent tokens.
 * Get the singular form of the nouns
 */
public Vector getAllSubstrings(String str) {
      Vector substrVec = new Vector();

     Vector strVec = new Vector();
     int nrTokens = 0;

     StringTokenizer tokenizer = new StringTokenizer(str, " ");
     while (tokenizer.hasMoreTokens()) {
           String token = tokenizer.nextToken(" ");
           // eliminate all special characters from the token
           token = this.deleteChars(token);
           // transform the noun in the singular form
           token = this.getSingForm(token);
           nrTokens++;
           // add the token to the vector
           strVec.add(token);
     }

      // substring length runs from nrTokens down to 2
     String maxSubstr = new String();
     for (int i = 0; i < strVec.size(); i++) {
           maxSubstr += strVec.get(i);
           if (i != strVec.size() - 1)
                 maxSubstr += " ";
      }
      substrVec.add(maxSubstr);
      for (int len = nrTokens - 1; len > 1; len--) {
            Vector temp = this.createSubstrings(strVec, len);
            for (int i = 0; i < temp.size(); i++)
                  substrVec.add(temp.get(i));
      }

      return substrVec;
}

private void createModel() {
      // create an empty model
      model = ModelFactory.createDefaultModel();

      // read the RDF/XML file
      model.read(fileName);

      iter = model.listStatements();
}

/*
  * For each lecture / each object
  *          - get the KEYWORDS
  *          - test them with KB from TAP
  */
public File runApproachA() {
       File out = this.testKeywords();
       return out;
}

/*
 * For each lecture / each object
 *          - get the KEYWORDS
 *          - transform them with the aid of Stemmer Class
 *          - test them with KB from TAP
 */
public File runApproachA_Stemmer() {
      File out = this.testKeywordsStemmer();
      return out;
}

/*
  * For each lecture / each object
  *          - get the KEYWORDS
  *          - reduce the keywords to a list of Semantic Web Terms
  *          - test the computed list with KB from TAP
  */
public File runApproachA_SW() {
       File out = this.testKeywords_SW(true);
       return out;
}

/*
  * For each lecture / each object
  *          - get the TITLE
  *          - reduce it to a list of Semantic Web Terms
  *          - test the computed list with KB of TAP
  */
public File runApproachB() {
       File out = this.testTitle();
       return out;
}

/*
  * For each lecture / each object
  *          - get the TITLE
  *          - test each word from the title with KB from TAP
  */
public File runApproachB_Prime() {
       File out = this.testWordsFromTitle();
       return out;
}

/*
 * For each lecture / each object
 *          - get the TITLE
 *          - stem each word from the title that is not a stop-word
 *          - test each resulting word with KB from TAP
 */
public File runApproachB_Prime_Stemmer() {
      File out = this.testWordsFromTitleStemmer();
      return out;
}

/*
 * For each lecture / each object
 *          - get the TITLE
 *          - create a list of expressions (more than 1 word per expression)
 *          - test each expression from this list with KB from TAP
 */
public File runApproachC() {
      File out = this.testTitleExpressions();
      return out;
}

public File runStemmer_SWT() {
      File out = this.applyStemmer_SWT();
      return out;
}

}
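The two plain-text heuristics used throughout the class above, naive singularization (getSingForm) and adjacent-token substrings (createSubstrings), can be summarized in a short standalone sketch. The class and method names below are hypothetical, chosen only for illustration, and are not part of the thesis code:

```java
// Illustrative sketch only -- "HeuristicsSketch" and its method names are
// hypothetical, mirroring getSingForm() and createSubstrings() above.
import java.util.ArrayList;
import java.util.List;

public class HeuristicsSketch {

    // Naive singularization: "Tim's" -> "Tim", "ontologies" -> "ontology",
    // "terms" -> "term"; double-'s' nouns like "class" stay unchanged.
    public static String singularForm(String str) {
        int len = str.length();
        if (len < 3)
            return str;
        if (str.charAt(len - 1) == 's' && str.charAt(len - 2) == '\'')
            return str.substring(0, len - 2);
        if (str.endsWith("ies"))
            return str.substring(0, len - 3) + "y";
        if (str.charAt(len - 1) == 's' && str.charAt(len - 2) != 's')
            return str.substring(0, len - 1);
        return str;
    }

    // All substrings built from n adjacent tokens of the input.
    public static List<String> substringsOfLength(List<String> tokens, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i <= tokens.size() - n; i++)
            result.add(String.join(" ", tokens.subList(i, i + n)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(singularForm("ontologies"));    // ontology
        System.out.println(substringsOfLength(
                List.of("semantic", "web", "search"), 2)); // [semantic web, web search]
    }
}
```

As in the thesis code, these rules are heuristic: irregular plurals such as "children" are not handled.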




CHAPTER 10. ANNEX C.

THE XML FILE CONTAINING THE LECTURES (JUST ONE LECTURE).

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="content_to_rdf.xsl"?>

<vorlesung>


<vorlesungsstunde >
<id>semweb_19_04_2004</id>
<date>Vorlesung vom 19.4.2004</date>
<title> Semantic Web: Einführung und Übersicht </title>
<keywords> Semantic Web, Semantic Web Tower, W3C, Semantic Web
Architecture, Introduction to Markup-Languages </keywords>
<script>
<link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"
      xlink:href="http://www.kbs.uni-hannover.de/~henze/semweb04/skript/slides/19_04_2004.pdf"
      xlink:title="Slides (pdf)">
</link>
</script>

<readings>
<item> Overview on Semantic Web Idea:
      <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:show="new"
            xlink:href="http://www.w3.org/DesignIssues/Semantic.html"
            xlink:title="Semantic Web Road map">
      </link>
</item>

<item> Article in Scientific America, May 2001: The Semantic Web: A
new form of Web content that is meaningful to computers will unleash a
revolution of new possibilities. By Tim Berners-Lee, James Hendler and Ora Lassila:
      <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:show="new"
            xlink:href="http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21"
            xlink:title="Online Version">
      </link>
</item>

<item> SELFHTML: Kapitel "Einführung in Internet und WWW":
      <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:show="new"
            xlink:href="http://selfhtml.teamone.de/intro/internet/index.htm"
            xlink:title="Online Version">
      </link>
</item>

<item>
      <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:show="new"
            xlink:href="http://www.w3.org/"
            xlink:title="The World Wide Web Consortium: W3C">
      </link>
</item>

<item>
      <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple" xlink:show="new"
            xlink:href="http://www.w3.org/2001/sw/"
            xlink:title="Semantic Web Activity Group at W3C">
      </link>
</item>

</readings>
</vorlesungsstunde>


</vorlesung>




THE XSL FILE USED TO CONVERT THE XML FILE, CONTAINING THE LECTURE, INTO RDF.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:ontopub="http://www.example.org/ontopub"
        xmlns:xlink="http://www.w3.org/1999/xlink">
<xsl:output method="xml" indent = "yes"/>

<xsl:template match="/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
      xmlns:ontopub="http://www.example.org/ontopub"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
      <xsl:apply-templates select="vorlesung"/>
</rdf:RDF>
</xsl:template>

<!-- VORLESUNG -->
<xsl:template match = "vorlesung">
      <xsl:apply-templates select="vorlesungsstunde"/>
</xsl:template>


<!-- LINK -->
<xsl:template match="link">
      <xsl:variable name="theRef" select="@xlink:href" />
      <xsl:variable name="theTitle" select="@xlink:title" />
      <rdf:li rdf:resource="{$theRef}" rdfs:label="{$theTitle}" />
</xsl:template>


<!-- ITEM -->
<xsl:template match = "item" priority = "1">
      <xsl:apply-templates select = "link"/>
</xsl:template>

<!-- VORLESUNGSSTUNDE -->
<xsl:template match = "vorlesungsstunde">
<xsl:variable name="theChildID" select="child::id" />

<rdf:Description rdf:ID="{$theChildID}">
      <ontopub:date>
            <xsl:value-of select="child::date"/>
      </ontopub:date>
      <dc:title>
            <xsl:value-of select="child::title"/>
      </dc:title>
      <ontopub:keywords>
            <xsl:value-of select="child::keywords"/>
      </ontopub:keywords>
      <ontopub:script>
            <rdf:Bag>
                  <xsl:for-each select="script">
                  <xsl:apply-templates/>
                  </xsl:for-each>
            </rdf:Bag>
      </ontopub:script>
      <ontopub:readings>
            <rdf:Bag>
                  <xsl:for-each select = "readings">
                  <xsl:apply-templates select="item"/>
                  </xsl:for-each>
            </rdf:Bag>
      </ontopub:readings>
</rdf:Description>
</xsl:template>

</xsl:stylesheet>
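For illustration, applying this stylesheet to the sample lecture above should yield RDF/XML roughly of the following shape (hand-derived and abbreviated here, not actual program output):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
      xmlns:ontopub="http://www.example.org/ontopub"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <rdf:Description rdf:ID="semweb_19_04_2004">
    <ontopub:date>Vorlesung vom 19.4.2004</ontopub:date>
    <dc:title>Semantic Web: Einführung und Übersicht</dc:title>
    <ontopub:keywords>Semantic Web, Semantic Web Tower, W3C, ...</ontopub:keywords>
    <ontopub:script>
      <rdf:Bag>
        <rdf:li rdf:resource="http://www.kbs.uni-hannover.de/..."
              rdfs:label="Slides (pdf)"/>
      </rdf:Bag>
    </ontopub:script>
    <ontopub:readings>
      <rdf:Bag>
        <rdf:li rdf:resource="http://www.w3.org/DesignIssues/Semantic.html"
              rdfs:label="Semantic Web Road map"/>
        <!-- ... one rdf:li per reading item ... -->
      </rdf:Bag>
    </ontopub:readings>
  </rdf:Description>
</rdf:RDF>
```

This is the RDF input that the Java class above iterates over with Jena, matching the "keywords" and "title" predicates.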



				