
									                            SWING
Semantic Web-Service Interoperability for Geospatial
               Decision Making
                               FP6-26514




                D7.7 Edited Book Manuscript
                            Draft Paper


     Date:           2008-12-18
     Editor(s):      Sven Schade (UOM)
     Distribution:   All partners
     WP:             7
     Version:        0.7
     Keywords:
     Description:
                                Document Metadata
Quality Assurors and Contributors:

             Quality Assuror(s):     Name1, Name2
             Contributor(s):         all


Version History:

             Version number          Description
             0.1                     Draft structure
             0.2                     Partner abstracts added
             0.3                     Parts from Jörg Hoffmann added
             0.4                     Chapter 5: details added
             0.5                     Other chapters: details added, Ch. 5: revised
             0.6                     Chapter 9: details added
             0.7                     Chapter 1 details added and chapter 2 revised

             New ones:               1st of Oct., Nov., Dec., Jan., Feb.
                                 Executive Summary
This book presents and reflects the outcomes of the EU-funded academic project Semantic Web
services INteroperability for Geospatial decision making (SWING, IST-FP6-26514). The main objectives of
the project were to develop an open, easy-to-use Semantic Web Service framework of suitable ontologies
and inference tools for the annotation, discovery, composition, and invocation of geospatial web services,
and to evaluate the appropriateness of this framework in the context of geospatial decision-making.
Conclusions on current potentials and directions for future research are elaborated.
                                                                            Content

1       INFORMATION SOCIETY AND GEOSPATIAL DECISION MAKING - JOEL LANGLOIS, BRGM......... 7
    1.1         THE EUROPEAN UNION CONTEXT ............................................................................................................................ 7
    1.2         THE E-GOVERNANCE IN FRANCE .............................................................................................................................. 8
    1.3         THE ROLE OF BRGM AS A PUBLIC ESTABLISHMENT ................................................................................................ 8
    1.4         RAW MATERIALS SUPPLY AND GEOSPATIAL DECISION MAKING .............................................................................. 9
2   THE STATE OF THE ART IN SEMANTIC AND GEOSPATIAL TECHNOLOGIES - LAURENTIU
VASILIU, NUIG / PHILIPPE DUCHESNE, IONIC........................................................................................................ 12
    2.1      OVERVIEW OF SEMANTIC TECHNOLOGIES ............................................................................................................. 12
    2.2      CORE SEMANTIC TECHNOLOGIES OF INTEREST FOR SWING PROJECT .................................................................. 13
       2.2.1 XML/RDF/RDF schema.................................................................................................................................... 13
       2.2.2 Semantic Web Services ..................................................................................................................................... 14
    2.3      GEOSPATIAL TECHNOLOGIES ................................................................................................................................. 19
       2.3.1 Overview............................................................................................................................................................ 20
       2.3.2 Open Geospatial Consortium (OGC) ............................................................................................................... 20
       2.3.3 OGC Geospatial Web Services......................................................................................................................... 20
       2.3.3.1   Data and processing services....................................................................................................................... 20
       2.3.3.2   Catalogue service ......................................................................................................................................... 21
       2.3.4 Open Issues ....................................................................................................................................................... 24
3  COMBINING SWS AND GEOSPATIAL SERVICES: POTENTIAL BENEFITS, AND TECHNICAL
CHALLENGES -JOERG HOFFMANN, DUMITRU ROMAN, LFUI ........................................................................ 27
    3.1     WHICH LOGICS? ..................................................................................................................................................... 27
       3.1.1 Spatial Constraints and Operations ................................................................................................................. 28
       3.1.2 Web Feature Services ....................................................................................................................................... 29
       3.1.3 Web Processing Services .................................................................................................................................. 30
       3.1.4 Decisions for ontology formalisation ............................................................................................................... 32
    3.2     DISCOVERY ............................................................................................................................................................ 35
       3.2.1 Discovery based on Semantic Annotation ........................................................................................................ 36
       3.2.2 WFS, WPS, WMS .............................................................................................................................................. 37
       3.2.3 Integration with Catalogue Services ................................................................................................................ 37
    3.3     COMPOSITION AND EXECUTION ............................................................................................................................. 38
       3.3.1 Semantic Execution........................................................................................................................................... 38
       3.3.2 The Role of Ontologies for Execution .............................................................................................................. 39
       3.3.3 Data Mediation ................................................................................................................................................. 39
       3.3.4 Automatic Composition..................................................................................................................................... 41
4       SWING ARCHITECTURE OVERVIEW – JOERG HOFFMANN, LFUI/ARNE J BERRE, SINTEF .......... 42

5       ONTOLOGICAL BACKBONE – SVEN SCHADE, UOM .................................................................................... 45
    5.1     ONTOLOGIES IN SWING ........................................................................................................................................ 45
       5.1.1 Methodology for Ontology Engineering and Representation Language......................................................... 45
       5.1.2 Ontology Development Process........................................................................................................................ 45
    5.2     KNOWLEDGE ACQUISITION STRATEGY .................................................................................................................. 46
       5.2.1 Types of SWING Ontologies ............................................................................................................................. 46
       5.2.2 Knowledge Acquisition in SWING.................................................................................................................... 47
       5.2.3 Knowledge Acquisition from ISO and OGC Specifications............................................................................. 48
       5.2.4 Knowledge Acquisition with Domain Experts.................................................................................................. 52
       5.2.5 Knowledge Acquisition from External Sources................................................................................................ 56
    5.3     DESIGN DECISIONS FOR DOMAIN ONTOLOGIES ..................................................................................................... 56
    5.4     ONTOLOGY EVALUATION STRATEGY .................................................................................................................... 56
       5.4.1 “Quality” of an Ontology ................................................................................................................................. 57
       5.4.2 Assessing Quality of Ontology Parameters...................................................................................................... 57
       5.4.3 The SWING Ontology Validation Strategy ...................................................................................................... 59
    5.5     MAINTENANCE STRATEGY ..................................................................................................................................... 62
        5.5.1       The SWING Ontology Maintenance Strategy................................................................................................... 62
6       SEMANTIC ANNOTATION AND DISCOVERY APPROACH – STI, DERI, JSI, UOM................................ 72
    6.1     SEMANTIC ANNOTATION OF WFS ......................................................................................................................... 72
       6.1.1 Decisions for Formalising Annotations............................................................................................................ 72
    6.2     SEMANTIC ANNOTATION AND DISCOVERY FOR WPS ........................................................................................... 77
    6.3     APPLICATION SCENARIOS: SWING ONTOLOGIES FOR ANNOTATION, DISCOVERY .............................................. 81
       6.3.1 Mediation .......................................................................................................................................................... 87
7       SEMANTIC ANNOTATION ENGINE - MIHA GRCAR, JSI.............................................................................. 88
    7.1     EMPLOYING TEXT MINING FOR SEMANTIC ANNOTATION ..................................................................................... 90
       7.1.1 Concept Similarity by Comparison of Documents ........................................................................................... 90
       7.1.2 Google Definitions and (Contextualized) Search Results................................................................................ 92
       7.1.3 Hypotheses Checking by Using Linguistic Patterns ........................................................................................ 93
       7.1.4 Google Distance................................................................................................................................................ 94
    7.2     TERM MATCHING: A BUILDING BLOCK FOR AUTOMATING THE ANNOTATION PROCESS ..................................... 95
    7.3     EVALUATION OF TERM MATCHING TECHNIQUES .................................................................................................. 99
       7.3.1 Preliminary evaluation ..................................................................................................................................... 99
       7.3.2 Large-scale evaluation ................................................................................................................................... 100
    7.4     CROSS-LANGUAGE TERM MATCHING .................................................................................................................. 105
       7.4.1 Different Applications of Machine Translation in SWING............................................................................ 105
    7.5     AUTOMATING THE ANNOTATION PROCESS.......................................................................................................... 107
       7.5.1 Labels and Groundings in Resource Description Framework (RDF)........................................................... 108
       7.5.2 Training a Classifier ....................................................................................................................................... 108
       7.5.3 Incorporating Ontology Structure .................................................................................................................. 109
    7.6     EVALUATION OF AUTOMATIC ANNOTATION METHODS ...................................................................................... 109
       7.6.1 Golden Standard ............................................................................................................................................. 109
       7.6.2 Evaluation Metric ........................................................................................................................................... 109
       7.6.3 Experimental Results ...................................................................................................................................... 110
       7.6.4 Conclusions ..................................................................................................................................................... 110
    7.7     VISUAL ONTOBRIDGE .......................................................................................................................................... 110
    7.8     GEO-DATA MINING .............................................................................................................................................. 110
    7.9     CONCLUSIONS AND LESSONS LEARNED............................................................................................................... 110
8       GEOSPATIAL CATALOGUE, SEMANTIC DISCOVERY - PHILIPPE DUCHESNE, ERDAS ................. 111
    8.1     THE NEED FOR SEMANTIC METADATA / SEMANTIC TAGGING .............................................................................. 111
    8.2     IMPORTING ONTOLOGIES ..................................................................................................................................... 111
       8.2.1 The needs......................................................................................................................................................... 111
       8.2.2 Using the OASIS ebRIM/OWL profile............................................................................................................ 113
       8.2.3 Using a third-party inference engine.............................................................................................................. 114
       8.2.4 Extending to WSML ........................................................................................................................................ 115
       8.2.5 Standardization ............................................................................................................................................... 115
    8.3     STORING ANNOTATIONS IN THE CATALOGUE ...................................................................................................... 115
       8.3.1 Annotation object model ................................................................................................................................. 115
       8.3.2 Harvesting annotations................................................................................................................................... 116
       8.3.3 Synchronizing with WSMX.............................................................................................................................. 117
    8.4     INTERFACING A CS/W CATALOGUE WITH THE WSMX PLATFORM .................................................................... 117
    8.5     QUERY LANGUAGE EXTENSIONS.......................................................................................................................... 117
9       GEOSPATIAL SERVICE COMPOSITION AND EXECUTION -ANDREAS LIMYR, SINTEF................. 118
    9.1     MODELLING SERVICE COMPOSITIONS ................................................................................................................. 118
    9.2     ABSTRACTING GEOSPATIAL ASPECTS ................................................................................................................. 121
       9.2.1 A brief introduction to OGC web services ..................................................................................................... 121
    9.3     MODELLING GEOSPATIAL SERVICE COMPOSITIONS ............................................................................................ 121
    9.4     EXECUTION OF SERVICE COMPOSITIONS ............................................................................................................. 123
       9.4.1 Namespace References and Web Service Header .......................................................................................... 124
       9.4.2 Input and output parameters .......................................................................................................................... 124
       9.4.3 Transition Rules .............................................................................................................................................. 125
    9.5     EXECUTION OF GEOSPATIAL SERVICE COMPOSITIONS ........................................................................................ 126
       9.5.1 Import of Web Feature Service....................................................................................................................... 127
       9.5.2 Web Feature Service filter expressions .......................................................................................................... 127
    9.6     THE SWING EXECUTION LIFE CYCLE .................................................................................................................. 129
       9.6.1 Publication ...................................................................................................................................................... 129
        9.6.2 Discovery......................................................................................................................................................... 130
        9.6.3 Invocation........................................................................................................................................................ 130
     9.7     METADATA MODEL .............................................................................................................................................. 130
10    SWING DEMONSTRATION, WALKTHROUGH THE SWING USE CASE - ARNE J BERRE,
SINTEF, MARC URVOIS, BRGM................................................................................................................................... 131
     10.1        THE END-USER CONTEXT...................................................................................................................................... 131
     10.2        THE SWING END USER APPLICATION PRESENTATION ......................................................................................... 131
     10.3        THE USE CASE/SHOW CASE ................................................................................................................................... 131
     10.4        EXPERIENCE FEEDBACK ....................................................................................................................................... 131
11           REVIEW AND OUTLOOK - SVEN SCHADE, UOM ..................................................................................... 132
1      Information Society and Geospatial Decision Making - Joel Langlois,
       BRGM

In this part of the book, we provide the background on geospatial decision-making applications. Starting
from the EU context, we present the different frameworks that push towards geospatial concepts. We briefly
present e-governance in France and the role of BRGM, as a public establishment, in France. We then focus
on the socio-economic management of aggregates, the application theme selected for the SWING project,
which leads us to the use case presentation that is further elaborated in Chapter 10.

1.1      The European Union Context

The strategy adopted by the EU in Lisbon in 2000 calls for an accelerated transition to a competitive and
dynamic knowledge economy capable of sustainable growth, with more and better jobs and greater social
cohesion. This requires wider adoption, broader availability and an extension of Information Society
Technologies (IST) applications and services in all economic and public sectors and in society as a whole.
IST are the key underlying technologies for easier and more efficient knowledge creation, sharing and
exploitation.

In this context, SWING (Semantic Web services INteroperability for Geospatial decision making) is a
project of the IST Program for Research, Technology Development & Demonstration under the Sixth
Framework Programme of the European Commission.

It aims at removing technological barriers in the field of semantics on the Web. Such projects in the field of
knowledge sharing through the Internet are well in line with European frameworks such as INSPIRE and
GMES, which make the Information Society operational.

The INSPIRE1 directive (INfrastructure for SPatial InfoRmation in Europe), approved by the Council of
Ministers of the European Union and the European Parliament and published in the Official Journal of the
European Communities on 25 April 2007, entered into force on 15 May 2007. It aims to promote the
production and exchange of the data required for various EU policies in the environmental field, understood
in a broad sense. The transposition of the text into national law will be completed by 17 May 2009. The
directive is divided into three complementary parts: the obligations created, the data concerned, and the
actors involved.

From another perspective, GMES2 (Global Monitoring for Environment and Security), now named
Kopernikus, is a joint initiative of the European Space Agency (ESA) and the European Union in the field of
environment and security. It is the European response to GEOSS (Global Earth Observation System of
Systems), which emerged from the Earth Observation Summit and the GEO working groups, whose main
leaders are the European Union, the United States, Japan and South Africa.

To succeed, such initiatives naturally require the setting up of standards, particularly in the field of digital
geographic information. In this respect, ISO Technical Committee 2113 aims at establishing a framework of
standards related to information on objects or phenomena that are directly or indirectly associated with a
location on Earth. These standards may specify, for geographic information, methods, tools and services for
data management (including their definition and description), acquisition, processing, analysis, access,
presentation and electronic transfer between different users, systems and sites.

The OGC4 (Open Geospatial Consortium) is an international non-profit organisation dedicated to the
emergence of standards and development of open systems in geomatics. It was founded in 1994 to address


1
    Detailed information is available from http://inspire.jrc.ec.europa.eu/ and http://inspire.brgm.fr.
2
    Official Web page available from: http://www.gmes.info.
3
    Official Web page available from: http://www.isotc211.org/.
4
    Official Web page available from: http://www.opengeospatial.org/.
the problems of non-interoperability of applications for geographic information. These initiatives, both in
the domain of law and of standardisation, give great perspectives to geospatial decision making procedures
and e-governance.

1.2      The e-governance in France

The implementation of INSPIRE will bring major changes in the field of geographic information and will
definitely impact a large number of public authorities and stakeholders. The directive entered into force on
15 May 2007 and its transposition into French law will be completed by 17 May 2009.

Presently in France, various efforts are being made towards the information society, bringing the full
potential of the Internet to serve citizen information, decision-making mechanisms, and governance.

Among them, the Direction Générale de la Modernisation de l’État (DGME) is a directorate under the
Ministry of Budget, Public Accounts and the Public Service, within the French public administration. It is
responsible for implementing the administrative reform of the State, in association with the concerned
ministries. The Référentiel Général d'Interopérabilité (RGI) is a project within the French government, led
by the DGME, aiming to standardise electronic data exchanges.

Moreover, l’Agence pour le Développement de l’Administration Electronique (ADAE) is a French
government agency attached to the Prime Minister. Its main mission is to foster the development of the use
of information systems by public administrations.

These two organisations complement the CNIG (Conseil National de l’Information Géographique), created
20 years ago to advise the government on all issues relating to the geographic information sector. It also
helps to stimulate the sector’s development.

Naturally, the setting up of operational e-governance in France relies on the strong involvement of a large
number of public institutions throughout the country. In this context, BRGM5 is deeply involved in the
design and development of national infrastructures. It also actively contributes to standardisation working
groups at the French, European, and international levels.

1.3      The role of BRGM as a public establishment

BRGM is highly involved in this global movement towards the availability of information and knowledge,
especially in the field of geographic information. BRGM is therefore very active in the INSPIRE work
programme as a Legally Mandated Organisation and participates in several working groups for the setting
up of reference guides.

BRGM is also a member of the OGC and participates in the specification of geospatial information
exchange and geospatial interoperability.

At the national scale, BRGM is the major player in the implementation of initiatives such as the
GeoCatalogue6, which aims to reference national public data. Information in the GeoCatalogue is intended
for state services, public institutions, local authorities and other public organisations.

From a functional point of view, BRGM’s main business consists in delivering information about the
subsurface for decision making. Its clients (public authorities and agencies, private consultants and
industries, citizens) expect not only scientific information, but information that can be used by non-
specialists and integrated into a decision-making process. BRGM, like other national geological surveys, has
therefore moved from a pure scientific and research model to an information agency model. While the basic
scientific information is delivered for free, financial returns are expected from added-value services,
produced using BRGM data and software tools, but also integrating external components.



5
    Official Web page available from: http://www.brgm.fr.
6
    Official Web page available from: http://www.geocatalogue.fr.
BRGM has been involved in research and development activities concerning mineral resources assessment
for several decades. The organisation has been able to adapt its initial calling and move away from its
traditional role of aid in mineral deposit development, towards a more global policy of mineral resource
management. This approach addresses the needs of society in general, within a framework of global
development-aid policies, and no longer those of a particular industrial sector.

BRGM has taken this conceptual turn by facing up to the concerns associated with sustainable development,
social policies and geopolitics. The expected results of SWING are therefore of great importance to support
the strategic change of BRGM’s business, to allow the delivery of more advanced services, and to make
them more visible and accessible to a wide range of potential users.

SWING will enhance BRGM's leading position as a governmental web service provider and will help to
stimulate other public data providers to prepare an interoperable information environment that supports
producing, using, and integrating spatial and non-spatial information for decision makers and the public.


1.4     Raw materials supply and geospatial decision making

#…#

1.4.1    Raw material supply in the European Union

Raw materials are essential for the sustainable functioning of modern societies. They are part of both high
tech products and every-day consumer products. Access to and affordability of mineral raw materials are
crucial for the sound functioning of the EU's economy.

European industry needs fair access to raw materials both from within and outside the EU. For certain high
tech metals, the EU has a high import dependency and access to these raw materials is getting increasingly
difficult.

The EU is self-sufficient in a rather wide variety of raw materials, such as construction minerals, in
particular aggregates. It is also a major world producer of natural stone, gypsum, feldspar, perlite, kaolin
and salt.

With a view to securing and improving access to raw materials for EU industry, the European Commission
launched in November 2008 a new integrated strategy which sets out targeted measures.

The present context in which this multinational strategy emerges is detailed as follows by the Commission
(cf. Communication from the Commission: The Raw Materials Initiative – Meeting our Critical Needs for
Growth and Jobs in Europe).

The EU has many raw material deposits. However, their exploration and extraction are facing increased
competition for different land uses and a highly regulated environment, as well as technological limitations
in access to mineral deposits. At the same time, a significant opportunity exists for securing material
supplies by improving resource efficiency and recycling (…).

As long term market prospects will create conditions that are favourable to new mining and recycling
projects all over the world, it is important for the EU not to miss the opportunity to make more of its
domestic capacities or develop substitutes (…).

The sustainable supply of raw materials based in the EU requires that the knowledge base of mineral
deposits within the EU be improved. In addition, long-term access to these deposits should be taken into
account in land use planning. Therefore the Commission recommends that the national geological surveys
become more actively involved in land use planning within the Member States (…).

Moreover, the Commission recommends better networking between the national geological surveys to
facilitate the exchange of information and improve the interoperability of data and their dissemination, with
particular attention to the needs of SMEs. Additionally, the Commission, in conjunction with Member States,
will look into developing a medium to long term strategy for integrating sub-surface components into the
Land service of Kopernikus, which can feed into land-use planning and improve its quality.
In this respect, BRGM, being the French geological survey, plays a major role as a national public
institution in France. Moreover, EuroGeoSurveys7, the association of the European geological surveys, is
also active in this area at the European level; not only the sustainable supply of raw materials is addressed,
but environmental protection issues are also taken into account.

Most of the legislation at EU level relevant to the non-energy extractive industry is horizontal. The
implementation of the Natura 20008 legislation is of particular relevance for the extractive industry. During
the public consultation industry raised concerns about sometimes competing objectives between the
protection of Natura 2000 areas and the development of extractive activities in Europe. Whereas the
Commission stresses that there is no absolute exclusion of extractive operations within the Natura 2000 legal
framework, the Commission and Member States have committed themselves to developing guidelines for
industry and authorities in order to clarify how extraction activities in or near Natura 2000 areas can be
reconciled with environmental protection. The guidelines are expected to be finalised by the end of 2008 and
will be based on available best practices (…).

To tackle the technological challenges related to sustainable mineral production, the Commission promotes
research projects in its 6th and 7th Framework Programmes, on both thematic and technical aspects. Not
only is the discovery of deeply located ore deposits addressed, but also geospatial decision making when
combining raw material resources with land use constraints.


1.4.2     Geospatial decision-making

Land use planning and the sustainable development of natural resources involve multiple stakeholders,
different disciplines and various sources of information. Examples of land management projects are the
spatial planning of large-scale infrastructure projects, such as a new airport or a high-speed railway, or the
exploitation of underground resources, such as establishing a quarry or a mine. Such projects involve not
only an assessment of the future infrastructure network before the initiation of the project's construction, but
also multi-stakeholder considerations about the spatial occupation of the infrastructure itself (sterilising the
land for any other surface occupation activity), about effects on neighbouring land use, environmental load,
energy consumption, economic revenues, employment, and resource availability for constructing the
project. Depending on the project, many viewpoints of different stakeholders may need to be taken into
account.

The management of natural underground resources is at the heart of BRGM's activities, driven by its
mission to administer, organise and provide information on the subsurface of France's national territory, and
by its knowledge and expertise in national and international mineral resource management under a
sustainable development scheme. Mining and quarrying activities supply mineral resources, such as metals,
minerals, sand and gravel, to prepare infrastructure foundations and to provide steel, alloys, concrete and
other materials for building the infrastructure. The sources of these mineral raw materials have a
heterogeneous spatial distribution determined by the subsurface geology and subsequent landscape
evolution.

Access to these resources is constrained by important economic and legal factors and by strong public
opinion, involving economic, environmental and socio-economic points of view. Mines and quarries have
difficulty meeting these requirements well before they can develop the extraction of the mineral resources.
Subsequently, they have difficulty providing positive elements to public opinion and finding the arguments
that allow continued activity and possible extension of the extraction site. Landscape modifications and
environmental impacts, such as pollution of soil, water and air, noise, dust, and vibrations during the lifetime
of a mine or quarry, also often leave a legacy after the mine closure. Conversely, the import and long-
distance transport of mineral resources transfer the problem of exploitation to other locations in the world
and introduce an additional economic and environmental load by imposing the adaptation of infrastructures
(requiring more material) and large volumes of material to be transported by boat, train and truck from the
production sites to the consumption areas.


7
    Official Web page available from: http://www.eurogeosurveys.org/.
8
    Official Web page available from: http://www.natura2000.fr/.

Because of the variety of information to be taken into account, decision-making processes in most cases
start with a phase of data collection and organisation of the geospatial information. For the French territory,
BRGM has estimated the cost of an initial inventory of aggregate deposits (sand, gravel, etc., for
construction and infrastructure development) at approximately 12 million Euros. Once the geological
knowledge of the aggregate resources is well understood, the information can be combined with information
on other land uses (forestry, agriculture, natural heritage sites, urbanised areas, stream and transportation
networks, etc.), requiring an additional multi-million Euro investment and the integration of information
and expertise from other disciplines.

This data preparation is a time-consuming, very expensive, and rather unrewarding phase of work. The
targeted technological development in SWING helps to:

    •   enhance the efficiency of data collection, because existing data sets can be discovered,

    •   sustain the availability of acquired information through the semantic web services,

    •   semi-automate data integration and analyses, saving time and money while improving the quality,
        significance and coherence of the assessments.

Even though the pilot application focuses on natural resource exploitation, it is developed with a more
generic application in mind, in which a user wants to assess the impact of doing X at location Y from
different viewpoints.

In addition to the technological development in SWING, which addresses decision making for any
multidisciplinary, multi-source and multi-stakeholder exercise, the selected use case, applied to the siting of
aggregate quarries, also brings an important contribution to Europe's natural resource policies and will help
to deliver unbiased information on future decisions and their consequences for managing the opening,
maintenance, extension, and/or closure of a mine or quarry.
2     The State of the Art in Semantic and Geospatial Technologies - Laurentiu
      Vasiliu, NUIG / Philippe Duchesne, IONIC

This chapter provides an introduction to semantic and geospatial technologies. In Sections 2.1 and 2.2 we
first provide an insight into basic technologies such as XML, RDF and RDF Schema, and then a view into
more advanced technologies such as OWL and WSMO. In Section 2.3 we provide an overview of geospatial
technologies, covering in particular the most relevant kinds of geospatial Web services.

2.1    Overview of Semantic Technologies

#It needs to be refined further and text improved. (Laurentiu)#

The ‘Semantic Web’ concept was coined by Tim Berners-Lee in a keynote session at XML 2000. In the
context of the Semantic Web, ‘machine processable’ was the intended meaning of ‘semantic’ as applied to
the Web domain. When operating on data, the semantics attached to it instruct the machine what to do with
that specific data and also convey what the data means. Thus a whole new direction started with and from
XML towards building a declarative environment [Berners Lee, 2000].

The next pages therefore introduce, at a high level, the family of languages that have evolved since 2000,
when the conceptual basis of the Semantic Web was laid: XML, RDF, RDF Schema, OWL [McGuinness,
D., van Harmelen, F., 2004], OWL-S [Martin, D., Burstein, M. et al, 2004] and WSMO/WSML [Roman,
D., Lausen, H. et al, 2006].

In accordance with the Semantic Web stack (W3C) presented in the figure below, there is a layering of
technologies that can be traced bottom-up from XML through RDF, SPARQL and OWL to WSMO, with
the logic layer and the final layers on top. While an evolutionary path can be followed from XML towards
RDF and OWL, the WSML language was created independently of the previous languages, with the goal of
overcoming some of their limitations. Description logic [Baader, F., Calvanese, D. et al.], [Description
Logic] is used in the OWL family of languages, while description logics, first-order logic [Wiedijk, F. and
Zwanenburg, J., 2003] and logic programming [Lloyd, J. W., 1987] are used for the WSML family of
languages.




                               Figure 1: Semantic Stack [Semantic Stack, 2007].
2.2     Core semantic technologies of interest for SWING project

In this sub-section we briefly introduce the core semantic technologies that have been reviewed and
considered for use in the SWING project. Out of the semantic stack presented in Figure 1 above, the focus
is on XML, RDF, OWL and WSML.

2.2.1     XML/RDF/RDF schema

We now introduce the languages that are at the foundation of the Semantic Web, from the point of view of
their historic evolution:

XML (eXtensible Markup Language)

XML was designed to describe various types of data, focusing on what the data is rather than on how it is
presented. XML describes a class of data objects called XML documents and partially describes the
behavior of computer programs which process them. According to [XML], XML is an application profile or
restricted form of SGML, the Standard Generalized Markup Language [ISO 8879]. By design, XML
documents are conforming SGML documents. XML documents are made up of storage units called entities,
which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form
character data and some of which form markup [XML].

XML was developed by an XML Working Group formed under the World Wide Web Consortium
(W3C) in 1996, chaired by Jon Bosak of Sun Microsystems with the active participation of an XML
Special Interest Group. The design goals for XML are [XML]:

      1. XML shall be straightforwardly usable over the Internet.

      2. XML shall support a wide variety of applications.

      3. XML shall be compatible with SGML.

      4. It shall be easy to write programs which process XML documents.

      5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.

      6. XML documents should be human-legible and reasonably clear.

      7. The XML design should be prepared quickly.

      8. The design of XML shall be formal and concise.

      9. XML documents shall be easy to create.

      10. Terseness in XML markup is of minimal importance.

A very simple XML example is presented below:

          <?xml version="1.0" encoding="ISO-8859-1"?>
              <note>
                  <to>John</to>
                  <from>Paul</from>
                  <heading>Warning</heading>
                   <body>Your payment is overdue!</body>
              </note>


RDF (Resource Description Framework)

The Resource Description Framework (RDF) is a language designed to represent information about
resources in the World Wide Web, in particular metadata about Web resources. RDF can also be used to
represent information about things that can be identified on the Web, even when they cannot be directly
retrieved from the Web. Examples include information about items available from on-line shopping
facilities (e.g., information about specifications, prices, etc.) or the description of a Web user's preferences
for information delivery [RDF primer, 2004].

RDF is intended for situations in which the information needs to be processed by applications, rather than
only being displayed for human consumption. RDF provides a common framework for expressing this
information, so application designers can leverage the availability of common RDF parsers and processing
tools. The ability to exchange information between different applications means that the information may
be made available to applications other than those for which it was originally created [RDF primer, 2004].

RDF is built upon the idea of identifying entities using Web identifiers (called Uniform Resource Identifiers,
or URIs) and describing resources in terms of simple properties and property values. This enables RDF to
represent simple statements about resources as a graph of nodes and arcs representing the resources, their
properties and values. As in the example presented in the [RDF primer, 2004], the group of statements
"there is a Person identified by http://www.w3.org/People/EM/contact#me, whose name is XYZ, whose
email address is em@w3.org, and whose title is Dr." could be represented in RDF/XML as follows:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>XYZ</contact:fullName>
    <contact:mailbox rdf:resource="mailto:em@w3.org"/>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>
</rdf:RDF>


Like HTML, this RDF/XML is machine processable and, using URIs, can link pieces of information across
the Web. However, unlike conventional hypertext, RDF URIs can refer to any identifiable thing, including
things that may not be directly retrievable on the Web; RDF can also describe objects, persons, events,
news, etc. In addition, RDF properties themselves have URIs, which precisely identify the relationships that
exist between the linked items.

RDFS – RDF Schema

RDF Schema (RDFS) is an extensible language that provides the basic elements for describing RDF
vocabularies, intended to structure RDF resources. The final RDFS recommendation [RDFs
recommendation, 2004] was released in 2004, and the main RDFS components are included in the much
more expressive OWL language, presented in the next sub-section. Details can be found in the final RDFS
recommendation.
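
To make these basic elements concrete, the following minimal RDF/XML sketch declares a small class
hierarchy and a constrained property; the example.org vocabulary and its terms are invented for this
illustration and do not stem from the SWING ontologies.

        <?xml version="1.0"?>
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
          <!-- A class hierarchy: every Quarry is a kind of ExtractionSite -->
          <rdfs:Class rdf:about="http://example.org/geo#ExtractionSite"/>
          <rdfs:Class rdf:about="http://example.org/geo#Quarry">
            <rdfs:subClassOf rdf:resource="http://example.org/geo#ExtractionSite"/>
          </rdfs:Class>
          <!-- A property whose subjects are extraction sites and whose values are strings -->
          <rdf:Property rdf:about="http://example.org/geo#siteName">
            <rdfs:domain rdf:resource="http://example.org/geo#ExtractionSite"/>
            <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
          </rdf:Property>
        </rdf:RDF>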

2.2.2   Semantic Web Services

Semantic Web Services (SWS) technology aims to add sufficient semantics to the specifications and
implementations of Web Services to make possible the automatic integration of distributed autonomous
systems, with independently designed data and behaviour models. Defining data, behaviour and system
components in a machine understandable way using ontologies provides the basis for reducing the need for
humans to be in the loop for system integration processes. The application of semantics to Web Services can
be used to remove humans from the integration jigsaw and substitute them with machines. SWS will put in
place an automated process for machine driven dynamic discovery, selection, mediation, invocation and
inter-operation.

OWL/OWL-S

The Semantic Web is a vision for the future of the Web in which information is given explicit meaning,
making it easier for machines to automatically process and integrate information available on the Web. OWL
builds on XML's ability to define customized tagging schemes and RDF's flexible approach to representing
data. The first level above RDF required for the Semantic Web is an ontology language that can formally
describe the meaning of terminology used in Web documents. If machines are expected to perform useful
reasoning tasks on these documents, the language must go beyond the basic semantics of RDF Schema.
[McGuinness, D., van Harmelen, F., 2004].

OWL has been designed to meet this need for a Web Ontology Language. OWL is part of the growing stack
of W3C recommendations related to the Semantic Web [McGuinness, D., van Harmelen, F., 2004]:

    •   XML provides a surface syntax for structured documents, but imposes no semantic constraints on the
        meaning of these documents.

    •   XML Schema is a language for restricting the structure of XML documents and also extends XML
        with datatypes.

    •   RDF is a datamodel for objects ("resources") and the relations between them; it provides a simple
        semantics for this datamodel, which can be represented in an XML syntax.

    •   RDF Schema is a vocabulary for describing properties and classes of RDF resources, with a
        semantics for generalization-hierarchies of such properties and classes.

    •   OWL adds more vocabulary for describing properties and classes: among others, relations between
        classes (e.g. disjointness), cardinality (e.g. "exactly one"), equality, richer typing of properties,
        characteristics of properties (e.g. symmetry), and enumerated classes (see the sketch after this list).
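
The following RDF/XML sketch illustrates two of these additions, class disjointness and a cardinality
restriction; the example.org namespace and the class and property names are invented for this illustration.

        <?xml version="1.0"?>
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
                 xmlns:owl="http://www.w3.org/2002/07/owl#">
          <owl:Class rdf:about="http://example.org/geo#Mine">
            <!-- Disjointness: nothing can be both a Mine and a Quarry -->
            <owl:disjointWith rdf:resource="http://example.org/geo#Quarry"/>
          </owl:Class>
          <owl:Class rdf:about="http://example.org/geo#Quarry">
            <!-- Cardinality: every Quarry has exactly one operator -->
            <rdfs:subClassOf>
              <owl:Restriction>
                <owl:onProperty rdf:resource="http://example.org/geo#hasOperator"/>
                <owl:cardinality
                  rdf:datatype="http://www.w3.org/2001/XMLSchema#nonNegativeInteger">1</owl:cardinality>
              </owl:Restriction>
            </rdfs:subClassOf>
          </owl:Class>
          <owl:ObjectProperty rdf:about="http://example.org/geo#hasOperator"/>
        </rdf:RDF>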

OWL provides three increasingly expressive sublanguages designed for use by specific communities of
implementers and users [McGuinness, D., van Harmelen, F., 2004]:

    •   OWL Lite supports those users primarily needing a classification hierarchy and simple constraints.
        For example, while it supports cardinality constraints, it only permits cardinality values of 0 or 1. It
        should be simpler to provide tool support for OWL Lite than for its more expressive relatives, and
        OWL Lite provides a quick migration path for thesauri and other taxonomies. OWL Lite also has a
        lower formal complexity than OWL DL; see the section on OWL Lite in the OWL Reference for
        further details [McGuinness, D., van Harmelen, F., 2004].

    •   OWL DL supports those users who want the maximum expressiveness while retaining computational
        completeness (all conclusions are guaranteed to be computable) and decidability (all computations
        will finish in finite time). OWL DL includes all OWL language constructs, but they can be used only
        under certain restrictions (for example, while a class may be a subclass of many classes, a class
        cannot be an instance of another class). OWL DL is so named due to its correspondence with
        description logics, a field of research that has studied the logics that form the formal foundation of
        OWL [McGuinness, D., van Harmelen, F., 2004].

    •   OWL Full is meant for users who want maximum expressiveness and the syntactic freedom of RDF
        with no computational guarantees. For example, in OWL Full a class can be treated simultaneously
        as a collection of individuals and as an individual in its own right. OWL Full allows an ontology to
        augment the meaning of the pre-defined (RDF or OWL) vocabulary. It is unlikely that any reasoning
        software will be able to support complete reasoning for every feature of OWL Full [McGuinness,
        D., van Harmelen, F., 2004].

Ontology developers adopting OWL should consider which sublanguage is most suitable. The choice
between OWL Lite and OWL DL depends on the extent to which users require the more expressive
constructs provided by OWL DL. The choice between OWL DL and OWL Full mainly depends on the
extent to which users require the meta-modeling facilities of RDF Schema (e.g. defining classes of classes,
or attaching properties to classes). When using OWL Full as compared to OWL DL, reasoning support is
less predictable, since complete OWL Full implementations do not currently exist [McGuinness, D., van
Harmelen, F., 2004].

OWL Full can be viewed as an extension of RDF, while OWL Lite and OWL DL can be viewed as
extensions of a restricted view of RDF. Every OWL (Lite, DL, Full) document is an RDF document, and
every RDF document is an OWL Full document, but only some RDF documents are legal OWL Lite or
OWL DL documents. Because of this, some care has to be taken when a user wants to migrate an RDF
document to OWL. When the expressiveness of OWL DL or OWL Lite is deemed appropriate, some
precautions have to be taken to ensure that the original RDF document complies with the additional
constraints imposed by OWL DL and OWL Lite. Among others, every URI that is used as a class name must
be explicitly asserted to be of type owl:Class (and similarly for properties), every individual must be asserted
to belong to at least one class (even if only owl:Thing), and the URIs used for classes, properties and
individuals must be mutually disjoint [McGuinness, D., van Harmelen, F., 2004].
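
The hypothetical fragment below illustrates these precautions; the example.org names are again invented.
The class and property names are explicitly typed, and the single individual is asserted to belong to a class:

        <?xml version="1.0"?>
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:owl="http://www.w3.org/2002/07/owl#"
                 xmlns:geo="http://example.org/geo#">
          <!-- Class and property names explicitly typed, as OWL DL requires -->
          <owl:Class rdf:about="http://example.org/geo#Quarry"/>
          <owl:ObjectProperty rdf:about="http://example.org/geo#hasOperator"/>
          <!-- Every individual is asserted to belong to at least one class -->
          <geo:Quarry rdf:about="http://example.org/geo#exampleQuarry"/>
        </rdf:RDF>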

WSMO/WSML/WSMX

Next we provide an overview of the Web Service Modelling Ontology (WSMO), a fully-fledged framework
for Semantic Web Services (SWS). We also discuss and present the key technology related to the conceptual
framework of WSMO: the Web Service Execution Environment (WSMX), which is its reference
implementation.

WSMO, WSML and reasoners

The Web Service Modelling Ontology (WSMO) initiative (Roman et al 2005) provides a complete
framework enhancing the syntactic description of Web Services with semantic metadata. The WSMO
project is an ongoing research and development initiative aiming to define a complete framework for SWS
and consisting of three activities:

    •      WSMO (Web Service Modelling Ontology), which provides formal specification of concepts for
           Semantic Web Services,

    •      WSML (Web Services Modelling Language), which defines the language for representing WSMO
           concepts;

    •      WSMX (Web Service Execution Environment), which defines and provides a reference
           implementation allowing the execution of SWS.

As illustrated in Figure 2 below, there are four top level WSMO concepts: Ontologies, Goals, Web Services
and Mediators.

[Figure 2 depicts the four top-level WSMO concepts and their roles: Goals capture the objectives that a
client may have when consulting a Web Service; Ontologies provide the formally specified terminology
used by all other components; Web Services carry the semantic descriptions, namely a capability
(functional) and interfaces (usage); Mediators act as connectors between components, with mediation
facilities for handling heterogeneities.]

                                 Figure 2: WSMO Top Level Concepts [WSMO], [WSML].

As defined by Roman et al 2005 and Brodie et al 2006, Goals provide means to characterise user requests in
terms of functional and non-functional requirements. For the former, a standard notion of pre- and
postconditions has been chosen, while the latter are captured by a predefined ontology of generic properties.
Web service descriptions enrich this with an interface definition that defines service access patterns (the
service's choreography) as well as means to express services as composed from other services (its
orchestration). More concretely, a Web service presents (a WSML sketch follows the list below):

    •      a capability that is a functional description of a Web Service, describing constraints on the input and
           output of a service through the notions of preconditions, assumptions, post conditions, and effects;
    •   interfaces that specify how the service behaves in order to achieve its functionality. A service
        interface consists of a choreography that describes the interface for the client-service interaction
        required for service consumption, and an orchestration that describes how the functionality of a Web
        Service is achieved by aggregating other Web services.
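
The skeleton below sketches, in the WSML human-readable syntax, what such a capability description
might look like for a hypothetical quarry-siting service; all identifiers are invented for this illustration and
do not stem from an actual SWING service description.

        wsmlVariant _"http://www.wsmo.org/wsml/wsml-syntax/wsml-flight"
        namespace { _"http://example.org/sws#",
             geo _"http://example.org/geoOntology#" }

        webService QuarrySiteFinder
           capability QuarrySiteFinderCapability
              // constraint on the input: the request must describe a region of interest
              precondition
                 definedBy
                    ?request memberOf geo#SiteRequest.
              // constraint on the output: every result is a candidate extraction site
              postcondition
                 definedBy
                    ?site memberOf geo#CandidateSite.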

Ontologies provide a first and important means to achieve interoperability between goals and services as
well as between the various services themselves. By reusing standard terminologies, different elements can
be linked directly or indirectly via predefined mappings and alignments. Mediators provide additional
procedural elements to specify further mappings that cannot be captured directly through the usage of
Ontologies. Using Ontologies provides real-world semantics for the description elements as well as
machine-processable formal semantics through the formal language used to specify them. Mediation in
WSMO addresses the handling of heterogeneities occurring between elements that interoperate by resolving
mismatches between different terminologies (data level) and in the communicative behaviour between
services (protocol level) on the business process level. A WSMO Mediator connects the WSMO elements in
a loosely coupled manner and provides mediation facilities to resolve mismatches that might arise in the
process of connecting the different elements defined by WSMO. As a conceptual and reference model, the
WSMO framework provides the high-level concepts that are used in its reference implementation, WSMX,
introduced in the following section.

As described in [Web Service Modelling Language], WSML is based on different logical formalisms,
namely description logics, first-order logic and logic programming. The Web Service Modeling Language
(WSML) aims at providing the means to formally describe all the elements defined in WSMO. The
different variants of WSML correspond to different levels of logical expressiveness and to the use of
different language paradigms. WSML consists of a number of variants based on these logical formalisms:
WSML-Core, WSML-DL (description logic), WSML-Flight, WSML-Rule and WSML-Full. WSML-Core
corresponds to the intersection of Description Logic and Horn Logic. WSML is specified in terms of a
normative human-readable syntax. Besides the human-readable syntax, WSML also has an XML and an
RDF (Resource Description Framework) [RDF 2004] syntax for exchange over the Web and for
interoperation with RDF-based applications.

The logic constraints declared in WSML goals or capabilities are interpreted and reasoned upon within
WSMX by a dedicated component called the reasoner. Only a few reasoners are currently under
development, such as KAON2 [KAON2 Reasoner], MINS [MINS reasoner], and IRIS [IRIS reasoner]. For
WSMX application needs, the KAON2 reasoner has been used; MINS and IRIS may become viable options
once consolidated. KAON2 has the purpose ‘to retain scalability in reasoning with large ontologies and
knowledge bases’ [KAON2 reasoner]. It was initially designed for OWL [OWL language] (the Web
Ontology Language), but within the DIP project [DIP project] wrappers have been developed to transform
WSML into the KAON2 internal data representation. KAON2 is thus one of the most versatile reasoners
available, being able to process both OWL and WSML ontologies. It is a hybrid reasoning system that
combines the two knowledge representation paradigms by allowing Datalog-style rules to interact with
structural description logic knowledge bases. It supports the SHIQ(D) [5] description logic and disjunctive
Datalog [1] programs (Grimm and Nagypal 2006). Figure 3 presents a detailed WSML reasoner
architecture.
                        Figure 3: WSML reasoner architecture [Grimm S., Nagypal G., 2006].

WSMX (Web Service Execution Environment)

WSMX9 [Cimpian et al. 2005] is an execution environment supporting the dynamic discovery, selection,
mediation, invocation and interoperation of Semantic Web Services; it provides a reference implementation
for a service-oriented architecture that uses semantic annotation of all its major elements. To this end, a
general architecture as well as the necessary components have been defined, and the interfaces and
communication of components have been standardized. WSMX is a reference implementation for WSMO.
The development process for WSMX includes defining its conceptual model (which is WSMO),
standardizing the execution semantics for the environment, describing the architecture and a software
design, and building a working implementation.

While WSMX is intended as the reference implementation of Semantic Web Services systems, at this stage
its major purpose is to trigger standardization activities in OASIS and to connect many of the SWS efforts,
in order to provide a justification for, and a description of, the Semantic Web Services infrastructure in
general terms (on the conceptual level), rather than to focus on a specific implementation. The WSMX
specification is currently being further developed through OASIS as the Semantic Execution Environment
(SEE)10. Figure 4 presents the WSMX architecture and its most important components.




9
    http://www.wsmo.org/wsmx/
10
     http://www.oasis-open.org/committees/semantic-ex/
                       Figure 4: WSMX: A Reference Architecture for SEE [WSMX].

WSMX is a useful framework for both Web Service providers and requesters. As a provider, one may
register a service with WSMX in order to make it available to consumers; as a requester, one can find the
Web Services that suit one’s needs and then invoke them in a transparent, secure and reliable way. WSMX
itself is made available as a Web Service, so to find or invoke a Web Service a requester simply invokes
WSMX. In the first case, a formal description of the requester’s goal has to be provided; in the second case,
the actual data the requester wants to use for the invocation. In this way, WSMX can take care of all the
other required computations, such as heterogeneity reconciliation, composition, security or compensation.

Creating ontologies and semantic descriptions for Web Services is only useful if these descriptions can
ultimately be applied. Infrastructure is vital for a technology to be applied: web servers and web browsers
are the infrastructure that has led to the success of HTML on the web. WSMX is an execution environment
for finding and using Semantic Web Services described using WSMO.

With current Web Service technologies, a large amount of human effort is required in the process of finding
and using Web Services. First, the user must browse a repository of Web Services to find a service that
meets his requirements. Once the Web Service has been found, the user needs to understand the interface of
the service, the inputs it requires and the outputs it provides. Finally, the user has to write code that interacts
with the Web Service in order to use it. The aim of WSMX is to automate as much of this process as
possible.

The user provides WSMX with a WSMO Goal that formally describes what he would like to achieve.
WSMX then uses the Discovery component to find Web Services whose semantic descriptions, registered
with WSMX, can fulfil this Goal. During the discovery process, the user’s Goal and the Web Service
descriptions may use different ontologies. If this occurs, Data Mediation is needed to resolve the
heterogeneity issues. Data Mediation in WSMX is a semi-automatic process that requires a domain expert to
create mappings between two ontologies that overlap in the domain they describe. Once these mappings
have been registered with WSMX, the runtime Data Mediation component can perform automatic
mediation between the two ontologies.

Once this mediation has occurred and a service has been chosen that can fulfil the user’s Goal, WSMX can
begin the process of invoking the service. Every Semantic Web Service has a specific choreography that
describes the way in which the user should interact with it. This choreography semantically describes the
control and data flow of the messages the Web Service can exchange. In cases where the choreography of
the user and the choreography of the Web Service do not match, process mediation is required. The Process
Mediation component in WSMX is responsible for resolving mismatches between the choreographies
(often referred to as public processes) of the user and the Web Service.

The description of the WSMX components and their detailed functionality can be accessed at
www.wsmx.org. The reasoner component can be used by several other components – discovery,
composition and data mediation – in order to process the logic constraints needed by each of them. In the
next section we present the application that builds on the WSMX platform described above.
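
For illustration, the following self-contained Python sketch mirrors the pipeline just described – discovery
against registered service descriptions, data mediation via pre-registered mappings, then invocation. All
names and data structures are illustrative stand-ins, not the actual WSMX component API.

      # Conceptual sketch of the WSMX "achieve goal" pipeline; every identifier
      # here is hypothetical and stands in for a real WSMX component.

      # Registered services, each annotated with the ontology concept it provides.
      services = [
          {"name": "QuarryDataService",
           "provides": "eg:QuarrySite",
           "invoke": lambda data: {"result": "quarry features near %s" % data}},
      ]

      # Ontology mappings created by a domain expert and registered beforehand.
      mappings = {"my:StonePit": "eg:QuarrySite"}

      def achieve_goal(goal_concept, request_data):
          # 1. Data mediation: translate the goal's concept if ontologies differ.
          concept = mappings.get(goal_concept, goal_concept)
          # 2. Discovery: find a registered service whose annotation fulfils the goal.
          for service in services:
              if service["provides"] == concept:
                  # 3. Invocation (choreography and process mediation omitted).
                  return service["invoke"](request_data)
          return None

      print(achieve_goal("my:StonePit", "Orleans"))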

2.3     Geospatial Technologies

#...#
2.3.1   Overview

Geographic or Geospatial Information Systems (GIS) encompass all computer systems that can store, serve,
manipulate and display geo-localized information. Various types of data can be manipulated by such
systems, ranging from basic geographic data to meteorological data and information related to population,
industry, ecology, etc. These data are generally best used in combination, to improve the level of
information that can be presented to a user. For example, meteorological data is better visualised against a
map background, for improved understanding. The distributed nature of GIS data and services fits perfectly
with the web service model. However, a communication standard is required to ensure interoperability
between the components. The Open Geospatial Consortium (OGC) has defined, over the past ten years, a set
of specifications that describe such a communication standard. The following section provides an
introduction to this approach and these specifications, which are now widely used and recognized in the GIS
world.

#put OGC in perspective : OGC is not the only paradigm in GIS; MassMarket technologies are emerging;
explain how those technologies relate to this work#

2.3.2   Open Geospatial Consortium (OGC)

The OGC (Open Geospatial Consortium) is a non-profit, international organization that is leading the
development of standards for geospatial and location-based services. Its objective is to develop publicly
available interface specifications. The OpenGIS Specifications support interoperable solutions that geo-
enable the Web, wireless and location-based services, and mainstream IT.

The set of OpenGIS Specifications enables interoperability between GIS services. This interoperability is
important to guarantee independence from any particular service supplier. It allows data publishers and
consumers to interact without worrying about the details of implementing a communication mechanism. To
publish data, a provider chooses an OGC-compliant server. The consumer then implements the standardized
OpenGIS interfaces to connect to the services and use the data.

In the context of the SWING project, the OGC services are of two kinds: on the one hand, the data-access
and processing services; on the other, the catalogue service (CS/W).

The data-access and processing services comprise interfaces such as the WFS, WMS, WCS and WPS. They
allow the publication and processing of geospatial data, such as vector, raster or coverage data.

The Catalogue service has a specific, central role in the network. It is designed to provide a way for data
consumers to discover the services that have been previously published and annotated.

2.3.3   OGC Geospatial Web Services

#...#

Data and processing services

The OGC has defined access interfaces for each type of data: the WMS (Web Map Service) for raster data
like maps, the WFS (Web Feature Service) for geo-referenced features, and the WCS (Web Coverage
Service) for matrix data (e.g. altitude, temperature, …). The OGC has also issued a WPS (Web Processing
Service) specification covering the need for geospatial processing services.

The Web Map Service produces spatially referenced maps dynamically from geographic information. The
WMS supports the GetCapabilities operation for retrieving information about the service offered (content
and acceptable parameters) and the GetMap operation for retrieving the map itself. Optionally (Queryable
WMS), the WMS can also support the GetFeatureInfo operation to get information about the features
portrayed on the map. A WMS can produce maps based on its own geographical information or query a
WFS to retrieve the features to be displayed.
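
As a concrete illustration, the sketch below builds a WMS GetMap request in its key-value-pair encoding;
the endpoint address and layer name are hypothetical placeholders.

      from urllib.parse import urlencode

      WMS_BASE = "http://example.org/wms"  # hypothetical WMS endpoint

      def get_map_url(layers, bbox, width=800, height=600):
          """Build a WMS 1.1.1 GetMap request URL (key-value-pair encoding)."""
          params = {
              "SERVICE": "WMS",
              "VERSION": "1.1.1",
              "REQUEST": "GetMap",
              "LAYERS": ",".join(layers),
              "STYLES": "",
              "SRS": "EPSG:4326",
              "BBOX": ",".join(str(c) for c in bbox),  # minx,miny,maxx,maxy
              "WIDTH": str(width),
              "HEIGHT": str(height),
              "FORMAT": "image/png",
          }
          return WMS_BASE + "?" + urlencode(params)

      print(get_map_url(["quarries"], (-6.06, 41.16, 10.88, 51.29)))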

The Web Feature Service allows a client to query and update geospatial data encoded in the Geography
Markup Language (GML). The WFS supports the GetCapabilities operation for retrieving information
about the service offered (feature types and the operations available on the features), the
DescribeFeatureType operation to retrieve the structure of a feature, and the GetFeature operation to
retrieve instances of a feature (filters can be used to specify the properties to be sent in the result and the
spatial location). Optionally, the WFS can support the GetGmlObject operation (XLink WFS) to retrieve
instances by following XLink references, the Transaction operation (Transaction WFS) to modify features
(create, update and delete), and the LockFeature operation to allow serializable transactions.

The Web Coverage Service provides access to digital spatial information representing space-varying and
potentially time-varying phenomena. A WCS returns its data in their original format. The WCS supports the
GetCapabilities operation for retrieving information about the service offered (service and data
information), the DescribeCoverage operation for retrieving a complete description of the data, and the
GetCoverage operation for retrieving the data themselves in a well-known coverage format.

The Web Processing Service offers any sort of GIS functionality to clients across a network, including
access to pre-programmed calculations and/or computation models that operate on spatially referenced data.
A WPS may offer calculations as simple as subtracting one set of spatially referenced numbers from
another, or as complicated as a global climate change model. The data required by the WPS can be
delivered across a network, or be available at the server. The Web Processing Service is targeted at
processing both vector and raster data.

Catalogue service

The OGC has defined a Catalogue specification that defines the interfaces and data required to publish and
access digital catalogues of metadata for geospatial data, services and related resource information. The
Catalogue acts as a central repository for publishing resources described by metadata. The Catalogue can
then be used to discover the required resources using an identified query language.

This section goes into the details of the Catalogue specification, and particularly into the difference between
the ISO and ebRIM profiles, because such issues will be at stake when later discussing the means to insert
ontologies into the Catalogue.

The Catalogue specification: a layered definition


[Figure 5 depicts a stack of three layers, from top to bottom: the CS/W Application Profile, which defines
the metadata model; CS/W (Catalogue Service/Web), which defines the Catalogue/HTTP protocol; and
OGC Catalogue Services 2.0, the abstract catalogue interface.]

           Figure 5: The stack of OGC specifications that constitute a full CS/W profile
           specification.


As can be seen in Figure 5, the catalogue specification is divided into three distinct ‘layers’ defining the
catalogue interfaces and behaviour on three different levels of abstraction. It is useful here to go deeper into
the details of these levels since one of these levels of specification is likely to be the subject of change
recommendations that the SWING project may issue to the OGC.

The figure above represents those three layers, namely:

    •   OGC Catalogue Service: this part of the specification functionally defines the basic operations (such
        as HARVEST, GETRECORD, …) at a very abstract level. It does not define the communication
        protocols, the structure of the messages, or the representation model to use for the metadata.

    •   CS/W (Catalogue Service/Web): this specification, built on top of the previous one, describes how to
        implement the above mentioned operations on a particular communication protocol, namely HTTP.

    •   CS/W Application Profile: a Catalogue Application Profile is an implementation specification; it
        defines how metadata should be modelled and stored within the catalogue. There exist at the moment
        two main application profiles, namely ebRIM and ISO19115.

Catalogue profiles: ebRIM vs ISO19115

The two competing application profiles, ebRIM and ISO19115, offer two very different approaches to the
modelling and storage of catalogue metadata. It is important to detail these differences here, and the choices
they entail, because the requirements of the SWING project will involve enhancing the metadata model to
insert semantic metadata.

On one hand, the ISO19115 profile is based on the ISO19115/19119 specifications to describe the metadata
stored in the catalogue. These specifications define in a very detailed way the structure and content of
metadata related to geospatial data and services (Figure 6 shows a very small portion of the ISO19115
model). Therefore, a catalogue compliant with the ISO profile will be able to store metadata expressed in
the ISO19115 structure only. No flexibility is provided for other data structures.

                        Figure 6: A small excerpt of the ISO19115 model.

On the other hand, the ebRIM profile recommends the use of the ebRIM specification defined by the
OASIS group. This specification allows the description and storage of arbitrary data structures. The ebRIM
profile is thus rather a meta-model: it allows the definition of arbitrary objects and arbitrary associations
between them.

In a few words, choosing between the ISO and the ebRIM profile can be summarized as a choice between
simplicity and genericity. The ISO19115 profile allows for better interoperability (since everybody knows
the exact content of any instance of metadata), but restricts the metadata to the ones defined in the ISO
models. The ebRIM profile, on the other hand, is more powerful in terms of flexibility, but weaker as a data
model specification, since using the ebRIM model alone is not enough to agree on the representation of
metadata and achieve interoperability. And because it is a meta-model, ebRIM can be seen as a superset of a
profile such as the ISO profile: ISO data types can be described and stored in an ebRIM structure.

Ionic has chosen the ebRIM profile (the Ionic RedSpider Catalogue product implements only this profile). It
must be noted that recommending the Catalogue ebRIM profile does not mean rejecting the ISO
specifications for the representation of metadata but provides another way to store the ISO data structure or
any other data structure using the ebRIM meta-model. This choice should be seen as being in favour of more
genericity, allowing the Catalogue to store various kinds of data.

Furthermore, in the scope of the SWING project, we will want to store semantic annotations in the
catalogue. Without going into the details of what will be the subject of deliverable D5.2, it is likely that we
will need the flexibility of ebRIM to store such metadata.

It must also be noted that the OGC recently passed a motion recommending ebRIM as the preferred profile
for implementations of the Catalogue specification. Extension packages have been defined by the OGC to
specify how each data model, like ISO 19115/19119, can be stored in a standardized and interoperable way
within the ebRIM model.
Catalogue query language: the OGC Filter specification

Discovery of services in the Catalogue is done using the GetRecords operation. Queries can be expressed
using filters as defined in the OGC Filter Specification. Below is an example of such a request, in which the
filter elements wrap the filter arguments (the literal values being searched for).

Basically, this request says “give me all layers named ‘quarries’ that are WMS layers, and that belong to this
bounding box”. As one can see in this request, filters mainly allow for string matching and spatial querying.

The result of such a query would be as below (the result shown here has been edited to show only the
relevant data; the actual result is much more verbose):


      <?xml version="1.0" encoding="UTF-8"?>
      <csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw"
      xmlns:ogc="http://www.opengis.net/ogc"
      xmlns:gml="http://www.opengis.net/gml"
      version="2.0.0"
      outputSchema="EBRIM">
      <csw:Query typeNames="ExtrinsicObject">
      <csw:ElementName>/ExtrinsicObject</csw:ElementName>
      <csw:Constraint version="1.0.0">
      <ogc:Filter>
      <ogc:And>
      <ogc:PropertyIsLike>
      <ogc:PropertyName>
      /ExtrinsicObject/Name/LocalizedString/@value
      </ogc:PropertyName>
      <ogc:Literal>Quarries</ogc:Literal>
      </ogc:PropertyIsLike>
      <ogc:PropertyIsEqualTo>
      <ogc:PropertyName>/ExtrinsicObject/@objectType</ogc:PropertyName>
      <ogc:Literal>WMS_Layer</ogc:Literal>
      </ogc:PropertyIsEqualTo>
      <ogc:Intersects>
      <ogc:PropertyName>
      /ExtrinsicObject/Slot[@name="FootPrint"]/ValueList/Value[1]
      </ogc:PropertyName>
      <gml:Box srsName="EPSG:4326">
      <gml:coordinates>-180.,-90. 180.,90.</gml:coordinates>
      </gml:Box>
      </ogc:Intersects>
      </ogc:And>
      </ogc:Filter>
      </csw:Constraint>
      </csw:Query>
      </csw:GetRecords>
    <?xml version='1.0' encoding='utf-8' ?>
      <csw:GetRecordsResponse xmlns:csw="http://www.opengis.net/cat/csw"
                              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <csw:SearchStatus status="complete"/>
        <csw:SearchResults numberOfRecordsMatched="1"
                           numberOfRecordsReturned="1">
        <ebxml:ExtrinsicObject id="urn:uuid:1f5d09e6-ba7a-4540-8dce-acb7461fe81b"
                               objectType="WMS_Layer">
          <ebxml:Name>
            <ebxml:LocalizedString xml:lang="en-US" charset="UTF-8" value="Quarries"/>
          </ebxml:Name>

           <ebxml:Slot name="SRS" slotType="String">
             <ebxml:ValueList>
               <ebxml:Value>EPSG:4326</ebxml:Value>
               <ebxml:Value>EPSG:27582</ebxml:Value>
             </ebxml:ValueList>
           </ebxml:Slot>

           <ebxml:Slot name="Service_Layer_Name" slotType="String">
             <ebxml:ValueList>
               <ebxml:Value>quarries</ebxml:Value>
             </ebxml:ValueList>
           </ebxml:Slot>

          <ebxml:Slot name="FootPrint" slotType="Geometry">
            <ebxml:ValueList>
              <ebxml:Value xmlns:gml="http://www.opengis.net/gml">
                <gml:Polygon srsName="EPSG:4326">
                  <gml:outerBoundaryIs>
                    <gml:LinearRing>
                      <gml:coordinates>-6.06258,41.1632 10.8783,41.1632 10.8783,51.2918 -6.06258,51.2918 -6.06258,41.1632</gml:coordinates>
                    </gml:LinearRing>
                  </gml:outerBoundaryIs>
                </gml:Polygon>
              </ebxml:Value>
            </ebxml:ValueList>
          </ebxml:Slot>

      <!-- <Association id='urn:uuid:120bdaae-e903-48b4-8e11-c59fff185c57'
                        associationType='Contains'
                        sourceObject='urn:uuid:1f5d09e6-ba7a-4540-8dce-acb7461fe81b'
                        targetObject='urn:uuid:0a2055c2-3256-4cf7-88eb-f6463e6d3bda'/> -->
      <!-- <Association id='urn:uuid:ba2bd568-9d4d-46ac-a7eb-ed70c78e7e4e'
                        associationType='Contains'
                        sourceObject='urn:uuid:1f5d09e6-ba7a-4540-8dce-acb7461fe81b'
                        targetObject='urn:uuid:3d33cfe6-abad-4d25-969c-8d57636b14bc'/> -->

        </ebxml:ExtrinsicObject>
      </csw:SearchResults>
    </csw:GetRecordsResponse>


In this sample result, the significant fields are the record’s name, its SRS, service layer name and footprint
slots, together with their values. The association declarations, which appear above as commented-out
elements, define how this object is linked to other objects in the database.
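
To give an impression of how a client consumes such a response, the following sketch extracts the record
identifiers and names with Python's standard XML parser. The ebxml namespace URI is an assumption,
since the sample above omits its declaration.

      import xml.etree.ElementTree as ET

      # Namespace URI for the ebxml registry model; assumed, as the sample omits it.
      EBXML = "urn:oasis:names:tc:ebxml-regrep:xsd:rim:3.0"

      def extract_records(response_xml):
          """Yield (id, name) pairs for each ExtrinsicObject in a GetRecords response."""
          root = ET.fromstring(response_xml)
          for obj in root.iter("{%s}ExtrinsicObject" % EBXML):
              name_el = obj.find(".//{%s}LocalizedString" % EBXML)
              name = name_el.get("value") if name_el is not None else None
              yield obj.get("id"), name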

Looking for metadata

One of the main issues in today’s use of cataloguing solutions is the lack of proper metadata. Service
metadata are at best incomplete, at worst nonexistent, and most of the time expressed in so many different
languages or conventions that any actual interoperability is prevented.

Results from the SWING project may help tackle the two main aspects of this problem: obtaining metadata,
by helping people create metadata through the semi-automatic annotation of services (WP4); and achieving
better interoperability between the contents of various metadata, through annotations using semantic
concepts (defined by WP3).

2.3.4   Open Issues

As previously discussed, Geospatial services, as stand-alone units of functionality within SDIs, need to be
discovered, executed and sometimes chained to perform complex scenarios. OGC Web Services (OWS)
were developed with the intention of offering a high degree of interoperability. This is achieved on the
interface level with standardized interfaces (syntactic interoperability). However, interoperability includes
more aspects than the syntactic one. Semantic heterogeneities are inevitable in a real-world SDI setting,
and make the composition of service chains a cumbersome process. Specifically, semantic heterogeneities
cause problems in Geospatial service discovery, data retrieval, and interoperation. In what follows, we
consider these issues in more detail. We then discuss Web Processing Services, which are particularly
difficult to handle due to the wide range of operations they may carry out.

Semantic heterogeneities in Discovery

Currently, Geospatial service discovery is based on keywords. This suffers from the usual weaknesses of
keyword-based discovery:

    •   Low recall, i.e., not all suitable services are discovered. Assuming the user enters the keyword
        “quarry” when searching the catalogue for data about quarries, she will fail to discover data-
        providing services that are described by keywords like “pit” or “mine”. More generally, the human
        creators of the metadata for the services may use different words to describe the same things.
    •   Low precision, i.e., not all discovered services are suitable. Searching for “quarry” might return
        “aggregate quarries”, which are wanted, but the search might also return “lignite quarries”, which are
        not wanted. More generally, the human creators of the metadata for the services may describe
        different things with the same words.

Semantic heterogeneities in Retrieval

Retrieval problems may occur wherever two services are connected in a service chain. Most commonly, the
application schema of feature types is only given as an XML Schema. Since this information is essential for
formulating filters on Web Feature Services and for indicating the correct calculation parameters for Web
Processing Services, the user runs into problems when the property names of features are not intuitively
interpretable and no additional information is available.

Semantic heterogeneities in Interoperation

In addition, translation problems may occur wherever two services are connected in a service chain. They
concern the translation required to make one given data model fit another, and include:

   •    Structural heterogeneity. Relates to differences in schemas/data models. In a first step such
        mismatches need to be identified; e.g. corresponding attributes might occur in different parts of two
        schemas. In a second step the transformation rules to solve the mismatch need to be specified.
   •    Naming heterogeneity. The same keyword-related difficulties that a user experiences during
        discovery and retrieval might also affect the interoperability between the input and output
        specifications of two services that are to be chained.
   •    Measurement scale conflicts. Such problems include the use of varying kinds of scales (nominal and
        ratio) between the attributes of two schemas, as well as the use of varying units. In our example
        scenario, both the output data model of the Consumption WPS and the input data model of the
        Production-Consumption WPS may have an attribute called “consumption”. If the two use different
        units (e.g. one kilograms, the other tons), it is essential that the user be able to recognize this
        mismatch.
   •    Spatial reference system conflicts. Different Geospatial services may use different spatial reference
        systems, and mappings between those must be used to make the services communicate successfully.
   •    Precision conflicts. For example, the Consumption WPS might require the population density and the
        average consumption at the same accuracy in order to generate valuable results. In this case a
        mechanism is required that recognizes whether the Admin WFS and the Aggregate WFS comply with
        this restriction.
   •    Resolution conflicts. These may occur if data from the Admin WFS and the Aggregate WFS are used
        together in the Consumption WPS. For example, the “communities” within the Admin WFS and the
        “administrative entities” used within the Aggregate WFS might not correspond to each other. If
        administrative entities correspond to sub-parts of the communities, then they need to be aggregated
        before the data can be used for the consumption calculation.

Geoprocessing – How to capture the semantics of operations?
A major problem for service composition is to ensure that services matching certain user goals are properly
discovered; such matching requires clear and unambiguous ways of describing the operations involved.
Such descriptions have to match in two aspects:

    1. The semantics of the operations (or functionality). This ensures that the service actually does what
       the user expects it to do.

    2. The semantics of the interfaces of adjacent services. This ensures that the service correctly interprets
       the data it receives as input from the preceding service.

For data services such as WFS, and portrayal services such as WMS, the semantics of the operations is not a
big problem. The operations are well defined and have agreed-upon, unambiguous semantics. Hence the
semantic description of these operations is likely to be similar in all the services. WPSs, on the other hand,
also offer a standardized interface; but the processing functionality underlying this interface may differ a lot
between different WPSs. This calls for explicit semantic descriptions of the processing operations offered.

We cite the example introduced in Lutz (2006) to illustrate the need for explicit semantic descriptions of
geoprocessing operations. Calculating the distance between two points on the earth’s surface is not a trivial
business. Different kinds of distance measures are applied in various kinds of spaces: for example, Euclidian
distance measured in the plane, shortest-path distance in a graph, and geodesic or Euclidian distance in 3-
dimensional space (R³). Although all these operations differ significantly in their functionality and their
input requirements, they could all be described by the same (syntactic) signature, where x1/y1 and x2/y2
each represent a pair of coordinates:

      distance(x1: double, y1: double, x2: double, y2: double): double

Distance services described with this signature can differ widely in the semantics of the input required, the
output produced and the functionality provided. For example, the input parameters could represent
geographic or projected coordinates (in a specific coordinate reference system) and the output could
represent the distance in kilometers, miles or degrees. It is not possible to judge only from the signature what
kind of distance operation the service provides and how its result is to be interpreted.
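
The point can be made concrete in code. The two functions below share the four-parameter signature given
above, yet compute entirely different distances; the geodesic variant uses the haversine formula and assumes
geographic coordinates in degrees.

      import math

      def distance_euclidean(x1, y1, x2, y2):
          """Planar (Euclidean) distance; meaningful only for projected coordinates."""
          return math.hypot(x2 - x1, y2 - y1)

      def distance_geodesic(x1, y1, x2, y2, radius_km=6371.0):
          """Great-circle (haversine) distance for geographic coordinates in degrees."""
          lon1, lat1, lon2, lat2 = map(math.radians, (x1, y1, x2, y2))
          a = (math.sin((lat2 - lat1) / 2) ** 2
               + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
          return 2 * radius_km * math.asin(math.sqrt(a))

      # Identical signatures, very different semantics:
      print(distance_euclidean(0, 0, 10, 0))  # 10.0 units in the plane
      print(distance_geodesic(0, 0, 10, 0))   # about 1112 km along the equator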
3       Combining SWS and Geospatial Services: Potential Benefits, and
        Technical Challenges - Joerg Hoffmann, Dumitru Roman, LFUI
In this chapter, we provide an overview of how the domain of Geospatial services can benefit from SWS
technologies, and what the main challenges and technical issues are. In particular, we address the open issues
outlined in the previous chapter. We start with a brief summary of how those issues can be solved with SWS
technology. A series of sub-sections then discusses the main technical aspects in more detail.

To a large extent, the domain of Geospatial Web services experiences the same open issues as the broader
Web service context, for which SWS techniques were developed. Hence, in principle the SWS approach is
well-suited to help address those issues. Still, significant challenges arise from the particularities of the
Geospatial domain.

The open issues, as outlined above, come from the areas of (a) discovery, (b) retrieval, and (c) interoperation;
these issues arise for all kinds of Geospatial Web services, and are particularly challenging for WPS services.
From a SWS perspective, (a) is a perfect match with the issues usually addressed around SWS discovery.
What must be decided is how exactly Web services should be annotated, how exactly discovery queries
should be formulated, and how exactly the annotations and queries should be compared – matched against
each other – in order to implement the semantic discovery process. This kind of issue has been addressed in
the SWS area for a long time, and one can draw from a significant pool of well-researched approaches. An
approach must be selected, and the necessary adaptations for the Geospatial domain and its particular
forms of services (in particular WFS and WPS) must be made.

Regarding (b) and (c), those issues occur in what the SWS community knows as composition and execution.
The former addresses techniques for supporting, or even fully automating, the construction of new Web
services by combining existing ones in a suitable way; the latter addresses techniques for supporting the
flexible execution of composed Web services, in particular taking care of their integration. Most of the open
issues arising in the Geospatial domain correspond to a need for (what the SWS community calls) mediation
during either composition or execution. The mediation problems are highly non-trivial, because mediation
for SWS is a complex and as yet unsolved problem; in particular, the Geospatial domain requires some
fairly specific mediation techniques that have not yet been researched in the SWS community.

The basis of both semantic discovery and semantic composition/execution is the decision on the logical
language used to formalize the semantics. In the remainder of this chapter, we first discuss that decision in
the context of the Geospatial domain. We then discuss semantic discovery and semantic
composition/execution, in that order.

3.1       Which Logics?

We go through the most relevant particularities of the domain of Geospatial services, and examine how they
affect the choice of the logical language to use. We consider, in this order: Spatial Constraints and
Operations; Web Feature Services; and Web Processing Services.

###

Since geospatial data models (also called application schemas) serve to represent entities that are present in
the real world, we have argued for the need for semantic annotation (Figure 2 of SWING D3.1). This section
explains the conceptual and technical realisation of such annotations using WSML-Flight (Roman, Keller et
al. 2005). The annotation of data-providing services is documented in the first part. Here, Web Feature
Services (WFSs) serve as examples. The WFS is one of the two common data provision services specified
by the Open Geospatial Consortium (OGC)11 (OGC 2005). The second part of this section focuses on the
description of processing services. The descriptions are motivated by the need to discover Web Services
that provide the processing required for a desired analysis. The Web Processing Service (WPS) is used as an
example. Recently, the first version of the WPS interface specification was published by the OGC (OGC
2007).


11
     The official Web page is available from http://opengeospatial.org.
Annotation of Geospatial Web Services

The baseline for semantic annotation has been specified (SWING D4.1), the WSML variant to formalise the
statements has been chosen (SWING D3.1), and we have started to examine methods for automation, which
will help the user in specifying the semantic annotations of an information source by elements from the
domain ontology (Grcar and Klien 2007; Klien 2007). The question, which has not yet been satisfactorily
answered, is how exactly the annotation has to be formalised in the SWING project ontologies.

###
Existing Approaches

Several approaches exist that use logic and terminological reasoning for describing and reasoning on
geographic information (Bowers, Lin et al. 2004; Frank 2003; Janowicz 2006; Lemmens and Vries 2004;
Lutz 2006; Lutz and Klien 2006). Most of the recent work has focused on Description Logic (DL) in
combination with subsumption reasoning. DL is a family of knowledge representation languages that are
subsets of first-order logic (for a mapping from DL to FOL, see e.g. Sattler, Calvanese et al. 2003). They
provide the basis for the Web Ontology Language (OWL-DL), the proposed standard language for the
Semantic Web (W3C 2004). Subsumption corresponds to checking whether one class description subsumes
(is more general than) another class description. By doing this for all classes in the knowledge base, one can
compute the subsumption hierarchy. By checking the place of a given class description in the subsumption
hierarchy, this class description is classified with respect to the knowledge base, and hidden relationships
with other classes in the knowledge base become visible (De Bruijn, Lara et al. 2004).
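
The following sketch illustrates classification in its simplest form: computing all subsumers of a class by
taking the transitive closure over asserted subclass links. The class names form a hypothetical mini-
ontology; a real DL classifier derives such links by reasoning over complex concept descriptions, which is
omitted here.

      # Asserted subclass links of a hypothetical mini-ontology.
      SUBCLASS = {("AggregateQuarry", "Quarry"),
                  ("LigniteQuarry", "Quarry"),
                  ("Quarry", "MiningSite")}

      def subsumers(cls):
          """All classes that subsume cls, following subclass links transitively."""
          result, frontier = set(), {cls}
          while frontier:
              step = {sup for (sub, sup) in SUBCLASS if sub in frontier}
              frontier = step - result
              result |= step
          return result

      print(sorted(subsumers("AggregateQuarry")))  # ['MiningSite', 'Quarry']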

In the SWING application, we need a strategy that allows for ontology-based discovery and retrieval
(respectively execution) of geographic feature types and geoprocessing functionalities. Lutz and Klien
(2006) present an approach for the ontology-based discovery and retrieval of geographic feature types using
DL and subsumption. Lutz (2006) presents a survey on the description of, and the reasoning on,
geoprocessing functionalities based on DL and FOL formalisms. In the following, we have a closer look at
both approaches and examine to what extent they are reusable in the SWING context.

3.1.1   Spatial Constraints and Operations

The Geospatial services domain frequently involves particular arithmetic; namely, in spatial constraints that
specify the covered/required Geographical areas, and in spatial operations required to resolve measurement
scale conflicts, spatial reference system conflicts, and precision conflicts. In both cases, one may choose to
encode the required arithmetic into logics; in both cases, we come to the conclusion that it is more
appropriate to leave the responsibility with specialised (Geospatial) techniques instead. An encoding in
logics is possible, but not natural, has no real benefits, and likely comes at a high computational cost.

Consider first spatial constraints. Whenever a Geospatial service offers some functionality, and whenever a
functionality is requested, a relevant Geographic area can, if desired, be specified in terms of a bounding
box (or, in the case of WFS retrieval filters, a polygon). In particular, during discovery, for every available
service it must be tested whether the area served by the service intersects/subsumes the area requested by
the query. This test is a simple arithmetic operation that can be implemented very easily and efficiently. A
more “generic” option, in the Semantic Web context, would be to encode the arithmetic into logics instead.
For example, say a discovery query comes with the bounding box (17,5; 37,25) (left top corner; right
bottom corner) and a service is annotated with the bounding box (35,20; 40,40). Then, formulated in logics,
the query would contain the formula:

                             X ≥ 17 AND Y ≥ 5 AND X ≤ 37 AND Y ≤ 25

The service’s precondition, on the other hand, would contain the formula:

                             X ≥ 35 AND Y ≥ 20 AND X ≤ 40 AND Y ≤ 40

To check whether the query matches the service, one would create the combined formula:

                          X ≥ 17 AND Y ≥ 5 AND X ≤ 37 AND Y ≤ 25 AND
                              X ≥ 35 AND Y ≥ 20 AND X ≤ 40 AND Y ≤ 40

This formula would then be passed to a generic reasoning tool able to handle arithmetic expressions of this
kind. That tool would then answer the question of whether the (combined) formula is satisfiable, which is the
same as the question whether the two bounding boxes intersect. The problem with this approach is that the
generic reasoning tool is designed to be able to deal with all formulas built out of the involved arithmetic
operations. Such generality does not come without a price. Indeed the price of arithmetic theorem proving is
quite high. Even very restricted formalisms are intractable or undecidable, and the existing tools usually
scale very badly. More pragmatically, as the above example should have illustrated, such an approach really
“shoots cannons at birds”, i.e., addresses the problem with unnecessarily general techniques. One does not
need arithmetic theorem proving to check intersection of bounding boxes.

The reader may argue that the above example uses only very simple arithmetic operators (indeed, only the
“<” and “>” comparisons) and that reasoning tools may exist that are efficient for such a restricted language.
While this is true, even such a tool is likely to involve a lot of overhead when it sets up its data structures for
every matching check.
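
For contrast, the dedicated check amounts to a few comparisons. The sketch below tests bounding-box
intersection directly, using the boxes from the example above; it answers the same question as the
satisfiability check over the combined formula.

      def boxes_intersect(a, b):
          """Intersection test for boxes given as (xmin, ymin, xmax, ymax)."""
          axmin, aymin, axmax, aymax = a
          bxmin, bymin, bxmax, bymax = b
          return (axmin <= bxmax and bxmin <= axmax and
                  aymin <= bymax and bymin <= aymax)

      # The query box (17,5; 37,25) and the service box (35,20; 40,40):
      print(boxes_intersect((17, 5, 37, 25), (35, 20, 40, 40)))  # True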

For the spatial operations underlying unit, spatial reference system, and precision transformations, matters
are very similar. Formulation as arithmetic logic is possible, but not needed and hence an unnecessary
overhead. For example, say we want to resolve a measurement scale conflict between m and cm using
arithmetic logics. We have a web service with output field X, measured in m, which takes the value 5 in our
ongoing execution. We want to connect X to input field Y, measured in cm, of another web service. In our
knowledge base, we have an axiom 100*cm=m; this perceives m and cm as constants and states how those
constants behave in equations. In our execution, we hence have the formulas X = 5 m, 100*cm = m, Y = X,
and Y = Z cm. We want to know the value of Z. In terms of logical reasoning, this means that we have to
find a value for Z so that, inserting that value, the negation of the conjunction of these formulas is
unsatisfiable. This is the case only for Z = 500.

The example should illustrate nicely that it is not natural at all to resolve measurement scale conflicts by
arithmetic logical reasoning, i.e., as part of the reasoning that underlies the rest of the semantic framework.
What’s more, the generic reasoning does not bring an advantage. There are a fixed number of necessary unit
transformations, and those can simply be hard-coded. Similar arguments apply to spatial reference system
transformations, and to precision transformations.
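
The hard-coded alternative argued for here is a small, fixed conversion table. The sketch below resolves the
m-to-cm example directly, reproducing Z = 500 without any theorem proving.

      # Fixed unit transformations, hard-coded via factors to a base unit (metres).
      TO_METRES = {"m": 1.0, "cm": 0.01, "km": 1000.0}

      def convert(value, from_unit, to_unit):
          """Convert a length between units via the base-unit table."""
          return value * TO_METRES[from_unit] / TO_METRES[to_unit]

      # The example from the text: X = 5 m fed into an input expecting cm.
      print(convert(5, "m", "cm"))  # 500.0, i.e. Z = 500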

All in all, it can be concluded that spatial constraints and operations should be dealt with by dedicated
Geospatial techniques, rather than by generic techniques based on logics. The challenge then is to efficiently
combine the logic-based techniques with the dedicated Geospatial ones. We will see further below that, in
many cases, such a combination is quite natural.

3.1.2   Web Feature Services

WFS and WPS are the key components from which Geospatial web service compositions – implementing
information gathering and aggregation – are built. Hence these two kinds of Geospatial services are the most
important ones to describe in logics, for discovery and execution. We will focus on WPS in the next
subsection.

###

In standard OGC-compliant Catalogues, users register geospatial web services by providing metadata (e.g.
ISO 19115, ISO 19119) on the data and/or functionality they serve. In the following, we analyze what is
needed to register and annotate geospatial web services in WSMX. We distinguish between the processes of
registration and semantic annotation. Registration refers to transforming the available service descriptions
into the formalism (i.e. WSML) needed for further information processing in WSMX. Semantic annotation
refers to making the semantics of the service’s underlying functionality or data explicit by establishing a link
to domain ontologies. Thus, the goal of the annotation process is to generate a WSMO WEBSERVICE (written
in WSML) for a specific OGC service that integrates explicit semantic descriptions of the functionality or
data that is served. We will analyze the annotation task separately for geographic information services (like
WMS, WFS, and WCS) and processing services (like WPS).

###
For WFS, what we need to describe in logics is their feature type, i.e., we need to state what kind of data they
provide. In recent years, some works in this area have appeared. These works are all based on variants of
description logics. The idea is to describe the data output of a WFS as a DL concept A (“I output these
data”), and to describe the discovery query as a DL concept B (“I want these data”). The matching is then
formulated as standard DL subsumption reasoning: the WFS matches the query iff B is a sub-concept of A
(“The desired data is provided”). The latter is checked using generic DL reasoning techniques. It is important
to note here that the “concepts” A and B are DL concepts. That is, they are not atomic but can be built from
complex constructors such as existential and universal DL quantification.

A disadvantage of the above method is that generic reasoning techniques checking subsumption for DL
concepts are fairly complex. Their worst-case complexity is exponential, and every single call to them
involves considerable overhead for setting up the data structures (which are tailored to solving reasoning
tasks with huge search spaces). If there are tight time constraints, even just the latter overhead can be
prohibitive in discovery, where many reasoning tasks – one for every available service – must be solved.
Another, perhaps more important, disadvantage of using DL for WFS is that it does not seem to be user-
friendly enough, in the following sense. A general conclusion from these works is that the involved DL
concepts become complicated even for rather simple WFS and discovery queries. This is a serious
disadvantage because, ultimately, those concepts need to be created by the service provider and the end
user, respectively. Neither of those will be a DL expert. Some attempts have been made towards the user-
friendly formulation of DL discovery queries, but that direction does not appear very promising.

A promising alternative to DL is logic programming, specifically Datalog and its extensions. Datalog is a
very simple logical formalism consisting of rules. Each rule states that a certain logical predicate – the rule
head – can be derived if a certain logical condition – the rule body – is known to hold true. Datalog restricts
the rule body to a conjunction (a set) of logical atoms. The advantage of Datalog lies, obviously, in its
computational properties: evaluating a set of Datalog rules against a database of facts is exponential only in
the number of variables involved in each rule – not in the number of rules, nor in the size of the database.

How does the notion of Datalog rules lend itself to formulating WFS feature types? The idea is very simple.
Each WFS service w is formulated using one rule. The rule head is a new predicate designating the output of
w. The rule body is a set of atoms from the underlying domain ontology, constraining the data output of w.
In other words, this approach restricts WFS output descriptions to conjunctions of atoms from the domain
ontology. If the WFS provides several alternative kinds of data – which is typically the case – then one
Datalog rule is used for each kind. Note that this is really more appropriate than allowing disjunction in the
logics itself: the number of distinct kinds of data is typically small, and no other disjunctive element is
present. Similarly, there is no real use for negation in the description of WFS services. Hence, allowing a
more complex logic is, once again, like shooting cannons at birds.

In our work, we have found the above modelling approach to be entirely adequate to formulate the relevant
distinctions between different WFS services. Aside from the computational advantages of Datalog, the
semantic descriptions are also much easier to generate and much more human-readable than DL descriptions.
For one thing, Datalog rules are very similar to Database queries, which many potential users are already
familiar with. For another thing, essentially all that is needed, for annotating a WFS service, is the selection
of a suitable set of concepts and relations from the domain ontology. We will see below, in Chapter 6.3, that
this can be supported quite well.
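
A deliberately simplified sketch of this matching idea follows: each WFS output is described by a set of
atoms from the domain ontology (standing in for the Datalog rule body), and a query matches when every
atom it requires is provided. The service names and atoms are hypothetical, and variable unification –
which a real Datalog engine would perform – is omitted.

      # WFS outputs described as sets of (hypothetical) domain-ontology atoms.
      wfs_descriptions = {
          "QuarryWFS/quarries": {("Quarry",), ("hasLocation", "Geometry")},
          "AdminWFS/communities": {("AdministrativeUnit",), ("hasLocation", "Geometry")},
      }

      def discover(query_atoms):
          """Return the WFS outputs whose description covers all query atoms."""
          return [name for name, atoms in wfs_descriptions.items()
                  if query_atoms <= atoms]

      print(discover({("Quarry",)}))  # ['QuarryWFS/quarries']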

3.1.3   Web Processing Services

WPS are significantly more complex than WFS, in that their functionality is much less uniform. A WPS can
implement an arbitrary function of its input parameters. This makes WPS difficult to describe semantically.

One option is to simply “describe” the WPS in terms of the arithmetic equation it computes. An obvious
drawback of this option is that, for many WPS, these equations might be fairly complex, and difficult to
write up succinctly. And even if that is not the case, the approach is certainly not user friendly (quite the
opposite, in fact). Only people very familiar with (Geospatial) mathematics can be expected to work with
arithmetic equations on an everyday basis. A further point is that the arithmetic approach is flawed
conceptually. It does not “semantically annotate” the WPS; it just states its entire functionality in detail.
This is more like publishing the WPS source code than like annotating it with meta-information. Semantic
annotation should not expose all details about the advertised web service, but expose just enough detail so
that the web service can be discovered by the right people, and sufficiently efficiently.
Following the policy of publishing “just enough detail”, the question is what “enough detail” is, and how to
formulate it in logics. We need to express dependencies between the inputs and the outputs of the WPS. This
goes beyond the capabilities of description logics, which can only define the types of the inputs and outputs.
Since DLs are a subset of first-order logics, an idea that springs to mind is to extend the input/output type
descriptions with first-order logics to model the input/output dependencies. In this context, the dependencies
can be formalised relative to a first-order axiomatisation of the relevant domain. This, in turn, offers the
possibility to axiomatise the relevant domain at a chosen level of detail, hence providing us with a formal
instrument for publishing “just enough detail”. In Lutz (2006), this approach is applied to distance measures.
A first-order axiomatisation is constructed that is not physically exact, but that contains enough details to
distinguish between different distance measures; in a nutshell, the measures are distinguished in terms of
the type of hyper-space within which the two points are to be connected.

While the first-order approach seems promising, it also has some important drawbacks. First of all, its user-
friendliness is questionable. The first-order axiomatisation of distance measures developed in Lutz (2006) is
not any more understandable than the arithmetic equations defining these measures. If anything, the notion
of “hyper-space”, and its role inside the set of first-order formulas defining the axiomatisation, is even more
confusing than the plain arithmetic notation. Another major drawback of the first-order approach lies in its
computational demands. In contrast to DL, first-order logic is not even decidable. Accordingly, generic
reasoners for first-order logics generally scale even worse than those for DL. A final observation regarding
this approach is that, at bottom, the distinctions between distance measures are actually made by keywords
rather than logics. Namely, the description of all the distance measures is the same except for the type of
hyper-space that they use. In other words, there is a fixed set of different kinds of hyper-spaces (e.g. planes,
3D space, directed graphs), and each characterises a distance measure. Hence the classification scheme
comes down to the names of the different types of hyper-spaces. This is not much different from classifying
the measures by the usual keywords such as “Euclidian distance” and so forth. In fact, the latter is probably
much easier to understand for an end-user; we get back to this below.

In our work, we found that Datalog, as also used for WFS, is a promising alternative. The key observation is
that, in contrast to DL, Datalog allows one to “share variables” between the formula characterising the
inputs of a WPS (the precondition) and the formula characterising the outputs of a WPS (the postcondition).
That is, we can identify entities that will be the same at both ends of the WPS. This enables us to express,
e.g., that a distance measure outputs the distance between the two points given as the input – which is
something we cannot express in DL. While we can express this also in first-order logics, Datalog has a huge
advantage on the computational side, as already stated. In that sense, Datalog is a way of trading the
expressivity of the first-order approach for computational efficiency. In our work, we found that the loss in
expressivity is not a major problem; as pointed out above, complex axiomatisations are not user-friendly
enough anyway.

For illustration, consider the following severely simplified example. A WPS distance measure could be
specified by the precondition point(p1), point(p2) and the postcondition distance(d,p1,p2,Euclidian). The
intuitive meaning is obvious, which is good since a typical user will probably understand it easily. Note that
the actual distance functionality is characterised by the keyword “Euclidian”. This is not a coincidence of
the simplified example but, as we found, actually quite a good idea, provided the “keyword” is taken from
an ontology of widespread WPS functionalities. Such an approach (previously pursued in the literature)
clearly goes beyond pure keywords, in that it incorporates the relations between the different
functionalities; e.g. all distance measures are sub-concepts of an abstract “distance measure” functionality.
Such an ontology is much more user-understandable than arithmetic equations or first-order logic; at the
same time, it provides us with some flexibility in combining and comparing different functionalities.
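
The sketch below shows how such an ontology of functionalities can drive discovery: a goal asking for any
“DistanceMeasure” matches services annotated with its sub-concepts. The mini-ontology and service names
are hypothetical.

      # Hypothetical mini-ontology of WPS functionalities (child -> parent).
      SUBCONCEPT_OF = {
          "Euclidian": "DistanceMeasure",
          "Geodesic": "DistanceMeasure",
          "DistanceMeasure": "Operation",
      }

      def is_a(concept, ancestor):
          """Walk the subsumption chain from concept up to the root."""
          while concept is not None:
              if concept == ancestor:
                  return True
              concept = SUBCONCEPT_OF.get(concept)
          return False

      # Services annotated with the functionality concept they implement.
      wps_annotations = {"DistanceWPS": "Euclidian", "RoutingWPS": "Geodesic"}

      def discover_wps(required_functionality):
          return [s for s, f in wps_annotations.items()
                  if is_a(f, required_functionality)]

      print(discover_wps("DistanceMeasure"))  # ['DistanceWPS', 'RoutingWPS']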

To sum the discussion up, arithmetic and first-order logic are too cryptic to be dealt with by typical users,
and have undesirable computational properties. In our work, we follow a Datalog-based approach, where
the annotations are simple and intuitive, reasoning is efficient, and we still have the ability to express
dependencies between WPS inputs and outputs. The price to pay is that functionalities must be
distinguished in terms of concepts from an ontology of WPS functionalities. On the other hand (see also the
above comments), such an ontology may be the only truly viable way of communicating with end users.

###
3.1.4   Decisions for ontology formalisation

###

A baseline for the registration and annotation of geodata along these lines, as well as strategies for
automating this process, has been documented in Deliverable D4.1. The following discussion covers
approaches for the formalization and matchmaking of semantic annotations. It also explains in detail the
design decisions that we have taken on this in SWING.

Requirements for WEBSERVICE specification for Geoprocessing Services

Web Processing Services (WPS) also offer a standardized interface (note, though, that the OGC
specification on WPS is still at the discussion stage); in contrast to the WFS, the processing functionality
underlying this interface differs between WPSs, which requires explicit semantic descriptions of each
processing operation offered.

We cite the example introduced in Lutz (2006) to illustrate the need for explicit semantic descriptions of
geoprocessing operations. Calculating the distance between two points on the earth’s surface is not a trivial
matter: different kinds of distance measures apply in different kinds of spaces. For example, Euclidean
distance is measured in the plane, shortest-path distance in a graph, and geodesic or Euclidean distance in
3-dimensional space (R³).

Although all these operations significantly differ in their functionality and their input requirements, they
could all be described by the same (syntactic) signature, where x1/x2 and y1/y2 each represent a pair of
coordinates:

                       distance(x1: double, y1: double, x2: double, y2: double): double

Distance services described with this signature can differ widely in the semantics of the input required, the
output produced and the functionality provided. For example, the input parameters could represent
geographic or projected coordinates (in a specific coordinate reference system), and the output could
represent the distance in kilometres, miles or degrees. Furthermore, it is impossible to judge from the
signature alone what kind of distance operation the service provides and how its result is to be interpreted.
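For illustration, the following sketch (ours) shows two implementations that both satisfy the signature above,
yet compute entirely different distances with entirely different interpretations:

import math

# Both functions satisfy the signature
#   distance(x1: double, y1: double, x2: double, y2: double): double
# yet their semantics differ completely.

def distance_euclidean(x1, y1, x2, y2):
    """Euclidean distance between two points in a projected plane,
    in the unit of the coordinate reference system (e.g. metres)."""
    return math.hypot(x2 - x1, y2 - y1)

def distance_geodesic(x1, y1, x2, y2):
    """Great-circle (haversine) distance between two points given as
    longitude/latitude in degrees, returned in kilometres."""
    lon1, lat1, lon2, lat2 = map(math.radians, (x1, y1, x2, y2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # mean earth radius in km

# Identical calls, wildly different results and interpretations:
print(distance_euclidean(0.0, 0.0, 1.0, 1.0))  # ~1.41 (CRS units)
print(distance_geodesic(0.0, 0.0, 1.0, 1.0))   # ~157.2 (kilometres)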

###

Methodology for Semantic Discovery of Processing Services

This section discusses possibilities and decisions regarding the discovery of processing services,
summarising the outcomes of Fitzner (2007). Web Processing Services (WPSs) are used as the example; the
interface specification of the WPS was released in version 1.0 by the OGC (OGC 2007). We provide a
general discussion of the possible levels of detail that the functional descriptions used for WPS discovery
can have. Based on this discussion, we derive the requirements for the functional descriptions of WPSs. In
the following, we present our logic programming-based approach to WPS discovery (see Figure 7 for an
overview).
                                    Figure 7: Overview of the discovery approach.

One crucial aspect is the computational cost of the discovery process. Current approaches either use a
formalism that cannot capture the necessary level of detail (Lemmens, Granell et al. 2006), or they are highly
expressive at the expense of costly, even undecidable, reasoning tasks during discovery (Lutz 2006). An
alternative discovery technique is needed that is expressive yet computationally tractable.

To this end, two domain ontologies are introduced that provide the formal background for the functional
descriptions of requests for services and of WPS advertisements. The knowledge used to generate these
ontologies is acquired from the ISO Spatial Schema (ISO/TC211 2003), a standard for modelling the spatial
characteristics of geographic entities. Details on the implementation of this approach within WSMX are
documented in SWING D2.4. In principle, the developed approach can be applied to describe other kinds of
Web Services that provide processing functionality; the WPS is used for illustration because of the recent
interest in the standard.

As a running example, we use overlay operations; other geospatial operations could serve equally well, but
overlay operations allow all kinds of matches to be illustrated on a single kind of operation. Overlay
operations receive multiple (at least two) layers of spatially referenced data as input and yield one layer as
output that results from spatially overlaying the input. As pointed out in (Chrisman 1997), overlay
operations have a certain similarity to joins between tables in a relational data model, in the sense that they
combine different datasets based on a common key. In the case of overlay operations, this key is the
geometry instead of some non-spatial attribute inside the attribute tables. For this reason, a useful output
layer can only be computed if all of the input layers adhere to a common spatial reference system (unless the
WPS also performs coordinate transformation). Different ways of computing the output geometries from the
input exist, namely Intersection, Union, Difference and SymmetricDifference (ISO/TC211 2003). For
example, intersecting two polygons A and B delivers a polygon that covers the geometric area where A and
B overlap. Figure 8 shows the different overlay operations on polygons.
Figure 8: The different overlay operations.
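For illustration, the four ways of computing output geometries can be reproduced on two overlapping
squares (a sketch using the shapely library, which is not part of the SWING toolset; the coordinates are
invented):

from shapely.geometry import Polygon

# Two overlapping squares, assumed to share a spatial reference system.
a = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
b = Polygon([(1, 1), (3, 1), (3, 3), (1, 3)])

# The four overlay operations named in ISO/TC211 (2003) terminology:
print(a.intersection(b).area)          # 1.0 - area where A and B overlap
print(a.union(b).area)                 # 7.0 - combined area of A and B
print(a.difference(b).area)            # 3.0 - A minus B; note: != B minus A
print(a.symmetric_difference(b).area)  # 6.0 - union minus intersection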

Discussing Possibilities for WPS Discovery

WPSs differ from data-providing Web Services in that they offer functionality instead of data.
Discovery, the process of locating Web Services that potentially fulfil a user’s needs, has been identified
as one of the most challenging problems in the area of (geospatial) Web Services (Klien, Lutz et al. 2004). In
general, the possibilities for describing this functionality for the purpose of discovery range from its
classification by keywords to a detailed description of the implemented operation or algorithm (Figure 9).




                                    Figure 9: Discovery – Range of Possibilities.

A WPS discovery that is based on a detailed description of the applied algorithm does not provide a good
solution, because a detailed algorithm description duplicates the WPS rather than advertising its
functionality. It would make the WPS itself superfluous, since all information is already given in the
functional description used for discovery. As described in SWING D2.1, "this is more like publishing the
WPS source code than like annotating it with meta-information". Another argument against a detailed
algorithm description as a WPS advertisement is that, in order to enable automated discovery that makes use
of all this detailed information, a Web Service requester must also describe the required algorithm in detail.
This is neither a realistic nor a desirable assumption, since geospatial algorithms (even those of well-known
GIS operations) can be arbitrarily complex and unknown to users. Furthermore, the reasoning mechanisms
required for comparing algorithm descriptions (if possible at all) are quite expensive, which heavily
"threatens the efficiency of service discovery and evaluation" (Kuhn 2005). Moreover, if requesters are able
to provide such a detailed description of the requested algorithm, they can implement it in an executable
programming language and no longer need a WPS.

In any case, once a WPS is discovered, it must be clear that it can really be executed by the requester. The
service may, for example, require a specific type of input (e.g. polygons, possibly adhering to some specific
feature type schema). Furthermore, WPSs often have conditions that the input must fulfil in order to be
executed successfully. For example, a WPS offering overlay calculations on polygons requires all input
polygons to adhere to a common spatial reference system.

Requirements for WPS Discovery

From our perspective, any strategy for WPS discovery needs to be based on functional descriptions that, on
the one hand, do not duplicate the advertised functionality but, on the other hand, enable discovery with high
precision and recall. In the following, the term functional description refers to a WPS advertisement as well
as to a request; hence we adopt the usual way of treating requests as "desired" Web Services.

In our opinion, functional descriptions used for WPS discovery must at least contain descriptions of four
elements, namely (a sketch combining all four follows the list):

      1. Type signatures (the input/output types). The description of the signature is a crucial part of
         functional descriptions. It ensures syntactic interoperability between requester and Web Service and
         therefore guarantees that the Web Service can execute on the provided input type and that the
         requester can accept the delivered output type.

      2. Constraints on in- and output. To ensure that the Web Service really executes on the provided input
         in the way expected, constraints need to be formulated that further narrow the possible input and
         output values. For example, a WPS offering overlay calculations on polygons requires all input
         values (besides being of type polygon) to adhere to a common coordinate reference system in order
         to be able to calculate a useful output.

      3. The operation that is performed or requested. Consider the following example: all of the different
         overlay operations on polygons have equal type signatures and equal constraints – they input and
         output polygons adhering to a common spatial reference system. It is impossible to distinguish them
         by signature and constraint descriptions alone, assuming that the constraints do not contain any
         operation description. Therefore, the functional descriptions additionally have to include some
         description of the operation or algorithm that is performed.

      4. The dependencies between input and output. Consider a WPS offering the difference operation on
         polygons as an example. When discovery is performed based only on the three descriptions pointed
         out above, it is impossible to distinguish between the calculation of Difference(A,B) and
         Difference(B,A), although their results are quite different (Figure 8). In order to ensure that the
         WPS’ input variables are instantiated with a valid permutation of the requester input, some
         description of the relation or dependency between input and output is needed. This description
         guarantees that both requester and Web Service provider agree on the way the output is computed
         from the input.
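For illustration, the four elements could be bundled into a single data structure as in the following sketch
(ours, not the SWING formalism; all names are hypothetical), here for the Difference(A,B) overlay:

from dataclasses import dataclass, field

@dataclass
class FunctionalDescription:
    """The four elements a WPS advertisement or request must cover."""
    # 1. Type signature: input and output types.
    inputs: dict = field(default_factory=dict)
    output: str = ""
    # 2. Constraints narrowing the admissible input/output values.
    constraints: list = field(default_factory=list)
    # 3. The operation, as a concept from a functionality ontology.
    operation: str = ""
    # 4. Input/output dependencies, e.g. the argument roles of Difference.
    dependencies: list = field(default_factory=list)

difference_wps = FunctionalDescription(
    inputs={"a": "Polygon", "b": "Polygon"},
    output="Polygon",
    constraints=["sameSpatialReferenceSystem(a, b)"],
    operation="overlay#Difference",
    # The output is A minus B, not B minus A: the dependency fixes the roles.
    dependencies=["result = difference(a, b)"],
)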

In addition to the requirements for the functional descriptions, we formulate requirements for the discovery
process itself. Our first requirement is that it considers all the information given in the functional
descriptions: the signature, the pre- and postconditions, the dependencies between input and output, and the
computation that is performed. Furthermore, we require requesters to provide a functional description of the
same structure as a WPS advertisement. However, we do not assume that advertisements and requests have
the same level of detail: Web Service advertisements might be very detailed compared to user requests (or
vice versa). The discovery process must allow some sort of "relaxed" matchmaking that also succeeds if the
functional descriptions to be compared adhere to different levels of detail. Such "relaxed" matches allow
simple requests such as "give me all services offering overlay calculations", even if the WPS advertisements
are much more detailed.

3.2     Discovery

The prevailing difficulty regarding discovery in the domain of Geospatial services is the usual difficulty with
keyword-based discovery: low recall, since different keywords may be used to denote the same thing, and
low precision, since the same keyword may be used to denote different things. It is precisely for this kind of
problem that the idea of semantic discovery was introduced in the first place. The idea is to describe both the
available web services and the discovery query in a logical language, and then use reasoning over that
language to match each available web service against the query; those that match are returned as the
discovery result. It is obvious that, in principle, this approach can also be applied to the Geospatial domain.
The question is: precisely what kinds of “logical languages” should one use, and what is the required
“reasoning”? We have outlined our answer to the first question above; herein we outline our answer to the
second. A further question also arises: how should semantic discovery interact with the existing OGC
discovery services, the Catalogues? We dedicate one sub-section to the Geospatial services that are to be
discovered, and one sub-section to the Geospatial services that are themselves part of the discovery
process.

###
3.2.1       Discovery based on Semantic Annotation

In current standards-based geospatial catalogues, users can formulate queries using keywords and/or spatial
filters. The keyword search is only successful if the service provider has used exactly the same symbols in
his feature type and service metadata as the requestor in the keyword query. For example, a user’s keyword
query for quarry would not result in the discovery of a service offering quarry features if the descriptions use
terms like exploitation or stone pit synonymously with quarry. Moreover, even if the requestor and the
service provider use the same keywords in the query and the service advertisement, respectively, this does
not guarantee that the intended meaning is the same. For example, a service on quarries might provide
information on the substance that is exploited at the quarry (e.g. limestone), whereas the user who is looking
for substance has the quarry’s product (e.g. a specific kind of aggregate) in mind. Both use the same term,
but the actual meaning of substance is different. Besides these problems of semantic heterogeneity, the
services’ metadata descriptions are often inadequate (mostly, only a few keywords are used) and, moreover,
the catalogues’ search interfaces often lack a detailed and distinct search on, e.g., feature names, attributes or
titles12.

The matchmaking underlying ontology-based discovery is a reasoning process with the goal of deciding
which of the available information sources match the request. Reasoning is the fundamental procedure
enabling matchmaking (Sycara, Klusch et al. 1999). The main task of the matchmaking process is to resolve
semantic heterogeneities between the request and the offer. This reasoning perspective emphasizes the need
for approaches that go beyond the mere construction of ontologies and involve their use in discovering,
evaluating, and combining geospatial information (Kuhn 2005). Semantic matchmaking mechanisms will (a)
lead to enhanced usability of heterogeneous and distributed GI sources and (b) facilitate the task of automatic
service composition.

WSMO discovery provides a conceptual model for service discovery that exploits the WSMO formal
descriptions of GOALS and WEBSERVICES. The semantic capability descriptions serve as input for the
discovery process, which compares the requestor’s description with those of the providers to figure out
which service offers are relevant for the request. This means that discovery operates on the ontological
descriptions of the capabilities of a service. In Section 3.1, we have shown how geographic feature types and
geoprocessing functionality can be semantically annotated by reference to domain ontologies. The same
strategy is applied for GOALS, i.e. users formulate their query based on the shared vocabulary that is defined
in the domain ontologies. Thus, both descriptions become machine-comparable.

Figure 10 gives an example of the ontological descriptions as they are used for the discovery process in
WSMX. The user goal on the left side is formulated with vocabulary from the GML ONTOLOGY (gml#) and
the QUARRY ONTOLOGY (domain#). The service description on the right side references the Feature Type
Ontology (fto#). A match between both is possible, as the Feature Type Ontology is in turn linked to the
GML ONTOLOGY and the QUARRY ONTOLOGY (see Section 3.1).


GOAL
  capability
    postcondition definedBy
      ?x memberOf ?C and
      ?C subConceptOf gml#Feature and
      annotate(?C, ?y) and
      ?y subConceptOf domain#Quarry.

WEBSERVICE
  capability
    postcondition definedBy
      ?features memberOf fto#exploitationsponctuals.



        Figure 10: The discovery process operates on the ontological descriptions of the capabilities of GOALS and
       WEBSERVICES. The WEBSERVICE description on the right is the result of the annotation example in Section 3.1.

Different notions of match and their respective formalizations for WSMO discovery approaches are
discussed in Keller, Lara et al. (2004), providing a theoretical basis for web service discovery. In a later
section, we discuss in more detail the language expressivity required for implementing web service
discovery in SWING.


12
   For a more detailed analysis of the types of semantic heterogeneity problems that arise in geospatial web environments, please refer
     to Deliverable 2.1.
###

3.2.2   WFS, WPS, WMS

WMS services are a special case in our context, for the following reasons. First, the output of such a
service is a map, which is hard to describe in detail in a logical language; at best, one could create a
categorisation (an ontology) of map types and use that for classification. Second, a map service
constitutes, in all reasonable usage scenarios, the user end point of a Geospatial application – displaying the
results – and is not used in the intermediate stages of information gathering and aggregation. Hence
WMS need not be discovered and composed for the core part of the information aggregation process, and
describing them semantically is less relevant than semantically describing WFS and WPS.

Regarding WFS and WPS, semantic discovery obviously has the potential to improve both precision and
recall, so the potential benefits are clear. On the other hand, semantic annotations and semantic discovery
queries must be easy to generate, and semantic matching (i.e. the actual discovery process) must be efficient
enough not to result in intolerable waiting times (where “tolerable” is bounded by human patience and will
often be on the order of seconds). As outlined above in Section 3.1, for these reasons we have decided, in our
work, to describe both WFS and WPS in the Datalog formalism (WFS only by their postcondition, WPS by a
combination of precondition and postcondition). This is a fairly light-weight approach (not using a “heavy”
logical language) that allows easy human-readable annotations and efficient matching.

Naturally, discovery queries are also described as Datalog rules; it remains only to define the matching. The
standard reasoning task in Datalog for comparing rules is query containment. In a nutshell, rule r1 is said to
be contained in rule r2 if every fact that can be derived by rule r1 can also be derived by rule r2. Focussing
on WFS, r1 would be a discovery query, and r2 would be a service description (its postcondition, describing
what kinds of data the service offers). If r1 is contained in r2, then the data desired by the user is delivered
by the service; if r1 is not contained in r2, then the user desires data that is not delivered. So the notion of
query containment coincides well with the intuitive notion of when a service is relevant and should be
matched with the discovery query. Since query containment in Datalog is computationally efficient (the
complexity is the same as that of evaluating a rule base), this leaves open only the question of how usefully
this approach can, in practice, distinguish between different services. Our experience, with all the WFS
services in our use cases, is that the approach works quite well.
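As an aside: for conjunctive (single-rule, non-recursive) queries, containment can be tested by “freezing”
one rule into a canonical database and checking whether the other rule matches it – a classical construction.
The following minimal, brute-force sketch (ours, not the SWING discovery engine; the predicate names are
invented) illustrates the idea:

# A rule is (head, body): head an atom, body a list of atoms.
# An atom is (predicate, args); strings starting with "?" are
# variables, anything else is a constant.

def contained_in(r1, r2):
    """True iff rule r1 is contained in rule r2 (conjunctive queries).
    Test: freeze r1's variables into fresh constants, then search for a
    homomorphism mapping r2's body into the frozen body such that r2's
    head is mapped onto the frozen head."""
    def freeze(term):
        return "c_" + term[1:] if term.startswith("?") else term
    head1 = (r1[0][0], tuple(freeze(t) for t in r1[0][1]))
    db = [(p, tuple(freeze(t) for t in args)) for p, args in r1[1]]

    def extend(subst, atom, fact):
        """Extend the substitution so atom matches fact, or return None."""
        if atom[0] != fact[0] or len(atom[1]) != len(fact[1]):
            return None
        s = dict(subst)
        for term, const in zip(atom[1], fact[1]):
            if term.startswith("?"):
                if s.setdefault(term, const) != const:
                    return None
            elif term != const:
                return None
        return s

    def search(subst, body):
        if not body:  # all body atoms matched; does the head match too?
            return extend(subst, r2[0], head1) is not None
        return any(search(s, body[1:]) for fact in db
                   if (s := extend(subst, body[0], fact)) is not None)

    return search({}, list(r2[1]))

# "quarries located in France" is contained in "all quarries", not vice versa:
q_all = (("result", ("?x",)), [("quarry", ("?x",))])
q_fr = (("result", ("?x",)), [("quarry", ("?x",)), ("locatedIn", ("?x", "France"))])
print(contained_in(q_fr, q_all))  # True
print(contained_in(q_all, q_fr))  # False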

For WPS, matters are somewhat more complicated. Like the services, discovery queries are pairs of
precondition and postcondition, specifying the desired functionality of the service to be discovered. There
are, of course, a number of straightforward ways to match the preconditions and postconditions of service
and query separately. But if we do that, we lose the connection between precondition and postcondition, i.e.,
we cannot take into account the dependencies between inputs and outputs of the advertised/desired service.
It is hence necessary to design a combined query, drawing from preconditions and postconditions, in such a
way that a match is found if and only if that should intuitively be the case. This is quite a tricky issue; in fact,
even figuring out when a match should intuitively occur is non-trivial. How should the shared variables be
handled? Should every variable that is shared in the query also be shared in the service? How do we map
variables onto each other in the first place? These and various other questions must be answered. The good
news is that, as we found in our work, one can answer them all, and in such a way that service descriptions
and queries themselves remain simple; the only more complicated aspect is the form of the matching, which
is internal to the discovery engine and need not worry the user. Our approach will be detailed in…

3.2.3   Integration with Catalogue Services

The functionality of a CS/W is very similar to the functionality of an SWS discovery component itself. One
might think about semantically annotating and discovering CS/Ws, so that SWS discovery can discover these
services and in turn utilize them for further discovery. While this may be a viable scenario, it is unclear, even
intuitively, how a Catalogue service should be categorised (annotated). For the above scenario to make
sense, the categorisation would have to be made based upon the set of services registered in the Catalogue;
such a categorisation is likely to be cumbersome, if not impossible. Apart from this, even the interaction of
SWS discovery with a single Catalogue poses several open questions, which must be answered before
considering generic scenarios involving arbitrary Catalogues. The contribution of our work is in this area.
The most basic question is how to connect two components that provide such closely related functionality.
Roughly speaking, one can either make a Catalogue a sub-component of SWS discovery, or make SWS
discovery a sub-component of a Catalogue. In our work, we decided to go with the latter option. This makes
more sense from a long-term research perspective, since our aim is to bring semantic functionalities into the
Geospatial toolset; SWS discovery, on the other hand, is a generic method that does not in general need
functionalities specific to the Geospatial domain. Hence using SWS discovery to enhance Geospatial
Catalogues makes more sense than using Geospatial Catalogues to enhance SWS discovery.

More questions arise regarding the technical details of combining the tools. For example, where are the
available services – the web service repository – stored? Catalogues have their proprietary database formats;
SWS discovery tools have their internal stores of semantically described services. In line with our design
decision above, we decided to keep the repository within the Geospatial Catalogue, enhancing the current
service descriptions with semantic annotations; for the discovery, SWS discovery then accesses the
repository inside the Catalogue. The Catalogue’s query interface is accordingly extended to also admit
semantic queries. The two discovery processes are combined in intelligent ways; for example, the
Catalogue’s standard discovery is used to select the subset of services satisfying the spatial constraints, and
that subset is then forwarded to SWS discovery to select those services also complying with the semantic
query.
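A minimal sketch of this two-stage pipeline (our illustration; all function and field names are hypothetical,
and the semantic match is left as a placeholder) could look as follows:

def discover(catalogue, query):
    """Two-stage discovery: spatial pre-filter in the Catalogue,
    semantic post-filter by SWS discovery (sketch; names hypothetical)."""
    # Stage 1: the Catalogue's standard OGC discovery selects the services
    # whose bounding box intersects the query's spatial constraint.
    candidates = [s for s in catalogue
                  if intersects(s["bbox"], query["bbox"])]
    # Stage 2: SWS discovery keeps those whose semantic annotation
    # matches the WSML part of the query (e.g. by query containment).
    return [s for s in candidates
            if semantic_match(s["annotation"], query["wsml"])]

def intersects(a, b):
    """Axis-aligned bounding-box intersection; boxes are
    (minx, miny, maxx, maxy) tuples."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def semantic_match(annotation, goal):
    # Placeholder for the Datalog containment test discussed above.
    return goal in annotation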

3.3     Composition and Execution

Semantic composition and execution have to deal with all the open issues regarding integration. “Semantic
composition” here refers to a process that supports the human user in adequately combining existing
Geospatial services into a new functionality; such a process is fairly advanced, whereas executing an
existing composed service is, in principle, a much easier task. We hence first focus on the latter; we then
discuss in detail how various mediation techniques can address the open issues, and close the section with a
brief discussion of automatic composition.

3.3.1    Semantic Execution

Semantic execution of composed Web services is known in the SWS community under the name
orchestration; an SWS orchestration engine accesses the services as well as their semantic descriptions. In
the Geospatial context, the composed services will fulfil functionalities such as information gathering and
aggregation. Ideally, the “execution” should provide support for all the interoperability problems that arise,
as listed in Section 2.3.4. In fact, it can be expected that SWS orchestration is much better suited for this
purpose than the standard proposal BPEL, for the following reason. BPEL, an emerging industry standard
for the specification and execution of Web service compositions, is purely syntactic: it basically provides a
programming language targeted at the composition and execution of web services, and, as in any
programming language, all the technical details must be taken care of by the human programmer. By
contrast, SWS orchestration is tailored to interact with semantic specifications. Hence enhanced support
functionalities become possible; we now explain these in more detail.

The most important functionality provided by semantic execution, in the Geospatial context, is semantic
(ontology-based) mediation between the data formats expected by the communicating web services. If two
services are annotated based on the same ontology, then further mediation is, in fact, not necessary and the
interoperability problem disappears (the question remains, of course, where the semantic annotations come
from). If two services are annotated based on different ontologies, then mappings between these two
ontologies can be used to resolve the interoperability problem. Such mappings are easier to obtain, and more
re-usable, than mappings made on the syntactic level; we detail this in the next sub-section.

Regarding retrieval, i.e., the formulation of filters for WFS queries, providing automation appears difficult.
The spatial constraints (bounding box) are fairly simple: those are just forwarded through the semantic
execution (there is no need to generate them). However, there is no way of knowing what the correct
database query (non-spatial constraint) is without understanding the functionality of the whole (intended)
web service composition. Hence the automatic generation of WFS filters comes very close to automatic
programming, which is a notoriously hard and unsolved task; see also the section on automatic composition,
Section 3.3.4, below. Expecting this kind of thing to work fully automatically in the short term is not
realistic; future projects will have to address these issues. Within our work, what we achieved is more
convenient support for the application developer (in the development environment) who creates the web
service compositions. Once the WFS are annotated semantically, the naming heterogeneity problems listed
in Section 2.3.4 largely disappear: the domain ontology can be taken as the basis for selecting the correct
filter. For non-annotated services, term matching techniques can be used to provide the mapping from the
domain ontology onto the output attributes as specified in the feature type description of the WFS service.

Another functionality of SWS orchestration that might turn out useful in the Geospatial context, in the long
term, is the ability to incorporate semantic discovery goals, rather than concrete Web services, into the
composed Web service. That is, the composed service can contain “placeholders”, which are dynamically
assigned to concrete Web services discovered at runtime. This is beneficial in situations where the best Web
service to use is not known at design time. It is also more flexible in situations where, e.g., the Web service
incorporated at design time no longer exists at the (later) time of execution. (Of course, if the filter for a
newly discovered WFS must be created, then such an approach requires the automatic generation of feature
type ontology filters from domain ontology filters, as discussed above, to be in place.)

###

3.3.2   The Role of Ontologies for Execution

Considering the execution of a workflow, the OGC Ontologies define the parameters required for requesting
geospatial web services and for handling their responses. For example, the WFS ONTOLOGY describes the
requests a Web Feature Service is able to handle. For a GetFeature request, the specification includes
parameters like the service name, service version, feature type name and filter. An ontology of the OGC
Filter Encoding Implementation Specification (OGC 2005d) is used to describe the filter parameters. The
response is characterised as containing exactly one Feature Collection, as specified in the GML ONTOLOGY.
The WFS ONTOLOGY establishes the shared terminology for WFS invocations within WSMX workflow
descriptions.
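For illustration, these are the parameters of a concrete GetFeature request (a sketch; the endpoint URL and
the feature type name are made up, but the keyword parameters follow the WFS specification):

from urllib.parse import urlencode

# The parameters covered by the WFS ONTOLOGY: service name, service
# version, feature type name and filter. The filter is an OGC Filter
# Encoding document constraining the returned features.
params = {
    "SERVICE": "WFS",
    "VERSION": "1.1.0",
    "REQUEST": "GetFeature",
    "TYPENAME": "exploitations_ponctuals",  # hypothetical feature type
    "FILTER": (
        "<ogc:Filter><ogc:PropertyIsEqualTo>"
        "<ogc:PropertyName>substance</ogc:PropertyName>"
        "<ogc:Literal>limestone</ogc:Literal>"
        "</ogc:PropertyIsEqualTo></ogc:Filter>"
    ),
}
print("http://example.org/wfs?" + urlencode(params))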

In addition to this shared knowledge about the specific service type, the structural information of the feature
types served by a specific WFS has to be defined. We show later in this book how this structural information
is formalised in FeatureType ONTOLOGIES. With FTOs available, the interaction with Web Feature Services
can be embedded within workflow descriptions.

###

3.3.3   Data Mediation

Six different classes of interoperability problems were identified in Section 2.3.4. First, structural
heterogeneity, where two services have the same input/output types, but organised in a different data
structure/format. Second, naming heterogeneity, where two services have the same input/output types, but
this is not obvious since different keywords are used. Third, measurement scale conflicts, where two data
fields match, but the underlying units (e.g. kilometres, miles) are different. Fourth, spatial reference system
conflicts, where different reference systems are used. Fifth, precision conflicts, where two data fields match,
but the provided precision differs from the needed precision. Sixth, resolution conflicts, where two services
match but provide/need the relevant data at different levels of geographical resolution (e.g. one works on
départements while the other works on communes). Of these six problems, the first and the second are
classical issues solved by SWS data mediation; the third and the fourth can be solved by an extended form of
data mediation; to resolve the fifth and the sixth, fundamentally different mediation support (changing the
structure of the composed service) would be necessary, which is a topic for long-term research. In what
follows, we discuss all these points in more detail.

Structural and Naming Heterogeneity

These are classical data interoperability problems from the general web services domain, for which the SWS
approach has been developed. As stated, the problems are resolved as a side-effect if all involved services
are annotated using the same ontology. In an open environment, it may of course happen that different
ontologies were chosen for the annotation; in that case, mappings between these ontologies must be
established. For this, a variety of existing mediation tools provides advanced support. Such tools typically
provide the user with an interface showing the two ontologies between which mappings shall be established.
During the session with the data mediator, the user – the domain expert – then creates mappings between
pairs of concepts of the two ontologies. Support is provided by statistical measurements of similarity
between concept names, as well as by the structure of the ontologies (i.e. by similarities regarding the
relations between concepts). Given these functionalities, creating the correct mappings, and hence resolving
the interoperability problems, is much easier than doing the same job without help on the syntactic side.
Further, the information entered into the data mediator is entirely and generically reusable: once entered, the
mappings can be automatically inserted whenever services based on these ontologies need to interact. Note
that this works for any services using these ontologies, not only for the pair of services involved in the
original interoperability problem. The ontology mappings can even be re-used in completely different
scenarios, e.g., when matching a discovery query formulated in one of the ontologies against Web service
descriptions formulated in another one.
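For illustration, the following toy sketch (ours; the ontology and concept names are invented) shows how
such a concept-to-concept mapping, once established, can be applied mechanically to any instance
exchanged between services annotated with the two ontologies:

# A toy concept-to-concept mapping between two ontologies, as produced
# in a data mediation session. Once stated, it can be applied whenever
# any two services annotated with these ontologies interact.
mapping = {
    "ontoA#StonePit": "ontoB#Quarry",
    "ontoA#minedSubstance": "ontoB#substance",
}

def mediate(instance, mapping):
    """Rewrite an instance from the source ontology's vocabulary into
    the target ontology's vocabulary, leaving plain values untouched."""
    return {mapping.get(k, k): (mapping.get(v, v) if isinstance(v, str) else v)
            for k, v in instance.items()}

output_of_service_1 = {"type": "ontoA#StonePit", "ontoA#minedSubstance": "limestone"}
print(mediate(output_of_service_1, mapping))
# {'type': 'ontoB#Quarry', 'ontoB#substance': 'limestone'}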

Measurement Scale Conflicts

This is an interoperability problem more specific to the Geospatial domain, where parameters of geospatial
entities are frequently measured in some unit, and frequently in different units depending on the context.
The problem can be resolved using a technique that is conceptually compliant with the SWS approach, but
that has not yet been explored and implemented. Since ontologies are based on logical languages, they are
not well suited to express the details of how to resolve measurement scale conflicts. Instead, one can leave
the actual unit transformation to Web services, and use the ontology mediator to automatically recognise
which transformation service is needed. The latter is easily accomplished by introducing an ontology of
units. The mediator then becomes more complex in that, to resolve the measurement scale conflicts, it
automatically inserts the correct unit transformation services. This is in line with some notions of mediators
discussed in the SWS community, but has yet to be implemented in the context of an SWS orchestration
engine. Of course, the unit transformation services must be provided; this is not a problem, since
implementing such a transformation is typically easy, if not trivial. It is the insertion at the right places that is
cumbersome for the human programmer, and this task can be taken care of automatically by the semantic
mediation, once and for all and in all relevant situations.

Spatial Reference System Conflicts

These can be resolved with the very same mediation support outlined above for measurement scale conflicts;
instead of unit transformation services, reference system transformation services are linked in.
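For illustration, the following sketch (ours; the unit identifiers and the conversion table are invented) shows
the principle: the conflict is recognised purely from the unit annotations, and the appropriate transformation
“service” is inserted automatically. The spatial reference system conflicts just described follow the same
pattern, with reference system transformations linked in instead of unit conversions.

# Annotations (from an ontology of units) of the producing and the
# consuming service; the conflict is detected purely ontologically.
produced_unit = "units#Kilometre"
expected_unit = "units#Mile"

# Registry of unit transformation "services" - trivial to implement,
# cumbersome to insert by hand at every relevant point of a workflow.
transformations = {
    ("units#Kilometre", "units#Mile"): lambda v: v / 1.609344,
    ("units#Mile", "units#Kilometre"): lambda v: v * 1.609344,
}

def mediate_value(value, source, target):
    """Insert the correct transformation service, if one is needed."""
    if source == target:
        return value  # no conflict, no mediation necessary
    return transformations[(source, target)](value)

# A distance of 10 km, handed to a service expecting miles:
print(mediate_value(10.0, produced_unit, expected_unit))  # ~6.21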

Precision Conflicts

This kind of conflict is, similarly to the measurement scale conflicts above, easy to detect based on
ontological (precise) information. However, it is not as easy to resolve, simply because one cannot make
a piece of data “more precise”. What one can do, in certain simple cases, is automatically make a piece of
data less precise; in particular, it is of course possible to round a numeric value to a certain required
precision. On the other hand, the adjustment might involve more complex transformations, such as from a
polygon to an approximating circle, which is not possible without detailed knowledge about the domain and
the particular application considered. Potentially worse, if the conflict requires making the data more precise,
then a “mediation” would require contacting the data provider with a revised request. Automating such a
reaction would involve fairly advanced techniques, such as automatically composing the correct data request.
This is in principle compliant with the SWS framework, but beyond the scope of our work herein.

Summing up, apart from very simple precision transformations, the best that can be provided in the short
term is a validation functionality for the human developer, indicating any precision conflicts. Note that this
may already be of significant help. Further, if the precision requirements are included in the semantic
discovery, then such conflicts can potentially be avoided altogether.

Resolution Conflicts

This final kind of conflict is of a fairly advanced sort. A resolution conflict basically means that the data
provided by two services do not match; to make them match, the different geographic areas must be
coordinated. Even in the simplest case, where one service, A, provides data for a finer partitioning than the
other service, B, this requires a complicated “mediation” functionality: for every area C addressed by B, one
needs a loop around A accumulating the data for all areas D that are contained in C. This amounts to a major
modification of the composed web service; the precise form of how to arrange the loop, and how to
“accumulate” the data, depends on the context. It is not realistic to expect such complex compositions to
work fully automatically in the immediate future. A realistic expectation is to be able to provide a validation
functionality for the human developer, indicating the conflict, and perhaps making a suggestion for resolving
it.
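To illustrate why this is a structural change rather than a simple value transformation, the following sketch
(ours; the data and the containment relation are invented) shows the accumulation loop that such a
mediation would have to build around service A:

# Service A provides values per commune; service B works on the coarser
# level of departements. "Mediating" between them requires restructuring
# the workflow: a loop over A's areas, grouped by containment in B's.
commune_values = {"Orleans": 12.0, "Montargis": 5.0, "Tours": 8.0}
contained_in = {"Orleans": "Loiret", "Montargis": "Loiret",
                "Tours": "Indre-et-Loire"}

def aggregate(values, containment):
    """Accumulate fine-grained values per coarse area. How to
    'accumulate' (sum, mean, max, ...) depends on the context."""
    result = {}
    for area, value in values.items():
        coarse = containment[area]
        result[coarse] = result.get(coarse, 0.0) + value
    return result

print(aggregate(commune_values, contained_in))
# {'Loiret': 17.0, 'Indre-et-Loire': 8.0}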

3.3.4   Automatic Composition

The ultimate goal of SWS composition is to fully automatically create Web service compositions that satisfy
complex user requirements. This is a very hard task, both in theory and in practice. First tools have been
developed, but those support only fairly restricted languages, both regarding the available Web services (the
allowed building blocks) and regarding the forms of the generated solutions. In the Geospatial domain, and
in particular in our use cases, the necessary service compositions involve loops as well as arithmetic
operations (and WFS filters). Composing such constructs automatically amounts to automatic programming,
which is a notoriously hard and unsolved task. In almost all of the existing composition tools, the composed
solutions are simple parallel workflows (no loops and no branches), and no arithmetic is involved.

An option one may think about in the shorter term is to solve abstract composition problems and use their
solutions to guide the human Web service composer. In an abstraction, i.e., in a simplified version of the
composition task, the available composition techniques may be sufficient to obtain a solution. While this
solution may not work in reality, it may contain useful information – e.g., the Web services it employs.

An option one may think about in the longer term is to alleviate the “automatic programming” difficulty by
exploiting the typical forms of Geospatial Web service compositions. In particular, such compositions will
typically follow a [collect-data, aggregate-data] pattern. It should be possible to fill in these patterns
automatically, based only on user input as to what kind of data should be collected, and how it should be
aggregated. ...
4   SWING Architecture Overview – Joerg Hoffmann, LFUI / Arne J. Berre, SINTEF
In this part of the book, we present the SWING results on the application of SWS technologies to the domain
of Geospatial Services, realising much of the potential benefit outlined previously. We start with a high-level
overview of the SWING architecture, showing its main components and their interactions. The later chapters
will discuss the functionality of the different components in detail.

The SWING architecture comprises six components. These are briefly explained in the following; we also
introduce abbreviated names for the components, which we will use throughout the book.

    •   MiMS. MiMS is the environment for the domain expert, i.e. for the expert in the geospatial domain
        who wants to make use of resources on the web. MiMS provides the expert with convenient access
        to the functionalities offered by the other components.

    •   WSMX. WSMO, the Web Service Modelling Ontology, is a generic framework for modelling the
        entities involved in handling semantically described web services. This includes logic languages for
        describing ontologies (bundled in WSML, the Web Service Modelling Language), as well as
        reasoning methods and their use to provide complex functionalities such as semantic discovery of
        web services. WSMX is a reference implementation of WSMO. From the WSMX perspective,
        SWING is an application of generic semantic web techniques. The application is non-trivial since
        many of the generic SWS functionalities must be adapted and/or extended to suit the Geospatial
        domain.

    •   ONTO. An ontology is a formal vocabulary describing a domain of interest. Types of entities, and
        their possible relations, are described in a logic language. The formal semantics of the language
        enable computer support for complex tasks. In SWING, the ontologies provide the shared
        vocabulary used to describe geospatial entities and services. Every component accesses the
        ontologies.

    •   ANNOT. One main obstacle in the successful realization of semantic web techniques, in particular in
        the geospatial domain, is semantic annotation, i.e., the need to describe all the involved entities in a
        logical language. Such languages are typically cryptic and understandable only to a logics expert, not
        the domain expert who should use them. Hence support for semantic annotation is required. In
        ANNOT, such support is provided for the geospatial domain, based on text and data mining
        methods.

    •   CAT. As stated, the current solution to service discovery in the geospatial domain is provided by
        so-called Catalogues, which offer keyword-based discovery. In SWING, a catalogue was developed
        that incorporates functionalities of WSMX to support semantic discovery instead, hence improving
        the quality of the results.

    •   DEV. As stated, one important requirement in the geospatial domain is the ability to combine the
        functionalities of different available services, e.g., one service providing data and another service
        computing a statistic over that data. Doing such a composition fully automatically is a highly
        sophisticated task and cannot realistically be expected to work within the foreseeable future. Rather,
        a human expert should be provided with convenient support for doing the composition. This includes
        sophisticated functionality for service discovery and annotation, as well as graphical interfaces for
        designing the composition as, e.g., UML diagrams. DEV provides all these functionalities in an
        environment targeted at the engineer of geospatial web services.

Figure 11 provides an overview of how these components interact in the SWING architecture. The figure
shows the main actors of the SWING environment, namely MiMS, DEV, CAT, WSMX, and ANNOT, as
explained above. The ontologies, ONTO, are accessible for every component, so they simply form the
background of the picture. Further, the picture shows three additional components that have been established
at crucial interaction points of the environment: Query Annot, Service Annot, and WFS Wrapper.
[Figure 11 is an architecture diagram; its boxes and arrows read as follows. The ontologies (ONTO) form
the background of the figure and are accessible to every component.]

MiMS (environment for the domain expert): use known services, or discover with CAT; use Query Annot to
annotate an OGC query with WSML; if the discovery result is empty, contact DEV with a natural-language
composition request; execute composed services with WSMX.

DEV (environment for the developer): compose services as UML diagrams; discover using Query Annot to
annotate an OGC query with WSML; translate the UML into WSML-ASM; use Service Annot to annotate
and register.

Query Annot: interacts with the user to annotate a discovery query with WSML.

Service Annot: interacts with the user to annotate an OGC service with WSML.

CAT: stores the service repository, including semantic information; performs spatial (OGC) discovery and
forwards the result to WSMX for semantic filtering.

WSMX: discovers services based on WSMO queries, selecting from the service base forwarded by CAT;
stores and executes WSML-ASM composed services; its WFS Wrapper maps OGC parameters to WSML
parameters.

ANNOT: interacts with the user to select concepts and relations; for Query Annot, takes keywords and
returns a WSML query; for Service Annot, takes the OGC description of a service and returns a WSML
annotation.

[The arrows connect these components via Discover (OGC+WSMO), Annotate, Register Annotated
Service, Discover (WSMO)/Discovery Result, and Execute (Composed) Service interactions.]
                                                Figure 11. Overview of SWING Architecture.

Let us consider Figure 11 in a little more detail, starting at the user environments. The domain expert using
MiMS mainly wants to formulate what sort of service she desires, and then pose these queries to a discovery
interface. Formulating the queries is done in interaction with Query Annot, which guides the user through
possible OGC queries and WSMO annotations. The latter are created via an interaction with ANNOT:
starting from keywords entered by the user, ANNOT helps the user to select, from the domain ontology, the
appropriate concepts and their connections; from that selection, the WSML query is automatically generated.
Once the user submits the combined OGC+WSML query, Query Annot forwards the query to CAT, which
interacts with WSMX to obtain the discovery result. The discovery result is forwarded back to the user
through Query Annot. If the discovery result is empty, the MiMS user contacts the developer with a natural-
language request to compose a suitable web service in DEV.

The developer in DEV needs to discover the services that will form part of the composition. This is done
using the same functionality as explained above. The discovered services are composed as a UML diagram
in DEV. Once the composition is completed, it is automatically translated into an abstract state machine
(ASM). The developer uses Service Annot to annotate the service; this is done in a way very similar to the
annotation done by Query Annot, except that the keywords can (but do not necessarily have to) be entered in
the form of a standard OGC service description, such as the feature type description of a WFS service.
Service Annot then registers the annotated service in both CAT and WSMX. The domain expert in MiMS
can now access the composed service just like any other service. To execute the composed service, however,
an additional interface tool is needed to convert the parameter formats: the composed service is exposed in
MiMS as a standard OGC service, the OGC parameter values are transformed by the WFS Wrapper into the
ontological format needed internally by WSMX to execute the ASM, and the result of the ASM execution is
translated by the WFS Wrapper back into the OGC format needed by MiMS.

Service Annot can also be used, from both MiMS and DEV, to semantically annotate legacy services. The
underlying ANNOT tool helps, as stated, to map keywords to ontology concepts and relations. The mappings
are generated based on machine learning techniques such as text mining, with some aspects tailored to the
particular properties and requirements of the Geospatial domain. Since the process is bound to yield several
alternatives – and probably not only good ones – the overall annotation process is semi-automatic,
interacting with the user to make sure that the chosen annotation makes sense.

The interaction between CAT and WSMX proceeds as follows. CAT provides an interface (to Query Annot)
for queries in both OGC and WSMO format. The services are stored inside CAT, where they are accessed
both by CAT-internal and by WSMX functions; the services in CAT’s repository are semantically annotated.
The discovery itself is performed by sophisticated combinations of CAT’s and WSMX’s capabilities, e.g.,
using CAT’s methods to match the bounding boxes and polygons of the query and the provided services,
and WSMX’s methods to match the semantics of the data content as desired by the query and provided by
the services.

The following chapters will fill in the formal and technical details of this framework. We start with a detailed
discussion of the ontologies developed for SWING. We then in turn discuss how services are annotated,
discovered, composed, and executed. We demonstrate the overall process with the SWING use cases, and
conclude.
5       Ontological Backbone – Sven Schade, UOM
Domain ontologies support several tasks in semantically enhanced applications, such as semantically
annotating service capabilities and contents, formulating goals for discovery, and specifying workflows for
execution. This chapter documents the ontology engineering approach taken for the SWING ontologies,
along with strategies for knowledge acquisition, evaluation and maintenance. We also discuss interesting
ontological problems and specific design decisions. Finally, the potential of ontologies to resolve semantic
heterogeneities in application workflows is illustrated with examples from the SWING use cases.

5.1      Ontologies in SWING

The objective of the SWING project is to provide an open and easy-to-use SWS framework of suitable
ontologies and inference tools for the annotation, discovery, composition and execution of geospatial web
services. On the SWS side, we work with WSMX, a reference implementation of the SWS framework
WSMO. On the geospatial web service side, we work with implementations of the OGC specifications. We
start this chapter with a short introduction to WSMO and the ontology language WSML, to the extent needed
for understanding the subsequent sections. This is followed by a brief overview of Geospatial Web Services.
For further information on the technologies, please refer to Deliverable 2.1.

The remainder of the chapter is organised along the lines of analysing the requirements for the annotation
(Section 3.1), discovery (Section 3.2.1), and execution (Section 3.3.2) of geospatial web services in WSMX.

5.1.1     Methodology for Ontology Engineering and Representation Language

#...#

5.1.2     Ontology Development Process

This section provides a general overview of the ontology development process and relates the deliverables
of Work Package 3 (WP3) to the appropriate phases of the process.




 Figure 12: States and activities of the ontology development process and how the deliverables of WP 3 relate to them
                            (modified from Fernández-López, Gómez-Pérez et al. (1997)).

Developing ontologies from scratch involves several phases, which are depicted in Figure 12. We follow a
methodology introduced under the name METHONTOLOGY (Fernández-López, Gómez-Pérez et al.
2004; Fernández-López, Gómez-Pérez et al. 1997). METHONTOLOGY has been successfully applied in
other projects, e.g. in On-To-Knowledge (Staab, Schnurr et al. 2001). We will not strictly stick to
METHONTOLOGY, but also incorporate ideas from other methodologies into the process where
appropriate. For example, part of the knowledge acquisition strategy in SWING relies on formulating
competency questions based on usage scenarios, as introduced by Grüninger and Fox (1995) and Uschold
and Grüninger (1996).

According to METHONTOLOGY, the ontology development process can be divided into five phases.
Specification includes the identification of the intended use, the scope, and the required expressiveness of
the underlying representation language. In the next phase, conceptualization, the knowledge of the domain
of interest is structured. During formalization, the conceptual model, i.e. the result of the conceptualization
phase, is transformed into a formal model. The ontology is implemented in the next phase (implementation).
Finally, maintenance involves regular updates to correct or enhance the ontologies. This chapter focuses on
two activities involved in this process: knowledge acquisition and evaluation. More details can be found in
Fernández-López, Gómez-Pérez et al. (1997).

5.2     Knowledge Acquisition Strategy

Since we are using OGC services in the SWING project, we need a generic WFS Ontology that captures the
service implementation rules for WFS as specified in the OGC WFS Implementation Specification (OGC
2005b), a generic FilterEncoding Ontology that captures the Filter Encoding Implementation Specification
(OGC 2005a), and a generic GML Ontology that captures the encoding rules for features as specified in the
OGC GML Encoding Specification (OGC 2002). The OGC ontologies are imported and referenced within
the concept definitions of WFS WebService and FTO.

The identification of the ontology types needed for the SWING application (Section 5.2.1) has led to a
distinction between domain ontologies that capture the semantics of real-world geospatial entities and
ontologies that capture the processing and encoding rules for geographic information. Defining the scope of
the latter ontologies is relatively simple, as they are based on existing models for implementing geographic
information, i.e. the specifications of ISO and the OGC. The versions of the OGC specifications used as the
basis for the ontology engineering process are determined by the actual implementations in the SWING
project. The necessity of upgrading to updated versions of the underlying specifications will be reviewed
continually in the course of the project.

Defining the extent of the thematic domain ontology is much more difficult, since ontologies on the domain
level claim to comprise the basic concepts of a common conceptualization. Great care must be taken to
define the concepts and relations at an appropriate level of expressiveness. The terms have to be general
enough to allow the annotation of all information sources, but specific enough to make meaningful
definitions possible (Schuster and Stuckenschmidt 2001). In consequence, the domain ontologies need to be
defined within a certain context and for a well-known user community, i.e. we have to come up with
adequate and manageable subsets of the geospatial domain. Not all of these domain ontologies have to be on
the same level of abstraction. Also, it is possible for ontologies on a more specific level to include concepts
from ontologies on a more abstract level.

Methods for identifying the scope based on motivating scenarios and informal competency questions are
part of the activities defined in the Knowledge Acquisition Strategy (see Section 5.2.2). Please note that we
document the results of the KA Workshop (6/7 July 2006 in Paris) regarding the scope identification before
we introduce the generic KA strategy in Section 5.2.2. We follow this structure because the results of the
specification phase are needed before the conceptualization phase can be started. The results of the KA
Workshop belonging to the specification phase are documented here; the results belonging to the
conceptualization phase are documented in #...#.

5.2.1    Types of SWING Ontologies

Based on the usage scenarios analysed in the previous sections, two core sets of geospatial domain
ontologies have been identified to play a role in SWING. The domain ontologies have to account for the
community's domain of interest as well as for the application-specific geodata encoding and processing
rules.
Table 1: Types of ontologies that are of interest in the SWING application.

UNIVERSE OF DISCOURSE

  Ontology on Quarrying (central), continuously extended along upcoming
  use cases
      Knowledge sources: 1st KA Workshop with domain experts (July 2006);
      2nd KA Workshop with domain experts (planned for 2007); external
      sources (thesauri, encyclopaedias, classification standards, etc.)

GEODATA ENCODING & GEOPROCESSING

  General Feature Model Ontology
      ISO/TC211: 19109 Geographic information - Rules for application schema.
  Spatial Schema Ontology
      ISO/TC211: 19107 Geographic information - Spatial Schema.
  Temporal Schema Ontology
      ISO/TC211: 19108 Geographic information - Temporal Schema.
  Geography Mark-up Language Ontology
      OGC: Geography Markup Language (GML) Implementation Specification,
      Open Geospatial Consortium Inc.

OGC SERVICES

  Web Service Common Ontology
      OGC: OpenGIS® Web Service Common Implementation Specification,
      Open Geospatial Consortium Inc.
  Web Mapping Service Ontology
      OGC: OpenGIS® Web Map Service (WMS) Implementation Specification,
      Open Geospatial Consortium Inc.
  Web Feature Service Ontology
      OGC: OpenGIS® Web Feature Service Implementation Specification,
      Open Geospatial Consortium Inc.
  Web Coverage Service Ontology
      OGC: OpenGIS® Web Coverage Service (WCS) Implementation Specification,
      Open Geospatial Consortium Inc.
  Filter Encoding Ontology
      OGC: OpenGIS® Filter Encoding Implementation Specification,
      Open Geospatial Consortium Inc.
  Web Processing Service Ontology
      OGC: OpenGIS® Web Processing Service. Discussion paper,
      Open Geospatial Consortium Inc.


Table 1 provides a list of ontologies that should be available in the SWING environment by the end of the
project. The scope of each ontology is restricted by the requirements of SWING Use Cases I-III. Section
#...# defines the scope of the domain ontologies for the first use case, “Creating a
production-consumption map”. Section 5.2.2 further examines the sources that are available for extracting
the knowledge needed to build the ontologies.

5.2.2   Knowledge Acquisition in SWING

The strategy for Knowledge Acquisition (KA) is meant to obtain the required knowledge from available
resources and to structure it. We have defined the scope of the SWING ontologies with the help of competency
questions during the specification phase (see Section #...#). Defining
competency questions is already part of the KA activity and is introduced in more detail in Section 5.2.4.
Detailed knowledge on the domain of interest (within the scope of the competency questions) is acquired
during the conceptualization phase. Subsequent formalization and implementation may be achieved semi-
automatically based on the results of the conceptualization. During formalization and
implementation, it might turn out that additional knowledge needs to be acquired. This leads to iterations on
the conceptualization phase. The conceptualization, formalization and implementation phases are tightly
interwoven in the development process, and in most cases it will not be possible to strictly separate
them. Once a consistent and stable version of the domain ontology has been reached (a strategy for
evaluating this will be part of the work in D3.2), the KA activities will be minimal during the maintenance
phase.

The SWING knowledge acquisition strategy aims at considering as many relevant resources as possible.
Available knowledge sources include the following:

       •    ISO and OGC standards,

       •    websites,

       •    thesauri,

       •    existing ontologies, and

       •    domain experts.

In the remainder of this section, we give a detailed account of the KA strategy that has been specifically
developed for the SWING project. The strategies target ISO and OGC standards (Section 5.2.3), external (or
third-party) sources (Section 5.2.5), and domain experts (Section 5.2.4). The application of these strategies is
illustrated with respect to SWING Use Case I. An overview of the relation between required ontologies and
available knowledge sources is provided in Table 1 (Section 5.2.1).

###

5.2.3       Knowledge Acquisition from ISO and OGC Specifications

A comprehensive framework of standards is available for the realms of geospatial data modelling and
geospatial web services. The main bodies developing these standards are the Open Geospatial Consortium
(OGC)13 and the Technical Committee 211 (TC211)14 of the International Organisation for Standardisation
(ISO)15. Both have worked in close cooperation since 1994. The geospatial web services developed for the
SWING project rely on OGC standards; the domain ontologies that capture the knowledge about geospatial
data encoding and service specifications can therefore be directly derived from them. The activity of knowledge
acquisition becomes almost redundant, as the knowledge has already been gathered and formalised to some
degree during the standardisation process. Once the scope has been defined during the specification phase,
we can proceed directly to the implementation phase.

In the following, we introduce in more detail those specifications that have been selected as the basis for
deriving the SWING domain ontologies (Section #...#). The specifications apply UML for conceptual
modelling. In Section #...#, we give an overview of methods for automatically translating the semi-formal
representations in UML into WSML. We argue for revising and extending the resulting artefacts based on
the documentation of the underlying UML diagrams.

Available Knowledge Sources

Conceptual knowledge is provided by the OGC Abstract Specifications and by parts of the ISO 19100 series
of standards developed by ISO TC 211. Specific knowledge about information encoding and concrete
service interfaces is defined by the OGC Implementation Specifications and other parts of the ISO 19100
series. The ISO/OGC specifications relevant for the domains of geospatial data modelling and geospatial
web services are those listed in Table 1 (Section 5.2.1).

Those standards directly required to support the SWING Use Cases will be available as WSML ontologies
by the end of the project. First versions of the GML, WFS and Filter Encoding ontologies have been generated
to test the methods described in Section #...# and to provide test ontologies for the implementation of the
first use case in SWING. In particular, we used GML in version 2.1.2, the WFS in version 1.1 and the Filter
Encoding in version 1.1.


13
     The official web site is available from http://www.opengeospatial.org/.
14
     The official web site is available from http://www.isotc211.org/.
15
     The official web site is available from http://www.iso.org/iso/en/ISOOnline.frontpage.
###

Geographic Data Types Ontology

We propose a domain ontology of geographic data types to enable automated reasoning on the input and
output types of WPS. The relevant standards for geographic data types have already been identified in SWING
D3.1 (Table 1 of D3.1). For illustration purposes, we use a small subset of the ISO Spatial Schema
(ISO/TC211 2003) as the basis for the GEOGRAPHIC DATA TYPES ONTOLOGY. An extract is shown in (1).

                   gm_object(A) :- polygon(A).                                                            (1)
                   gm_object(A) :- hasSRS(A, S).
                   SRS(A) :- projSRS(A).
                   projSRS(gk).
The geographic data types are represented as unary predicates, and the sub-type relationships are formalised
as LP implications between these predicates. For example, the first line of (1) defines a polygon as a subtype
of a generic gm_object. Representing types as predicates instead of keywords allows for a more flexible
discovery process, since e.g. subtype relationships can be inferred automatically by the LP reasoning engine.
Attributes of the geographic data types, such as the spatial reference system of a geometric object, are
represented as binary predicates. This is exemplarily shown in the second line of (1). Furthermore, the
ontology contains instance specifications, such as the projected spatial reference system Gauß-Krüger. This is
formalised in the last line of (1).
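
To illustrate the effect of such LP rules, the following Python sketch (purely illustrative; the SWING project
itself uses WSML ontologies and an LP reasoning engine) performs naive forward chaining over the rules of
(1). The instance 'p1' is a hypothetical polygon added for demonstration.

    # Naive forward chaining over the unary subtype rules of (1).
    # Each rule "head(A) :- body(A)." is encoded as a (head, body) pair.
    rules = [
        ("gm_object", "polygon"),  # gm_object(A) :- polygon(A).
        ("SRS", "projSRS"),        # SRS(A) :- projSRS(A).
    ]

    # Known facts as (predicate, instance) pairs; 'p1' is a hypothetical polygon.
    facts = {("projSRS", "gk"), ("polygon", "p1")}

    changed = True
    while changed:  # repeat until a fixpoint is reached
        changed = False
        for head, body in rules:
            for pred, inst in list(facts):
                if pred == body and (head, inst) not in facts:
                    facts.add((head, inst))
                    changed = True

    print(("gm_object", "p1") in facts)  # True: inferred via the subtype rule
    print(("SRS", "gk") in facts)        # True: Gauss-Krueger is inferred to be an SRS

This is exactly the kind of inference that makes predicate-based type representations more flexible than
plain keywords during discovery.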

Geospatial Operations Ontology

The adopted standards for geographic data types can serve as a basis for the development of a
GEOGRAPHIC OPERATIONS ONTOLOGY and therefore for the annotation of type signatures. But, as
mentioned in (Lutz 2006), they provide no information about the functionality or behaviour of geospatial
operations. For this reason, and since the functional descriptions should not describe the numerical
calculations in detail, an abstract conceptualisation of operations is required. Since this section focuses on
discovery (and not on a comprehensive ontology for spatial operations), we only provide a lightweight
conceptualisation of operations. In the simplest case, this ontology of geospatial operations boils down to a
collection (or dictionary) of agreed-upon "keywords" denoting the different operations. We again chose the
ISO Spatial Schema as the agreed-upon basis for this vocabulary.

In the following, we illustrate how the running example of the overlay operation can be formalised. In the
case of overlay, the vocabulary contains the keywords "intersection", "union", "difference" and
"symmetricDifference". Additionally, we formalise relationships between the "keywords" denoting the
different operations to make use of the automated inference mechanisms of LP engines. For example, stating
that every union operation is an overlay operation enables Web Service requesters to discover WPS offering
the union operation when searching for overlay.

We formalise the overlay operations on polygons in our GEOSPATIAL OPERATIONS ONTOLOGY as
ternary predicates, shown in (2). The first two variables refer to the operation's inputs; the third argument
represents the operation's output. For example, the statement difference(A,B,C) denotes that C is the output of
the difference operation on A and B. Moreover, we define explicitly that each of the different overlay
operations is indeed an overlay operation. The formalisation of operations additionally contains type
constraints on the input and output variables. For brevity, we omit this part of the ontology.

                                     overlay(A,B,C) :- union(A,B,C).                                      (2)

                                     overlay(A,B,C) :- intersection(A,B,C).

                                     overlay(A,B,C) :- difference(A,B,C).

                                     overlay(A,B,C) :- symmetricDifference(A,B,C).
Some overlay operations compute the same output when executed with different permutations of the input.
As can be seen in Figure 4, the geometric output of symmetricDifference(A,B) is equivalent to the geometric
output of symmetricDifference(B,A). We formalise this as an LP implication between the predicates
representing the operations:

                            symmetricDifference(A,B,C) :- symmetricDifference(B,A,C).                   (3)

Furthermore, since the different overlay operations closely mirror set-theoretic relationships, it is
possible to naturally formulate dependencies between them. Some overlay operations can be computed via
other overlay operations. For example, as can be seen in Figure 4, a symmetricDifference operation can be
computed via difference and union operations. These dependencies between operations are formalised as LP
rules:

            symmetricDifference(A,B,C) :- difference(A,B,X) ∧ difference(B,A,Y) ∧ union(X,Y,C).            (4)
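
The practical value of rules (2) and (3) for discovery can be sketched as follows in Python. The service
names are invented, and the one-step subsumption check merely mimics what an LP engine derives from the
rules; it is not the SWING discovery component.

    # Rule (2) as a subsumption set: every specific overlay operation is an overlay.
    OVERLAY_OPERATIONS = {"union", "intersection", "difference", "symmetricDifference"}

    # Hypothetical service advertisements: service name -> advertised operation.
    services = {"WPS-A": "union", "WPS-B": "symmetricDifference", "WPS-C": "buffer"}

    def matches(requested, advertised):
        """True if the advertised operation satisfies the requested one."""
        if requested == advertised:
            return True
        return requested == "overlay" and advertised in OVERLAY_OPERATIONS

    print([s for s, op in services.items() if matches("overlay", op)])
    # ['WPS-A', 'WPS-B'] -- both are found although neither advertises "overlay"

    # Rule (3) states that the input order of symmetricDifference is irrelevant,
    # so concrete requests can be normalised before matching:
    def normalise(op, inputs):
        return (op, tuple(sorted(inputs))) if op == "symmetricDifference" else (op, inputs)

    print(normalise("symmetricDifference", ("B", "A")))  # ('symmetricDifference', ('A', 'B'))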

###

A Method for Semi-Automated Translation of OGC Standards to WSML

The conceptual models for geospatial information are specified using UML (in version 1.3) (ISO/IEC 2003;
Rumbaugh, Booch et al. 1999) as the conceptual schema language (ISO/TC211 2003a). See Figure 13 for a
small example. The semantics of the specific classes and of the relations between classes are defined in the
natural language documentation.




                                 Figure 13: Small subset of ISO 19107 – Spatial Schema in UML.

Currently, no direct translation from UML to WSML is available, but a translation between UML and
OWL (W3C 2004) is implemented16. The Web Ontology Language (OWL) is another family of languages for
representing ontologies on the Semantic Web, based on the Resource Description Framework (RDF) (W3C
2004). OWL and RDF files in turn can be imported into WSMX.

Instead of defining a UML to WSML translation tool from scratch, we decided to evaluate existing tools
regarding automated ontology generation for the ISO and OGC standards. Three resources are taken into
account:

       1. the officially documented UML diagrams of the standards17,

       2. a UML to OWL translation tool18 (actually there are many, but only one has found wide application), and

       3. the OWL and RDF import options of WSMX.



16
     A related protégé plug-in is available from http://protege.cim3.net/cgi-bin/wiki.pl?UMLBackend.
17
     Available from the “Models” section of http://www.isotc211.org/.
18
     See footnote 4.
Using this tool set, the UML class diagram presented in Figure 13 is first translated into OWL (Figure 14).
Subsequently, the OWL code is translated into WSML (Figure 15).




         Figure 14: OWL translation of the class diagram presented in Figure 13. We use OWL abstract syntax for the
                                                      representation.




                             Figure 15: WSML translation of the class diagram presented in Figure 13.

For some of the specifications (namely the ISO Spatial Schema19, the Temporal Schema20 and GML21),
representations in OWL are available on the internet. In these cases, the translation step from UML to OWL
can be skipped and we can proceed directly with the OWL to WSML translation. The method of importing
existing OWL or RDF ontologies into WSMX is a good start for building WSML ontologies. However, the
method needs to be performed carefully. Before third-party OWL ontologies are imported, their content has
to be evaluated. A strategy for ontology evaluation will be part of D3.2.

Using the approach as illustrated above, taxonomies (like the transitive sub-class relationships between
GM_Object, GM_Primitive, GM_OrientablePrimitive, and GM_OrientableCurve as specified in the Spatial
Schema) are translated into sub-concept relationships in WSML. Additionally, the example illustrates how
non-taxonomic relationships and their cardinalities can be mapped from UML to WSML. However, no
matter what kind of translation strategy is chosen, the automatic translation of UML diagrams into any
formalism cannot include the hidden knowledge from the natural language documentation. Therefore, the
result of the automatic translation from UML diagrams into WSML will only serve as a skeleton.
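
The skeleton character of the translation can be illustrated with a small, hypothetical Python sketch that
emits WSML concept declarations from a hand-coded fragment of the Spatial Schema taxonomy. The real
tool chain works on the UML models via OWL; this sketch only mimics its output.

    # Hand-coded fragment of the ISO 19107 taxonomy (sub-class -> super-class).
    # In the real tool chain, this structure is read from the UML model.
    taxonomy = {
        "GM_Primitive": "GM_Object",
        "GM_OrientablePrimitive": "GM_Primitive",
        "GM_OrientableCurve": "GM_OrientablePrimitive",
    }

    def to_wsml_skeleton(taxonomy):
        """Emit one WSML subConceptOf declaration per taxonomic link.
        Axioms from the natural language documentation must be added manually."""
        lines = ["concept GM_Object"]
        lines += [f"concept {sub} subConceptOf {sup}" for sub, sup in taxonomy.items()]
        return "\n".join(lines)

    print(to_wsml_skeleton(taxonomy))
    # concept GM_Object
    # concept GM_Primitive subConceptOf GM_Object
    # concept GM_OrientablePrimitive subConceptOf GM_Primitive
    # concept GM_OrientableCurve subConceptOf GM_OrientablePrimitive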

After the UML diagrams have been translated into WSML, the documentation of the standards has to be studied
manually in order to derive further information about the discovered concepts and their relations. This
knowledge is captured in axioms that extend the ontology. The expressivity available for formulating
constraints in axioms depends on the logic variant chosen for the formalization. For example, according to
(ISO/TC211 2003b), the inherent dimension of a geometric object (the root class is called GM_Object) has to
be less than or equal to the coordinate dimension. This fact is added as an axiom to the ontology (Figure 16). The
axiom defines the rule that instances of GM_Object (and of any sub-class of GM_Object) have to fulfil.




19
     OWL ontology for Spatial Schema is available from http://loki.cae.drexel.edu/~wbs/ontology/iso-19107.htm.
20
     OWL ontology for Temporal Schema is available from http://loki.cae.drexel.edu/~wbs/ontology/iso-19108.htm.
21
     OWL ontology for GML is available from http://loki.cae.drexel.edu/~wbs/ontology/ogc-gml.htm.
                            Figure 16: Specifying the dimension of a geometric object.
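
Since the figure is not reproduced here, the following Python sketch expresses the content of the axiom
procedurally: every instance (the records below are invented) is checked for its inherent dimension being
less than or equal to its coordinate dimension, which is what the WSML axiom enforces declaratively.

    # Hypothetical GM_Object instances with inherent and coordinate dimensions.
    instances = [
        {"id": "poly1", "dimension": 2, "coordinateDimension": 3},  # a polygon in 3D space
        {"id": "bad1", "dimension": 3, "coordinateDimension": 2},   # violates the axiom
    ]

    def violates_dimension_axiom(obj):
        """ISO 19107: the inherent dimension of a geometric object must be
        less than or equal to the coordinate dimension."""
        return obj["dimension"] > obj["coordinateDimension"]

    for obj in instances:
        if violates_dimension_axiom(obj):
            print("constraint violation:", obj["id"])  # reports 'bad1'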

5.2.4   Knowledge Acquisition with Domain Experts

In many domains, existing standards and external resources aid knowledge formalization and may even be
used for automated ontology generation. However, domain experts, i.e. humans familiar with the domain of
interest, are knowledge sources, too. In specific areas of expertise, domain experts are the main source of
knowledge. This is, for example, the case in the domain of aggregate quarrying and aggregate consumption.
We use the roles of domain expert and ontology engineer for the persons involved in the knowledge acquisition
process. Ontology engineers are specialised in using logic for knowledge structuring but have no expertise
regarding the domain-specific knowledge.

In this subsection, we define the strategy for knowledge acquisition with domain experts for the SWING
project, as we have already done for the ISO and OGC Specifications in 5.2.3 and for External Sources in 5.2.5. A
successful interplay between domain experts and ontology engineers needs some strategic planning
beforehand. Figure 17 shows how the elements of the proposed knowledge acquisition strategy can be
structured along the different phases of the Ontology Engineering Process (as introduced in Section #...#).




                Figure 17: Overview of the strategy for Knowledge Acquisition with domain experts.

In the following, we introduce the different activities for each phase in more detail, i.e. the preparation,
specification, conceptualization, formalization, and implementation phases. This strategy is applied for the
interplay between domain experts and ontology engineers throughout the project. In the first year of SWING,
we applied and evaluated the proposed strategy at the KA Workshop in Paris (6/7 July 2006), which was
organised as a joint workshop by BRGM and UOM. The results are partly documented in Section #...#
(Defining the Scope of SWING Ontologies) and in #...# (Concept Maps and Matrixes).

Preparation

To prepare the knowledge acquisition, the ontology engineers survey external sources covering the domain.
This opening step helps them gain background knowledge that is valuable for the following process.
Also, some examples of the intended use of the ontologies are developed. Such examples help to motivate the
domain experts to participate in the knowledge acquisition process.
The preparation for SWING resulted in a five-step strategy. These subsequent steps are explored in more
detail in the following sections:

    1. Discussing scenarios and competency questions that define the scope of the ontologies.

    2. Acquiring structured knowledge using the 20 question technique.

    3. Using concept maps for representing the results of the interviews.

    4. Applying concept maps as the main intermediate representation for discussions between domain
       experts and ontology engineers.

    5. Utilizing matrix-based techniques for more extensive structuring.

Specification

Clarifying the background and scope of the KA deserves a lot of attention. It is necessary to involve both
parties, i.e. the domain experts and the ontology engineers. Each participant has to understand the goals to
achieve and the upcoming challenges. To establish a common background, an introduction to the domain and
to typical usage scenarios capturing everyday tasks is given by one of the domain experts. This directly leads
into a discussion of the domain of interest based on the application scenarios.

A structure of the domains that are involved in the KA and competency questions formulated in natural language
are the main outcomes of this phase. The results of the KA Workshop in Paris are documented in Section
#...#.

Conceptualization

Most domain experts have never experienced ontology engineering before. Therefore, an easy entry point is
desirable. The results of this first contact with knowledge acquisition are the seed for the domain ontology.
Within SWING, a variant of the 20 question technique is applied. Following this technique, one domain
expert and one ontology engineer perform an interview. The ontology engineer has a particular concept in mind.
This concept should be relevant for the domain of interest; it can be derived from the usage scenarios and
from previous discussions. For example, concepts used for this interviewing technique at the KA Workshop
were “Quarry”, “Quarry Location”, “Crushed Stone” or “Production Rate”. The group performs this
procedure for five different concepts.

The domain expert has to guess the concept by asking up to 20 questions. The ontology engineer is only
allowed to answer each question with “yes” or “no”. The questions and answers are written down in
protocols. Answers should be kept as strictly as possible to “yes” and “no” to avoid ending up in long side
discussions. Arising discussion points are noted in the protocol. A subset of such a protocol is shown in
Figure 18.




 Figure 18: Subset of an application protocol regarding the 20 question technique on the concept “Quarry Location”.

The ontology engineer might also adapt the concept in the course of the interview to a more specific one.
This can be done if the domain expert is close to guessing the concept within a few questions. For example, a
more specific concept than “quarry” might be “aggregate quarry” or, even more specialised, “inactive
aggregate quarry”.

This simple, game-like knowledge acquisition technique is surprisingly powerful. Domain concepts and
relations between them can be derived from each protocol. For example, #...# (Section #...#) illustrates
some of the concepts and relations derived from the interview protocol shown in Figure 18. Additionally, the
order of asking reveals the importance of certain concepts for the domain. Once the domain expert becomes
confident with this technique, and after the first or second round, questions are repeated and combined,
revealing the structure of detailed domain knowledge. Notably, this knowledge structure highly depends on
the individual domain expert and his/her conceptualization of the domain.

Concept Maps for Intermediate Representation

Teach-back using Concept Maps

Based on the concept map that has been generated from the 20 question interviews, a teach-back technique
is used to receive feedback on the structure of the concept map from the domain expert. Both experts add,
delete, rename or re-classify nodes in a given concept map. This technique reveals misunderstandings and
results in structured knowledge. Uncertainties and difficulties that arose during the rounds of the 20
question technique can now be addressed. In this way, the common vocabulary becomes stable and the
structure derived from the 20 question rounds is extended.

Through the teach-back session, a common understanding between one ontology engineer and one domain
expert is established. The results of the individual discussions are then presented in the plenum, where the
concept maps developed by the individual groups are taken as the basis for generating a joint concept map.
The generation of this map is done in four steps:

    1. Concepts and relations common to all of the developed concept maps are carried over into a new one.

    2. Each group brings in those concepts that they found important. A discussion reveals if and how to
       integrate them into the shared version.

    3. Points that the groups are still uncertain about are brought up, discussed, and may influence the
       shared concept map.

    4. Finally, the experts check whether all concepts and relations that are required to answer the
       competency questions defined earlier (see Section #...#) are contained in the map. For example, the
       question “Where do I find production information about aggregate quarries (in France)?” (see above)
       requires concepts like “Aggregate”, “Production” and “Quarry” to be available.

The concept map from the KA Workshop, which represents the first agreed-upon structured domain
knowledge for SWING, is attached as Appendix A.2.

Extending the Structure

In the subsequent step, the knowledge structure (concepts and directed relations between them) is
extended by using stereotypes. Stereotypes enable the definition of mathematical properties of relations, like
symmetry and transitivity. The stereotype “transitive” is used to represent the transitive nature of, for
example, the “partOf” relation (#...#). Inverse relations may be indicated in a similar manner (#...#). More
details on capturing advanced knowledge structures can be found in (SWING D3.1).
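
What the “transitive” stereotype amounts to can be clarified with a short Python sketch that computes the
closure of a hypothetical partOf relation; once the stereotype is formalised, a reasoner performs the
equivalent inference automatically.

    # Hypothetical partOf facts from a concept map, stored as (part, whole) pairs.
    part_of = {("extraction area", "quarry"), ("quarry", "industrial site")}

    def transitive_closure(pairs):
        """Repeatedly add (a, c) whenever (a, b) and (b, c) are both present."""
        closure = set(pairs)
        changed = True
        while changed:
            changed = False
            for a, b in list(closure):
                for b2, c in list(closure):
                    if b == b2 and (a, c) not in closure:
                        closure.add((a, c))
                        changed = True
        return closure

    print(("extraction area", "industrial site") in transitive_closure(part_of))  # True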

Identification of Relevant Domains

The SWING project was kicked off with the definition of use cases covering a set of problems that BRGM
encounters in the management of information on quarries in France. Table 2 lists the specific use case targets,
which should be supported by the SWING application by the end of the project. For more detailed
information on the use cases, please refer to Deliverable 1.1.
   Table 2: Set of specific tasks in the context of information management by BRGM, taken from Deliverable D1.1

     Use-case step                                                 Target

     UC1: Create a simple map                                      Create "production-consumption" map
     UC2: Create a complex map                                     Compile constraints map
     UC3: Use created complex map to make sophisticated queries    Get quarry best location

The central theme for the SWING application is information management on quarries, with the goal of
providing decision support on questions like: Where to open a quarry? How to open a quarry? And where is
the best location for a quarry? We therefore start with some basic facts about quarrying, followed by the
identification of the thematic domains that play a role in the use cases.

A quarry is an industrial site which is opened for the extraction of a natural subsurface resource. The objective is to
deliver a “product” from a “raw material” extracted from a “mineral resource”. This product must be
transported to another place where it is used. Examples of quarry products are gypsum, plaster, ornamental
stone, or limestone for roads and concrete.

The decision on where a quarry is opened depends on the geology. Geological data are the source for
identifying an occurrence of a mineral resource. The market drives the management decisions for opening
quarries. Depending on the market (e.g. a specific construction application), different raw materials are needed to
satisfy the demand (large or small volumes, low or high value-added potential, etc.). Raw material can be
defined by two different keyword lists: either as a lithology (i.e. rock names) or as a mineral resource (i.e.
something useful from rock).

The process of opening a new quarry is very complex. Quarry management is determined by law and
administrative organization in every country. Environmental legislation is becoming more extensive and more
restrictive. Also, quarrying means work on raw materials, implying specific requirements regarding the
equipment, mechanics, roads, fuel and/or energy sources, and others.

Finding potential quarry locations that meet the market requirements is only the first step. The real challenge
is to find the best quarry location. Besides geology, the location of a quarry is determined by several other
factors: communes / territorial administration, protection areas (environment, water production, historical
monument areas, etc.), topography, transportation infrastructure, and also by social factors like the NIMBY
(Not In My BackYard) phenomenon.

The domains identified in the specification phase determine the kinds of domain ontologies that will be
developed in the course of the SWING project. Their extent is determined (and limited) with the help of
competency questions. Depending on their thematic interrelation and extent, several domains might be
captured in only one domain ontology. This domain structure has been used as the baseline for modelling and
formalizing the expert knowledge in concept maps, resulting in a first common view on Use Case I
(documented in Appendix #...#). In the course of the ontology engineering process, this common view will
be split into separate ontologies along the lines depicted in #...#.

Competency Questions

One of the ways to determine the scope of an ontology is to sketch a list of questions that a knowledge base
based on the ontology should be able to answer, so-called competency questions (Grüninger and Fox 1995). In
SWING, we adapt the original notion of competency questions in order to evaluate the domain vocabulary
needed to generate semantic annotations and to formulate user queries in the scope of the SWING Use
Cases. Thus, we will not test the questions against instances in a knowledge base. Rather, we use them to test
the vocabulary specified in the domain ontology for completeness and an adequate level of detail. Does the
ontology contain enough information to formulate these kinds of questions? Do we need a particular level of
detail or a representation of a particular area? The competency questions are just a sketch and do not need to
be exhaustive.
At the Knowledge Acquisition Workshop in Paris, the domain experts from BRGM defined the following
competency questions for Use Case I, “Creating a production-consumption map”:

•     Where do I find information on administrative entities (in France)?

      Specifically, the question targets information about:
        •     boundaries of communes, departments and regions.

•     Where can I find information about population counts at the most specific administrative level?

      Specifically, this question targets information about:
         •    the same year as the production information at hand.


Since ontology development is an iterative process, the same will be done for SWING Use Cases II and III in
the course of the project, until the whole range of knowledge needed for supporting the SWING application is
covered. Besides defining the scope of the ontologies, these informal competency questions will also play an
important role in the evaluation strategy, which will be developed in the 2nd year of SWING.
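
The way we use competency questions can be mimicked mechanically, as in the following Python sketch:
each question is paired with the concepts needed to formulate it, and the ontology's vocabulary is checked
for coverage. Both the vocabulary and the term lists are invented for illustration.

    # Hypothetical concept labels extracted from the domain ontology.
    vocabulary = {"Aggregate", "Production", "Quarry", "Commune", "Department", "Region"}

    # Each competency question paired with the concepts required to formulate it.
    competency_questions = {
        "Where do I find information on administrative entities (in France)?":
            {"Commune", "Department", "Region"},
        "Where can I find population counts at the most specific administrative level?":
            {"Commune", "Population"},
    }

    for question, required in competency_questions.items():
        missing = required - vocabulary
        if missing:  # the ontology cannot yet express this question
            print("incomplete vocabulary for:", question, "->", sorted(missing))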

5.2.5    Knowledge Acquisition from External Sources

Importing existing RDF and OWL ontologies into WSMX is one possibility for integrating external sources
into the knowledge acquisition process. The same import option can be used on other kinds of knowledge
sources that are represented in UML or OWL. Besides these resources, a huge amount of natural language
documents, like dictionaries, thesauri and text documents of any kind, is available electronically. These are a
vast source of information on any kind of knowledge.

There exist a number of text mining techniques capable of extracting ontology structures from text
documents, e.g. OntoGen (Fortuna, Mladenic et al. 2005), which is an ongoing development by the SWING
partner institution Jožef Stefan Institute (JSI). A (semi-)automatic extraction is possible if the knowledge
sources are available electronically. The resulting ontology structures can provide a starting point for the
development of extensive (thematic) domain ontologies. Certainly, they will not replace the need for interacting
with domain experts, but they might serve as valuable input for the domain of interest. In a similar way, text
mining techniques can support the extension of existing ontologies. Single concepts are selected from the
ontology and mining techniques are applied to build new ontologies around these concepts. The result can
be compared to the original ontology and might lead to the inclusion of newly discovered concepts and
relations.

This method requires the availability of a critical mass of information sources for the text analysis. The
application extent of the first use case did not have the thematic outreach to provide enough sources. This
will change with the work on Use Cases II and III, so the method will be applied and evaluated in the upcoming
iterations.

5.3     Design Decisions for Domain Ontologies

D 3.1 section 3 and D 3.2 section 3

...

...

5.4     Ontology Evaluation Strategy

Ontology Evaluation and Validation

Before releasing ontologies and having them deployed in software systems running critical tasks, the
ontology engineer must ensure that the product meets pre-decided quality standards. Quality impacts any
product's usability and durability, and therefore also the cost of maintenance in the case of defects (see also
Section 5 about ontology maintenance). A sophisticated and systematic quality control therefore needs to be
established throughout the ontology engineering lifecycle.

The process of assessing an ontology with respect to application-specific aspects is known as ontology evaluation (Brank
and Grobelnik 2005). The task of evaluating an ontology covers the ontology itself, as well as the associated
software environments and the documentation of the ontology (Gómez-Pérez, Fernández-Lopéz et al. 1996).
Notably, evaluation is just the assessment of values that indicate the Quality of an Ontology (QoO).
Confirming the good quality of an ontology is called Ontology Validation. In the context of any project,
evaluation is performed in order to validate a certain ontology according to project-relevant criteria
(Gangemi, Catenacci et al. 2005). These criteria are also called quality dimensions or quality parameters of
an ontology (Hartmann, Sure et al. 2004; Gangemi, Catenacci et al. 2005). Their measures range from
structural over functional to user-oriented parameters.

5.4.1   “Quality” of an Ontology

Before measures of the quality of an ontology can be detailed, the topic of evaluation requires clarification.
Ontologies represent knowledge (Brewster, Alani et al. 2004). The knowledge is particular to a certain view
on the world and is thus subjective. Hence, the suitable representation of domain knowledge is an obvious
evaluation topic. Defining an ontology as “an engineering artefact, constituted by a specific vocabulary used
to describe a certain reality, plus a set of explicit assumptions regarding the intended meaning of the
vocabulary words” (Guarino 1998), see also SWING D3.1, requires further aspects to be considered as quality
parameters. In general, five kinds of ontology quality parameters can be separated:

    1. Technology-related parameters target the technology, which is available to generate, store, process,
       and visualise the ontologies (Sure and Studer 2002). It includes the used technology as well as
       technologies that are generally able to handle the produced artefacts.

    2. Structure-related parameters focus on the shape of the directed graph that is built by the elements of
       the ontology, i.e. concepts and relations (Gangemi, Catenacci et al. 2005).

    3. Function-related parameters relate to the intended use of an ontology (Gangemi, Catenacci et al.
       2005). These parameters indicate if the formalised knowledge suits the intended purpose and if the
       way of formalisation suits the desired application.

    4. Conceptualisation-related parameters rather focus on underlying principles of conceptualisation and
       of correctly expressing knowledge in general (Gangemi, Catenacci et al. 2005).

    5. Usage-related parameters target the actual use of released ontologies (Gangemi, Catenacci et al.
       2005). Such parameters are closely related to maintenance (Section 5). They detail acceptance of the
       ontology and indicate needs for changes.

5.4.2   Assessing Quality of Ontology Parameters

Based on a literature review, the most commonly considered parameters for each kind of quality parameter
have been identified. Notably, each parameter applies at a certain abstraction level, e.g. the ontology
implementation language (like WSML), the logic (like SHIQ(D)) or the concept map. A table that lists the
appropriate abstraction levels for each parameter can be found in the appendix (Appendix B). Here is a
summary; more details are presented in SWING D3.2:

    •   Technology-related parameters apply to the WSMX tool suite (Haller, Cimpian et al. 2005), and to
        the concept map (Cañas, Hill et al. 2004) and spreadsheet tools that are used for knowledge
        acquisition and structuring (SWING D3.1).

    •   Most structure-related parameters already apply to the intermediate representations used, i.e. to the
        concept maps, as well as to the spreadsheets for capturing structural knowledge. Common
        parameters include the depths and widths of is_a or part_of hierarchies (Gangemi, Catenacci et al.
        2005); see the sketch after this list. At the implementation language level, language conformity must
        be ensured (Sure and Studer 2002), e.g. the conformity of the ontologies to WSML-Flight.

    •   Function-related parameters consider the concept maps and spreadsheets, but include specifics
        regarding the logic and even the implementing language. For example, the correct spelling of terms
        spans over all these abstraction levels, whereas linguistically correct usage of terms can basically be
        ensured at the concept map level.

    •   Considering conceptualisation-related parameters, the spreadsheets need to be examined to offer
        one description per concept and per relation. All of these descriptions have to target concepts, not
        words, and need to conform to relevant terminological standards put forward by e.g. ISO (Köhler,
        K. et al. 2006; Obrst, Ceusters et al. 2007). Testing the concept map for the ontological adequacy of
        taxonomic relationships, and for the correct use of ontological notions drawn from philosophy,
        deserves special attention (Guarino and Welty 2004). Detailed methods for judging and improving
        the ontological adequacy of taxonomic relationships, called OntoClean, have been developed
        previously (Guarino and Welty 2004). The integration of OntoClean with METHONTOLOGY, the
        ontology engineering methodology followed within the SWING project (D3.1), happens during the
        conceptualization phase (Fernández-López, Gómez-Pérez et al. 2001). The alignment of domain
        ontologies to an upper-level, also called foundational, ontology serves as a second possibility. An
        according approach is elaborated in SWING D3.2.

    •   All usage-related parameters depend on the interfacing level. Usually, concept maps or fragments of
        the ontology implementation language, i.e. WSML-Flight, are applied at this stage. It is of major
        importance to get feedback on how easily users recognize the ontology's properties and on the
        difficulties of finding out which set of concepts is most (economically, computationally) suitable for a
        given (series of) task(s).
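
As an illustration of the structural measures mentioned in the second bullet, the following Python sketch
computes the depth and the width per level of a toy is_a hierarchy; which values would indicate good
quality is, as discussed below, an open question. The taxonomy is invented.

    # Toy is_a hierarchy: child -> parent (single inheritance for simplicity).
    is_a = {"InactiveAggregateQuarry": "AggregateQuarry", "AggregateQuarry": "Quarry",
            "Quarry": "Site", "Gypsum": "Product", "Product": "Thing", "Site": "Thing"}

    def depth(concept):
        """Number of is_a steps from the concept up to the root."""
        steps = 0
        while concept in is_a:
            concept = is_a[concept]
            steps += 1
        return steps

    concepts = set(is_a) | set(is_a.values())
    print(max(depth(c) for c in concepts))  # maximal depth of the taxonomy: 4

    width = {}
    for c in concepts:  # width: number of concepts per level
        width[depth(c)] = width.get(depth(c), 0) + 1
    print(width)  # {0: 1, 1: 2, 2: 2, 3: 1, 4: 1} (key order may vary)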

Depending on the involvement and the types of users, the evaluation techniques are either objective or subjective
(Hartmann, Sure et al. 2004). Where well-known algorithms exist, machines can be used to calculate the
quality measure for one of the QoO parameters. In cases where machines cannot be used, ontology engineers
and domain experts have the knowledge to judge quality. In addition, the intended users of released
ontologies may give valuable feedback. Notably, the final users may be neither ontology engineers nor
domain experts. In open environments, anybody has access to the ontologies, and thus even non-experts and
non-intended users can give judgment. Along this spectrum (starting from machines and arriving at anybody's
participation in the process of evaluation), credibility is lost and subjectivity increases. The increasing
subjectivity might be resolved by including a large number of evaluators and by applying statistical
measures. At the current level of ontology usage, the critical mass for reducing subjectivity is not reached.

Approaches Assessing Usage-related Parameters

Usage can only be evaluated based on user feedback. The following paragraphs show how user feedback
channels allow for the analysis of the usage of ontologies. In the first approach (implicit user feedback), the
accesses to concepts and relations are counted (Noy et al. 2005). From these numbers, the ontology engineer
is able to derive (preferably in cooperation with the domain experts) valuable information about the current
state of the ontology. Users only interact with the intermediate representation of the ontology, i.e. a graph
showing only concepts, relations, and labels and descriptions in the preferred language. Usage statistics
therefore only give feedback about these entities (and not, for example, about axioms).

If a certain entity is rarely used (which means, far below average), it can be concluded that:

    •   The entity is too specific. Most domain experts simply do not need that much detail and more
        general concepts are sufficient. Changes to the ontologies are not necessary in this case. For the
        visualisation such concepts could be filtered out, though.

    •   The entity is understood incorrectly. Users might have misunderstood the meaning of the concept
        due to an incorrect label (even with a description), or the entity is simply not modelled the right way.
        The engineer must reconsider the concept in this case and has to communicate with the domain
        experts to find a solution.

If the usage of the entity is high (far above average), it can be inferred that:

    •   The entity is too general. If no further specification of this concept exists, the ontology engineer must
        add additional knowledge to the ontologies to make a more diverse usage possible. If a further
        specification already exists and the sub-concepts are used regularly, no further action is required.
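
A minimal Python sketch of this implicit-feedback analysis: count the accesses per entity and flag entities
far below or above the average. The access log and the factor that defines “far” are invented; choosing that
factor is itself a judgment call for the ontology engineer.

    # Hypothetical access counts per ontology entity (concept or relation).
    access_counts = {"Quarry": 120, "Aggregate": 95, "SIRET_Code": 2, "Thing": 800}

    average = sum(access_counts.values()) / len(access_counts)
    FACTOR = 3  # arbitrary definition of "far below/above average"

    for entity, count in access_counts.items():
        if count < average / FACTOR:
            print(entity, "is rarely used: too specific, or misunderstood?")
        elif count > average * FACTOR:
            print(entity, "is heavily used: possibly too general?")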

Explicit user feedback serves as the second approach. Here, questionnaires are used to find out if users face
difficulties in recognising the ontologies' properties. Errors in labelling, missing concepts, and improper
granularity or too specific concepts are some additional aspects users can report on. Questionnaires can
be used in the form of print-outs or digitally. Digital questionnaires, which are directly coupled with the
access point to the ontologies, seem most promising.

Both implicit and explicit user feedback apply after the ontologies have been released. Usage evaluation
is thus closely related to ontology maintenance (Section 5).

5.4.3   The SWING Ontology Validation Strategy

Ontologies are developed to serve a specific task. The importance of each parameter depends much on the
application and cannot be generalised. Therefore, specific strategies, which incorporate a sub-set of the
measures that were introduced in Section 4.2 and which define criteria for successful validation, need to be
defined for each project. This section introduces such a strategy for the SWING project. The parameters
that are important for validating the SWING ontologies are identified, decisions on implementing their
validation are detailed, and we report on first experiences. A more detailed report will be included in the
documentation of the final ontology repository, i.e. SWING D3.3.

Within the project, the tasks include semantically annotating service capabilities and service contents,
formulating goals for discovery, and specifying workflows for execution. Domain experts and end-users
need to be supported in describing the content of provided services, as well as in formulating
queries for discovery and goals for execution. Within the scope of the project, the domain experts and end-
users are geologists from BRGM and software developers who specify service compositions using the
development tool (SWING D6.1). For the project, the goal is to ensure the fitness for purpose of the ontologies
for these users. Therefore, we focus the evaluation on the functional, conceptualisation, and usage levels.

Considering validation, most of the work that needs to be done focuses on the Domain Ontologies. The ISO
and OGC Ontologies, in contrast, are based on well-established standards regarding service interfaces and
data formats, respectively. These kinds of ontologies have been developed in order to translate data instances
into instances of an ontology. Evaluation against the correct representation of the standards is done manually
and requires only considering the axioms. The remaining elements (concepts and relations) have in any case
been directly derived from the established UML class diagrams (see Section 6.1.2 of SWING D3.1). The main
issue is developing additional ontologies when standards are released in new versions.

In the remainder of this section, we discuss each kind of parameter separately and argue why the SWING
project requires a specific focus on some parameters while others are neglected. Steps towards an
integrative evaluation are outlined in the final part of the section. They point to a way forward, but are not
directly in the scope of this project.

Validating Technology-Related Parameters

In the case of WSMX (Roman, Keller et al. 2005), the tool suite needs to compete with widely distributed
OWL tools. All comparative measures can in fact be used to evaluate WSMX against the classical
tools of the OWL community. Such an evaluation is a general WSMX issue and not in the scope of the SWING
project. Reaching (or staying below) a certain value that indicates performance is important for many
industrial projects, but not for the research-focussed SWING project. The amount of working memory as well
as the memory for data storage are again classical measures for industrial product development. Neither will
be evaluated within the SWING project. The same holds for scalability.

Nevertheless, SWING D3.1 already contains some technology evaluation by arguing for the WSML variant
that suits the requirements of the SWING project (Section 4 of D3.1).

Validating Structure-Related Parameters

At the level of the concept maps and the spreadsheets used, the available tools largely ensure language
conformity, because the offered constructs can only be used in the intended way. In the case of ontology
implementation software, like WSMX, the ontology code can be tested automatically against the well-defined
syntax.

Further structure-related parameters are not a focus of the strategy for two reasons. First, many of
these parameters apply to hierarchies, especially to is_a hierarchies (taxonomies) with multiple inheritance
and to part_of hierarchies (partonomies). Both kinds of hierarchies are not extensively used in the modelling
approach of the SWING project. This is because is_a, in particular, is often misused, and
modelling via non-taxonomic relations often suits better (Guarino and Welty 2002). Second, the value of
structural metrics is questionable. It is unclear which values of the various parameters can be used as
reasonable thresholds, i.e. in which cases parameter values indicate good quality. Evaluation is just the
assessment of values that indicate QoO; the question of thresholds for these values has not been answered
yet. Does 4.5 as the ratio between primitives and non-primitives indicate high quality? If 68% of the consulted
domain experts agreed to a statement, does this mean the according part of the ontology is ready to be
released?

Concerning the structure-related parameters, final conclusions can only be drawn after evaluations have been
applied in a large setting. In order to answer this and related questions, we plan to apply the introduced
measures (see Section 4.2.2) to the final set of ontologies for all use cases as a case study. Based on the
study, we will include suggestions for reasonable thresholds and for the valuable use of the structure-related
parameters in the documentation accompanying SWING D3.3.

Validating Function-Related Parameters

Spell checking needs to be included. At present, this needs to be done manually, i.e. by copying the ontologies
to some text editor and doing a spell check there. Ideally, spell checking is integrated into the ontology
implementation tool. In this case, labels can be filtered and checked, which avoids identifying terms of the
ontology language, like “subConceptOf” or “definedBy”, as spelling mistakes.
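
The label filtering could look like the following Python sketch: identifier-like tokens are extracted from a
WSML fragment, the language keywords are dropped, camel-case labels are split into words, and the
remainder would be handed to whatever spell checker is available. The fragment and the keyword list are
abbreviated for illustration.

    import re

    # Abbreviated list of WSML keywords that must not be spell checked.
    WSML_KEYWORDS = {"concept", "subConceptOf", "ofType", "axiom", "definedBy"}

    wsml_fragment = """
    concept Quary subConceptOf IndustrialSite
      hasLocation ofType Location
    """

    words = []
    for token in re.findall(r"[A-Za-z]+", wsml_fragment):
        if token in WSML_KEYWORDS:
            continue  # skip language terms such as "subConceptOf"
        # Split camel-case identifiers into individual words.
        words.extend(w.lower() for w in re.findall(r"[A-Z]?[a-z]+", token))

    print(sorted(set(words)))  # ['has', 'industrial', 'location', 'quary', 'site']
    # "quary" would now be flagged by any standard spell checker ("quarry").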

One requirement, which is defined in the specification phase, is to annotate the desired resources sufficiently. In
the case of the SWING project, these resources are the Web Services that are required to implement the use
cases. This basically means: can a WFS be annotated and discovered? A method for checking this is trying
to annotate the services and reporting change requests and errors as they are discovered. In the SWING project,
these annotations can be done using the OntoBridge tool (D4.2).

Meeting the competency questions is a fundamental goal of the ontologies under development (SWING D3.1). If
the questions defined in the specification phase of the ontology engineering process cannot be posed in the
first place, the ontologies fail their functional purpose. Additionally, as in software engineering, unwanted
changes and side effects can be detected by specifying discovery goals in WSMX and by annotating services
as described in the paragraph above. After each change in the ontologies, it can be checked (using the
reasoner) whether the desired set of services is still discovered. By nature, the test set of goals will grow during
the ontology development process.
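
In the spirit of regression testing, this can be automated along the lines of the following Python sketch. The
goals, the expected result sets and the discover function are invented stand-ins; in SWING, the actual call
would go to the WSMX reasoner.

    # Expected discovery results, maintained alongside the ontologies:
    # goal -> set of services that must (still) be discovered after a change.
    expected = {
        "overlay": {"WPS-A", "WPS-B"},
        "quarry features": {"WFS-BRGM"},
    }

    def discover(goal):
        """Stand-in for the discovery call against WSMX; returns canned
        results here so that the sketch is runnable."""
        canned = {"overlay": {"WPS-A", "WPS-B"}, "quarry features": {"WFS-BRGM"}}
        return canned.get(goal, set())

    def regression_test():
        ok = True
        for goal, services in expected.items():
            found = discover(goal)
            if found != services:
                print("regression for goal", repr(goal), ":", found, "!=", services)
                ok = False
        return ok

    print(regression_test())  # True as long as no ontology change broke discovery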

Questions can also be used to test the validity of the ontologies. Questionnaires with statements from the
ontologies can be generated, and domain experts can be asked to assess their correctness. Statements like “One
SIRET code can be related to one unique address” can be accepted, rejected or marked as unclear. The latter
means that more clarification is required and the domain expert must comment on it. It is also possible to
deliberately generate wrong statements, which relate concepts that originally have no link within the
ontology. Acceptance of such statements might result in the need for new relations.

Generating these statements is not easy, though. One must be very careful not to put too much information
into them. Often a statement was generally correct, but a small part of the sentence bothered the domain
expert and made him or her reject the whole sentence.
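
A rough Python sketch of the statement generation: relation triples from the ontology are verbalised with a
simple template, and deliberately wrong statements are built from concept pairs that are unrelated in the
ontology. Triples, template and concept list are invented for illustration.

    import random

    # Hypothetical (subject, relation, object) triples from the domain ontology.
    triples = [("SIRET code", "can be related to", "one unique address"),
               ("quarry", "produces", "aggregate")]

    concepts = ["SIRET code", "quarry", "aggregate", "commune"]

    def questionnaire(triples, n_wrong=2, seed=1):
        random.seed(seed)  # deterministic for the example
        items = [(f"One {s} {r} {o}.", True) for s, r, o in triples]
        related = {(s, o) for s, _, o in triples}
        while n_wrong > 0:
            s, o = random.sample(concepts, 2)
            if (s, o) not in related:  # deliberately unrelated pair
                items.append((f"One {s} can be related to one {o}.", False))
                n_wrong -= 1
        return items

    for text, correct in questionnaire(triples):
        print(text)  # the 'correct' flag stays with the engineer, not the domain expert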

A couple of such questionnaires were used for functional evaluation during a second knowledge
acquisition workshop (16th and 17th of August 2007, Paris), held in cooperation between BRGM and UOM.
Examples of the utilised questionnaires are attached as Appendix C. While one session of the workshop
focussed on acquiring knowledge for the second SWING use case, another session was dedicated to the validation
of the ontologies generated for the first use case. Two domain experts were available to perform this
task, and the outcome was mostly accepted. Since these were the same domain experts with whom we collaborated
to create the ontologies in the first place, this outcome was to be expected. Before re-using this method for
future ontologies, it needs to be ensured that the domain experts come from a different group and did not
contribute to the creation of the ontologies.
Validating Conceptualisation-Related Parameters

Testing the number of descriptions per concept can be done manually, but for WSMX it would be possible to
check the number of dc#description occurrences, as well as the number of values for this property,
automatically. This requires implementation work within the WSMX environment.
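
Operating on the WSML text instead of inside WSMX, such a check could look like this Python sketch; the
fragment is invented and the test simply counts dc#description annotations per concept block.

    import re

    wsml_fragment = """
    concept Quarry
      nonFunctionalProperties
        dc#description hasValue "An industrial site opened for extraction."
      endNonFunctionalProperties
    concept Aggregate
    """

    # Split the text into concept blocks and count the descriptions in each.
    blocks = re.split(r"(?m)^\s*concept\s+", wsml_fragment)[1:]
    for block in blocks:
        name = block.split()[0]
        count = block.count("dc#description")
        if count != 1:
            print(name, "has", count, "descriptions, expected exactly 1")  # flags Aggregate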

Linguistic thesauri can be used where suitable to reduce the human effort in writing one's own concept descriptions.
However, it is not the aim of the SWING strategy to analyse the interplay between using linguistic thesauri and
structuring knowledge that is very specific to a certain domain. A possible solution has been pointed out, but
is not fully implemented, because of the required effort and expertise. Due to the fact that definitions are not
commonly available for all concepts and relations of an ontology, it is best practice to indicate (within the
ontologies) where a description has been evaluated, and where and which foreign sources were consulted.

Considering the SWING project, ISO TC211 and the OGC provide the terminology standards. Both are application-
independent. Hence, terminology standards for domain knowledge are not considered. Only the ISO and
OGC Ontologies have to commit to these terminology standards. Conformance is ensured, because we
derived the GML ONTOLOGY from the standard's UML and its documentation.

Not as a standard, but as a guideline, the GEMET thesaurus22 can be used to ensure the use of a well-agreed
domain vocabulary. From the Use Case II ontologies onwards, this is already ensured by the knowledge
acquisition strategy, which was adopted from (Stuckenschmidt and van Harmelen 2005). In a nutshell, the
authors propose a bottom-up knowledge acquisition strategy based on refinements and on using information
from foundational ontologies, scientific classifications, domain thesauri (in our case GEMET), linguistic
thesauri and data catalogues.

In many other research activities on information system (IS) ontologies, especially those related to the
Semantic Web, the philosophical roots are widely ignored, bending the original notion of ontology in a
questionable way; see (Born, Drumm et al. 2007) and (Vaculin and Sycara 2008) for examples. We consider
it important to include these philosophical issues in ontology engineering. OntoClean provides one step in
this direction, the use of foundational ontologies another.

Since OntoClean focuses on taxonomic relations and is especially useful if multiple inheritance occurs
frequently, we decided not to use OntoClean to a wide extent. Extensive use of is_a relations and of multiple
inheritance does not apply to the SWING ontologies, where non-taxonomic relations are prominently used.

Alignment with a foundational ontology seems an appropriate approach towards high-quality domain
ontologies. For evaluation purposes, trying to align a given domain ontology to a foundational ontology
reveals whether the domain ontology has a philosophical underpinning. For knowledge acquisition, it in turn
makes sense to use a foundational ontology straight away. The alignment improves the structure and robustness
of the ontology and helps to reveal the conceptualisation of the domain. For this reason, instead of using
foundational ontologies for evaluation, we suggest integrating the approach directly into the knowledge
acquisition strategy. Once the first concept maps are available, a foundational ontology can be used to improve
the structuring before the formalisation is done. Such an approach is partially suggested already in (Stuckenschmidt
and van Harmelen 2005), where Upper Cyc (Lenat 1995) is promoted as the foundational ontology.

Due to the unavailability of instances, we need to use an intensional approach in which alignments are
developed by manually comparing the informal definitions and related elements (e.g. super-concepts, sub-
concepts, relations) of the SWING domain ontologies and the foundational ontology. As the concrete
foundational ontology, we furthermore suggest using the Descriptive Ontology for Linguistic and Cognitive
Engineering (DOLCE). DOLCE suits this undertaking for several reasons. Most importantly, we think that
the top-level notions introduced by DOLCE are adequate for deriving the different kinds of entities that are
related to the geospatial and geosciences domains. In fact, physical space is the key criterion in DOLCE to
distinguish between its most general entities: only endurants are located in physical space, whereas perdurants
are located in time. Entities which have no location in physical or temporal space are abstracts.

Experiments using DOLCE, GEMET and Wordnet in the conceptualisation phase of the ontology



22
GEMET is officially available from http://www.eionet.europa.eu/gemet.
engineering process have been done while developing the SWING ontologies for use case two. Since this
deliverable is meant to define the general modelling approach and to give guidelines, all revisions of the
knowledge acquisition process will be included in the documentation accompanying SWING D3.3.

Validating Usage-Related Parameters

Recommendation: More Language Independence

A review of the tools that are available to support the validation of various parameters has been carried out by others (Hartmann et al. 2004). Most notably, OntoManager was identified as best suited for continuous evaluation throughout ontology engineering, while OntoClean was judged to be impractical. The main drawback of all these tools is their limited coverage: many are ontology-language dependent and support only the Web Ontology Language (OWL) (Bechhofer, Harmelen Van et al. 2003).

Although out of scope for the SWING project, it is highly desirable to develop a largely language-independent approach for ontology engineering. Continuous validation mechanisms should not be implemented in language-dependent tools, but in a meta-level one, because many parameters relate to concept maps only (see Appendix B).

We envision a tool based on Model Driven Architecture (MDA) (Kleppe, Warmer et al. 2003), in which ontologies are developed as models. Most of the evaluation parameters relate to the model level. Automated translation from one model to various languages, including WSML and OWL, would ensure correct use of a specific Ontology Description Language (ODL) syntax. Integrated spell-checking could be included at the model level, and all of the structure-related parameters could be calculated automatically; the effort of implementing the algorithms that calculate such parameters needs to be taken only once. Envisioning conceptual storage in the long term, implementing a count of descriptions and French translations at the WSMX level is not suitable; this, too, can be done at the model level. Although judged impractical, OntoClean could be integrated by specifying meta-properties in the property view.
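The following sketch illustrates how such model-level checks might look. It is only an illustration, not part of the SWING tool-set; the model classes and parameter names are hypothetical, and a real MDA-based tool would operate on a proper meta-model instead of plain Python classes.

    # Hypothetical sketch: computing evaluation parameters at the model level,
    # independent of any concrete ontology language such as WSML or OWL.
    from dataclasses import dataclass, field

    @dataclass
    class Concept:
        name: str
        description: str = ""                        # informal definition
        labels: dict = field(default_factory=dict)   # language tag -> label

    @dataclass
    class OntologyModel:
        concepts: list = field(default_factory=list)

    def count_missing_descriptions(model: OntologyModel) -> int:
        # Concepts without an informal definition (a structure-related parameter).
        return sum(1 for c in model.concepts if not c.description.strip())

    def count_missing_translations(model: OntologyModel, lang: str = "fr") -> int:
        # Concepts lacking a label in the given language (here: French).
        return sum(1 for c in model.concepts if lang not in c.labels)

    model = OntologyModel(concepts=[
        Concept("Quarry", "site where aggregates are extracted", {"en": "quarry"}),
        Concept("Department", labels={"en": "department", "fr": "département"}),
    ])
    print(count_missing_descriptions(model))   # 1 (Department lacks a description)
    print(count_missing_translations(model))   # 1 (Quarry lacks a French label)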

5.5     Maintenance Strategy

#...#

5.5.1    The SWING Ontology Maintenance Strategy

In Section 4 we already introduced the notion of quality in the context of ontology development. The result is a set of engineered ontologies, which are deployed and used in software systems, mostly for the purpose of information retrieval or internal consistency and interoperability checks. The following section introduces the last, and far too often neglected, step in ontology engineering: ontology maintenance. In this phase of the ontology development process, the ontologies become actually used and the developers receive feedback. The ontologies produced here are strongly coupled with software; applying the methods developed for software engineering is therefore not only the pragmatic, but also the most reasonable approach.

Goals of Ontology Maintenance

According to the definition found in ISO/IEC 14764, “software maintenance is the modification of a
software product after delivery to correct faults, to improve performance or other attributes, or to adapt the
product to a modified environment”. For ontologies, we can (following the ISO definition above) identify
two major reasons which might require maintenance: the lack of either consistency or usefulness.

Consistency

The notion of consistency has already been introduced in Section 4. There we focussed on the machine-
supported evaluation of the functional parameters, e.g. to detect contradicting statements. Wordnet (see
footnote 8, p. 34) defines consistency as “a harmonious uniformity or agreement among things or parts”.
Inconsistencies can occur among many different things within ontologies, e.g. the naming of the concepts, the use of non-functional properties, the name-spaces, conceptually wrong compositions, and more. Some of these inconsistencies can be avoided by a sophisticated and controlled ontology engineering process supported by implementation conventions (e.g. governing the use of name-spaces). Others can be detected automatically by testing methods (e.g. wrong alignment of concepts) or by incorporating user feedback (e.g. wrong spelling of concept labels). Some might never be detected, but no one would claim that knowledge can be modelled without error.

We have to deal with syntactic inconsistencies like different name-spaces, naming of entities which does not
comply with the conventions discussed in Section 3, different use of WSML-dialects, and more. Semantic
inconsistencies due to false assumptions, e.g. wrong classification, illogical conceptualisations, or
contradicting axioms occur as well. Some of these errors might be discovered by automatic testing, others
require collaboration with the domain experts or user feedback to be located. A systematic and diligent
approach for the ontology engineering helps to avoid most inconsistencies.

Usefulness

Lack of usefulness, or of “the quality of being of practical use” (Wordnet), also has an impact on the usability
of the software based on the ontology. A system used to retrieve geographic information can be rendered
useless if the ontologies used to formulate the queries do not provide the expected vocabulary. The scope of
the ontologies is defined in the first step of the ontology engineering process (e.g. with the help of
competency questions). Ontologies are appropriate for the intended use if the domain experts and other
specialists using this software all commit to the captured and formalised knowledge. But the community and
the setting evolve over time; the ontologies have to change accordingly.

Improving the usefulness is therefore one of the declared goals of ontology maintenance. Such maintenance could also be considered as enhancement. The most important goal here is to avoid insufficient accuracy of the knowledge. Too much encoded knowledge might lead to performance problems and make the ontologies harder to comprehend; not enough might, on the other hand, result in insufficient and not very useful ontologies. Extending the ontologies to capture new knowledge will usually be required some time after the release. Tweaking the ontologies does not lead to new knowledge or fixed errors, but makes the ontologies more usable in the context of re-using them for other purposes. Modularization, for example, helps to abstract the ontologies and to bundle small bits of related knowledge into separate ontologies. Tweaking also addresses performance and accuracy issues. Refactoring can be described as changing the original code in a way that improves its readability or simplifies its structure without changing its usefulness. Refactoring must take all ontologies of one system into account; a typical example would be changing the name-spaces when ontologies are relocated, or switching to new versions of the ontologies.

Detection of inconsistencies is usually done by testing methods (see also Section 4.2.3), which have to take place after every (even minor) modification of the ontologies. Other tasks affect all ontologies, on the syntactic as well as the semantic level, and therefore initiate a completely new ontology engineering life-cycle. Such modifications result in new versions of the ontologies.

Methods of Ontology Maintenance

The following section introduces methods implementing ontology maintenance and explains what has been,
and what will be, applied in the context of the SWING project.

Automatic Ontology Testing

Testing plays a crucial role in software engineering, and there is wide support through methods and tools which help to perform these tasks. Results of ontology tests indicate whether the ontologies can be released, i.e. whether they meet the pre-defined level of quality, or whether they need another revision. The general ideas of testing, e.g. for the functional parameters of an ontology, have already been discussed in Section 4. The following section focuses on the idea of unit testing, which can be applied automatically after maintenance to ensure the quality discussed and evaluated during the ontology engineering period.

Research on applying unit testing to ontologies has already been performed. The OWL Unit Test Framework23 can, for example, be used to create and run unit tests for ontologies written in OWL (with the editor Protégé). Semi-automatic testing of WSML ontologies can be performed with the help of the OUnit view, described in



23
     Available at http://www.co-ode.org/downloads/owlunittest/.
(Tanler 2007).

In SWING D3.1 we introduced the concept of competency questions to identify the needed scope of the
ontologies. We used these questions “to test the vocabulary specified in the domain ontology for
completeness and adequate level of detail” (SWING D3.1). For SWING use case one, the questions have
been implemented as WSML goals and various discovery scenarios have been realized.

In the maintenance phase we extend the purpose of the competency questions: they are used not only to check coverage and level of detail, but also to check whether modifications resulted in unexpected and unwanted changes. Such testing is performed automatically after modifications of the ontologies, which means that the (originally informal) competency questions need to be formalised. WSMO goals are the obvious means to formalise these questions, and WSMO Web Services are the expected results. Writing these tests therefore means formulating testing Web Services and testing goals. The following example of a competency question is taken from SWING D3.1 for the first use case of this project, “Creating a production-consumption map”.

                             Where can I find information about population counts?

In the first phase of the knowledge acquisition process we claimed that we need to be able to answer this question with the help of our ontologies. For testing purposes, we therefore implemented the informal goal above to check whether the ontologies satisfy the assumption listed below.


       Goal testFindingPopulation
       capability
        postcondition
          definedBy
            /* GOAL: some information which includes the count of persons */
            ?result[?attribute hasValue ?value] and
            ?count memberOf meas#Count and
            ?count[meas#countsEntity hasValue domain#Person] and
            annot#annotate(?value, ?count).

  Listing X: Post-condition of a goal that formalises the informal goal: “Where can I find information about population counts?”.

This formalized competency question does not directly ask for something annotated with the concept
domain#population. Population is actually the count of people living in a specific region; the domain
knowledge must provide this information. In the first use case, the population is needed to calculate the
expected consumption of aggregates (produced in the quarries) in the region. The service, which provides
this information is called INSEE, the post-condition of the service looks like the following:
          webService INSEE
          capability
          postcondition getPopulationFromDepartment
            definedBy
              /* WEBSERVICE: response which has Population of
                   department as attribute */
              ?response memberOf ins#INSEEgetPopulationByDepartmentResponse and
              ?response[getPopulationFromDepartment hasValue ?value] and
              ?population memberOf domain#Population and
              annot#annotate(?value, ?population).

                                    Listing X+1: Post-condition of the INSEE Web Service Description.

Running the test, i.e. running a discovery of the Web Services which match the given goal, must return the given Web Service (which therefore, of course, has to exist in the knowledge base). If the test fails, we can assume that the conceptualisation on the domain level has an inconsistency and needs to be revised.

This example gives a first impression of the capabilities of unit testing within WSML. We can not only test
complete discovery scenarios with the help of testing goals, but can also use testing axioms to test specific
conceptualisations on the domain level. Furthermore, with the help of testing goals including instances of
service requests, we can also invoke the WSMX service to execute compositions to test their functionality.

An ontology testing framework based on the simple ideas of JUnit24 has been implemented for SWING. The goals used to test discovery and compositions, as well as the axioms, are classified as test cases if their identifier starts with the label “test” (case is ignored). After larger modifications it is possible to run the tests over a set of files: the implemented testing framework scans all files for test cases and executes them accordingly.
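To make the scanning step concrete, the following sketch shows one way such a framework might collect test cases from a set of WSML files. This is an illustration only, not the SWING implementation: the file layout is assumed, and the execute callback stands in for the actual call to the reasoner or to WSMX.

    # Illustrative sketch: collect identifiers starting with "test" from WSML
    # files and hand them to an execution callback (e.g. a discovery run).
    import re
    from pathlib import Path

    TEST_PATTERN = re.compile(r"^\s*(goal|axiom)\s+(test\w+)",
                              re.IGNORECASE | re.MULTILINE)

    def collect_test_cases(directory):
        """Return (kind, identifier, file name) for every test case found."""
        cases = []
        for wsml_file in Path(directory).glob("*.wsml"):
            text = wsml_file.read_text(encoding="utf-8")
            for kind, identifier in TEST_PATTERN.findall(text):
                cases.append((kind.lower(), identifier, wsml_file.name))
        return cases

    def run_test_cases(directory, execute):
        """Run all test cases; `execute` returns True if the expected Web
        Service (or assertion result) is obtained, False otherwise."""
        failures = []
        for kind, identifier, source in collect_test_cases(directory):
            if not execute(kind, identifier):
                failures.append((identifier, source))
        return failures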

The following example shows a test query which returns the number of concepts which are sub-concepts of
the concept Geometry. We assert that this query has to return eleven results. Assertions, which state the
expected result of the test query, are formulated within the non-functional properties section of the testing
query.


          axiom testDefinedGeometries
            nfp
              test#assertNumber hasValue "11"
            endnfp
            definedBy
            ?a subConceptOf iso#Geometry

                                                     Listing X+2: Example of a test query.

In (Tanler 2007) a more generic approach to ontology testing has been introduced. The implementation of the ontology testing framework for the SWING ontologies was (a) focussed on the requirements of the SWING project and (b) realized from the perspective of an ontology engineer. Our solution, for example, is independent of the chosen editor, because we frequently switch between the different options for editing WSML ontologies.

User Feedback Channel

Automatic testing is suitable to detect inconsistencies resulting from faulty modifications of the ontologies. It


24
     The official Web page is available from http://www.junit.org/.
cannot detect, for example, missing coverage (testing can only detect anticipated errors) or wrong naming. For such errors we have to rely on other methods. Feedback is a well-known and reliable approach to detect inconsistencies and missing features which the engineers have not thought about during the development cycle.

Open source software engineering has brought the integration of the feedback loop close to perfection. Users of the software have multiple channels for giving feedback to the developers, for example to report bugs, request missing features, discuss the software in general, or even rate it. The project Website of the WSMO editor WSMO Studio at Sourceforge25 is an example of how these different channels can be used when building software. User feedback can be either implicit or explicit.

The user gives implicit feedback without knowing it. Downloading software from Sourceforge, for example, gives an indication of its popularity. For ontologies, we can count how often specific concepts are used for formulating queries and infer different aspects from these counts. Concepts which are hardly ever used might be too specific, named incorrectly, or simply not required. Concepts which are very popular might accordingly be too general and need a more thorough specification.
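A simple sketch of this kind of usage analysis is given below. The thresholds and concept identifiers are purely illustrative; in practice they would have to be calibrated against the overall query volume.

    # Hypothetical usage analysis: flag concepts whose hit counts suggest they
    # are too specific (rarely used) or too general (used very frequently).
    def analyse_usage(hit_counts, low=2, high=100):
        """hit_counts maps concept identifiers to usage frequencies."""
        too_specific = [c for c, n in hit_counts.items() if n <= low]
        too_general = [c for c, n in hit_counts.items() if n >= high]
        return too_specific, too_general

    counts = {"domain#Quarry": 57, "domain#GraniteQuarry": 1, "domain#Entity": 312}
    specific, general = analyse_usage(counts)
    print(specific)  # ['domain#GraniteQuarry']: candidate for renaming or removal
    print(general)   # ['domain#Entity']: candidate for a more thorough specification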

Another measure could use relevance feedback to assess whether a particular Web Service has been annotated in the right way. If a Web Service is regularly returned for a specific query, and the searching users state that they actually did not expect this service to be a result of this query, a revision of the Web Service annotation might be required.

Allowing for explicit feedback makes it possible to detect errors like wrong labels for the concepts, missing
concepts, wrong granularity or too specific concepts, and more. An example of built-in user feedback in the
MiMS application is given in Figure 18.




                                    Figure 18: MiMS GUI including an option for explicit (user) feedback.

The user is given the option to suggest a correction to the (in his opinion) false conceptualisation. A Web


25
     Available from http://sourceforge.net/projects/wsmostudio.
Service will be implemented to make both the implicit and the explicit user feedback channel available. It is yet too early to formally define the interfaces which will be used to acquire the user feedback. Instead, we describe where the user needs to be able to give feedback within the tool-set developed in the SWING project, and what actually happens to the ontologies after feedback has been submitted (Sections 5.2.3 and 5.2.4).

There are only two occasions on which the user must interact directly with the ontologies. In both cases the task of the user is to perform an annotation: on the one hand to annotate an existing service or composition, on the other hand to formulate queries, i.e. to annotate a goal. In both cases he does not have to deal with the ontologies themselves, but with an intermediate representation. The ontologies are parsed into a graph and then conveniently visualized to make the selection of appropriate concepts as simple as possible. The user is therefore only confronted with a subset of the ontology; axioms, meta-data properties, and WSMO-specific constructs are filtered out (Figure 19).




                                                  Figure 19: The GUI of Virtual Onto Bridge.

Managing Implicit Feedback

In order to perform the usage analysis introduced in Section 4, the ontology engineer needs access to the usage statistics of the different entities within the ontology. For annotating services within QUERY or DEV ANNOT (SWING D6.3), the user must select a path within the graph which can best be associated with the service concept, e.g. the feature type. Let us reconsider the example about the competency questions given in the section before. We assume the user has selected the following path in the graph to annotate the output of the operation getPopulationFromDepartment provided by the INSEE Web Service26 (Listing X+3). The resulting data represents a count of people living in a department.




26
     All relevant Web Services can be found at: http://swing.brgm.fr/dataaccess.
domain#Department ――subConceptOf――>
 domain#AdministrativeEntity ――domain#hasPopulation――>
  domain#Population ――subConceptOf――>
    meas#Quality ――meas#isQuantifiedBy――>
     meas#Count ――meas#countedEntity――>
      domain#Person

                            Listing X+3: An example graph generated by the ANNOT tool.

The WSML generated by the ANNOT application would then be:


         ?a memberOf domain#Department and
         ?a[domain#hasPopulation hasValue ?b] and
         ?b memberOf domain#Population and
         ?c memberOf meas#Count and
         ?d memberOf domain#Person and
         ?b[meas#isQuantifiedBy hasValue ?c] and
         ?c[meas#countedEntity hasValue ?d]

                                Listing X+4: Example WSML generated by ANNOT.

The selected path and the annotations are sent back either to MIMS or to DEV to perform Discovery or Publication. Before that, this information is also transmitted through the feedback channel to the feedback service, which stores statistical usage information about each entity within the path. These will, of course, be simple hit statistics, i.e. how often an entity has been used in general. But the correlation between concepts (how often two concepts are used together) can also be of interest here.
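The statistics the feedback service keeps could be as simple as the following sketch, which records hit counts per entity and co-occurrence counts per pair of entities from each submitted annotation path. The class and method names are illustrative, not the actual SWING interface.

    # Sketch of a feedback store for implicit usage statistics.
    from collections import Counter
    from itertools import combinations

    class FeedbackStore:
        def __init__(self):
            self.hits = Counter()    # how often each entity has been used
            self.pairs = Counter()   # how often two entities were used together

        def record_path(self, entities):
            self.hits.update(entities)
            for a, b in combinations(sorted(set(entities)), 2):
                self.pairs[(a, b)] += 1

    store = FeedbackStore()
    store.record_path(["domain#Department", "domain#Population",
                       "meas#Count", "domain#Person"])
    print(store.hits["domain#Population"])               # 1
    print(store.pairs[("domain#Person", "meas#Count")])  # 1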

Before we explain how changes are stored in the ontologies, we have a look at the RDF representation of the concept Population and its relation to Department (Listings X+5 and X+6).
      <wsml:Concept rdf:ID="Department">
        <rdfs:subClassOf rdf:resource="#AdministrativeEntity"/>
        <rdfs:label xml:lang="en">department</rdfs:label>
        <rdfs:label xml:lang="fr">département</rdfs:label>
        <wsml:hasAttributeDefinition>
           <wsml:AttributeDefinition>
              <wsml:forAttribute rdf:ID="hasPopulation">
                <rdfs:range rdf:resource="#Population"/>
              </wsml:forAttribute>
           </wsml:AttributeDefinition>
        </wsml:hasAttributeDefinition>
      </wsml:Concept>

                         Listing X+5: Department concept description in WSML, RDF format.



      <wsml:Concept rdf:ID="Population">
        <rdfs:subClassOf rdf:resource="meas#Quality"/>
        <rdfs:label xml:lang="fr">population</rdfs:label>
        <dc:description rdf:datatype="xsd#string"> the count of people living
                                                   in a specific region
        </dc:description>
      </wsml:Concept>

                         Listing X+6: Population concept description in WSML, RDF format.

Having WSML stored in its RDF representation opens up some interesting opportunities. First, the ontology is now based on the RDF model (which is also the native model of the Web Ontology Language OWL). With the help of model-to-model transformations and a set of rules it is now possible to transform between these two languages. Transformation to a description-logic based language like OWL, however, results in the loss of certain features, due to the lower expressivity of such languages.

One additional feature not represented in the human-readable WSML code is already present in the example above: the language tag, which can be used to annotate RDF literals like rdfs:label. We now have the option to extend the ontology with additional information (because RDF is based on XML, the eXtensible Markup Language). We add the usage information as an attribute count in the namespace “eval” to the particular elements in the RDF document (Listing X+7).
        <wsml:Concept rdf:ID="Department" eval:count="23">
         <rdfs:subClassOf rdf:resource="#AdministrativeEntity"/>
         <rdfs:label xml:lang="en">department</rdfs:label>
         <rdfs:label xml:lang="fr">département</rdfs:label>
         <wsml:hasAttributeDefinition>
            <wsml:AttributeDefinition>
               <wsml:forAttribute rdf:ID="hasPopulation" eval:count="4">
                  <rdfs:range rdf:resource="#Population"/>
               </wsml:forAttribute>
            </wsml:AttributeDefinition>
         </wsml:hasAttributeDefinition>
        </wsml:Concept>

                 Listing X+7: Department concept description in WSML, RDF format with usage information.

Please be aware that these extensions are (for now) only used internally to decide whether ontology updates are necessary. The ontologies are not published with this extended information, since the resulting documents would not comply with the official specifications any more.
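Stripping these internal attributes before publication is straightforward, as the following sketch shows. The namespace URI is an assumption made for illustration; the actual URI bound to the “eval” prefix is not fixed in this document.

    # Sketch: strip internal "eval" attributes from the RDF/XML representation
    # before publication, so the released document complies with the official
    # specifications again. The namespace URI below is assumed.
    import xml.etree.ElementTree as ET

    EVAL_NS = "http://swing-project.example/eval#"  # assumed namespace URI

    def strip_eval_attributes(rdf_xml):
        root = ET.fromstring(rdf_xml)
        prefix = "{%s}" % EVAL_NS
        for element in root.iter():
            for name in [a for a in element.attrib if a.startswith(prefix)]:
                del element.attrib[name]              # drop eval:* attributes
        return ET.tostring(root, encoding="unicode")

    document = (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
        'xmlns:eval="%s">'
        '<Concept rdf:ID="Department" eval:count="23"/></rdf:RDF>' % EVAL_NS
    )
    print(strip_eval_attributes(document))  # eval:count has been removed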

Managing Explicit Feedback

The user is not aware of the implicit feedback; it is acquired automatically during the interaction with the software. Explicit feedback, on the other hand, expects some deliberate action from the user; he must willingly contribute change requests for the ontologies, with the purpose of adapting the ontologies to better fit his needs. Because the user only interacts with a small subset of the ontology through the ANNOT component, namely the concepts and their relations, only some specific updates of the ontologies can be performed here.

The service providing the interface for the explicit feedback channel will provide methods to:

    •    Add a new label or change the label for an entity.

    •    Add a new concept.

    •    Add a new relation between two concepts, or change an existing one.

    •    Give general feedback on the ontologies.

Submitting this information does not directly result in a change of the ontology. The information is added to the ontology repository, but it lies in the responsibility of the ontology engineer to decide whether the suggested change will later appear in the released ontologies (Listing X+8).
 <wsml:Concept rdf:ID="Department" eval:count="23">
    <rdfs:subClassOf rdf:resource="#AdministrativeEntity"/>
    <rdfs:label xml:lang="en" eval:default="true">department</rdfs:label>
    <rdfs:label xml:lang="fr" eval:default="true">département</rdfs:label>
    <rdfs:label xml:lang="de" eval:suggest="true"
                eval:count="1">Departement</rdfs:label>
    <wsml:hasAttributeDefinition>
      <wsml:AttributeDefinition>
         <wsml:forAttribute rdf:ID="hasPopulation" eval:count="4">
            <rdfs:range rdf:resource="#Population"/>
         </wsml:forAttribute>
         <wsml:forAttribute rdf:ID="hasName" eval:suggest="true" eval:count="4">
            <rdfs:range rdf:resource="#Name"/>
         </wsml:forAttribute>
      </wsml:AttributeDefinition>
    </wsml:hasAttributeDefinition>
 </wsml:Concept>

                   Listing X+8: Department concept description in WSML, RDF format with feedback.

In the example above we have received one request for an additional (German) label for the concept Department. Such a request is reasonable, and the ontology engineer might decide to add this modification to the official ontologies. Four people wanted a new attribute hasName for the concept. Here the engineer might have to take a different action: a generic relation hasName already exists, and the users just failed to spot it. The ontology engineer could either reconsider this modelling approach or simply ignore the change suggestions.
6       Semantic Annotation and Discovery Approach – STI, DERI, JSI, UoM
#...#

6.1      Semantic Annotation of WFS

#...#

Requirements for WEBSERVICE specification for Geographic Information Services

Figure 20 gives an overview of all the items that are involved in the process of annotating WFS in WSMX. Everything starts with, and depends on, the lower right of the picture: the real-world entities, which are represented as spatial information objects. These spatial information objects are encoded as features in the Geography Markup Language (GML) and served via OGC data services. In the following we concentrate on the requirements for annotating WFS, but WCS and WMS can be described according to the same principles.


[Figure 20 is a diagram. On one side it shows the ONTOLOGIES and WEBSERVICES written in WSML: the GML ONTOLOGY, the FilterEncoding ONTOLOGY, the FeatureType ONTOLOGY, the WFS ONTOLOGY, the Specific WFS WEBSERVICE, and the Domain ONTOLOGY with its axiomatized concept definitions that capture a specific view on the world. These are derived from the OGC specifications (WFS Implementation Specification, Filter Encoding Implementation Specification, GML Encoding Specification) and from the specific WFS descriptions (the WFS Capabilities obtained via GetCapabilities and the Feature Type Schema obtained via DescribeFeatureType). The Web Feature Service serves feature types that encode spatial information objects, which in turn represent the “real world entities”.]

         Figure 20: Information items involved in the annotation of a WFS in WSMX (Klien (2007), modified).

###

6.1.1     Decisions for Formalising Annotations

### OLD#

The reasoning capabilities needed for discovery in the SWING application determine which of the WSML
variants is to be used for encoding the ontologies. Some reasoning might also be needed for methods
supporting composition of web services, e.g. by providing the human programmer in the development
environment with information on which services can be appended to a given partial composition. Since such
methods require matching functionalities quite similar to discovery, we expect that they can be based on the
same WSML flavour as chosen for the discovery. First, we give an overview on existing approaches for
formally describing and reasoning on geographic information (Section 0). Second, we introduce and explain
the design decisions with respect to language and reasoning that we have taken in SWING (Section 0).

Discovery in WSML-Flight is based on query containment checking. That is, both the web service and the
goal are described by means of WSML-Flight queries; the web service is a match if the answer to its query is
contained in the answer to the goal query. (Alternatively, one can consider the web service a match if it
subsumes the goal query, or if the intersection of the two queries is non-empty.)

Putting the pieces together, our desired discovery query can be formulated as the following goal query:

    ?Cw subsumedBy ?Cs, hasSemanticAnnotation(?Cs,?Ca), ?Ca subsumedBy C, ?Cw[?Iw], ?Iw
              subsumedBy ?Is, hasSemanticAnnotation(?Is,?Ia), ?Ia subsumedBy I

The answer to this query will be all tuples Cw, Cs, Ca, Iw, Is, Ia as requested by our discovery requirement.
The web services themselves will be annotated with a WSML-Flight query of the form Cw, whose answer
(no variables) is simply Cw. This can then be matched with the answer to the goal query.

Note that the goal query corresponds to the discovery requirement in a very natural and intuitive way: reconsider the general natural-language requirement specified as [Query 2] at the start of this section; the logical statements in the WSML-Flight query are in 1-to-1 correspondence with the sub-sentences of the natural-language query. Since discovery queries should (ideally) be understandable for end users, this is a significant advantage over the more complex mathematical form of the DL query identified above. Also,
note that the WSML-Flight query actually needs to be answered only once to perform the discovery: its
answer contains all the outputs of the matching web services. This opens the path to optimized
implementations of discovery, and constitutes another advantage over DL. Finally, in WSML-Flight the
discovery can be formulated as a single reasoning task, whereas in DL we would need to construct a loop
around several reasoning tasks.

As of now, the subsumption testing in WSML-Flight (the meaning of the subsumedBy operator) is simply a
transitive closure over the explicitly stated subConceptOf relations in the ontology. While this suffices for
the typical ontologies taking the form of subsumption hierarchies, it may be too weak for more complex
representations in the future. However, this is not a problem as long as the hasSemanticAnnotation relation
is given only between atomic concepts, i.e., between concepts explicitly listed in the domain and web service
ontologies. Namely, under this condition the “bridge” between domain and web service ontologies will
always involve atomic concepts. Assuming additionally that the web service outputs and the goal concepts
will be atomic, the instantiations of all variables in the answer to the above goal query will be atomic. This
means that the subsumption relation can be pre-computed, once and for all, between all atomic concepts in
the ontologies, using a DL reasoner. This knowledge about subsumption between atomic concepts is then sufficient to obtain the correct discovery results using WSML-Flight as is. Since the creation of the hasSemanticAnnotation relation, the annotation of the web services, and the goal query will always involve interaction with the user, it is quite reasonable to assume that only atomic concepts will be involved there.
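The pre-computation described above amounts to a transitive closure. The following sketch illustrates the idea; it is a naive fix-point computation for illustration only, and in SWING the closure between atomic concepts would instead be obtained from a DL reasoner, as stated above.

    # Sketch: pre-compute subsumption between atomic concepts as the transitive
    # (and reflexive) closure of the explicitly stated subConceptOf edges.
    def subsumption_closure(sub_concept_of):
        """sub_concept_of: set of (child, parent) pairs from the ontologies."""
        closure = set(sub_concept_of)
        concepts = {c for pair in sub_concept_of for c in pair}
        closure |= {(c, c) for c in concepts}   # every concept subsumes itself
        changed = True
        while changed:                          # naive fix-point iteration
            changed = False
            for (a, b) in list(closure):
                for (c, d) in list(closure):
                    if b == c and (a, d) not in closure:
                        closure.add((a, d))
                        changed = True
        return closure

    edges = {("Point", "Geometry"), ("Geometry", "Object")}
    closure = subsumption_closure(edges)
    print(("Point", "Object") in closure)  # True: subsumedBy holds transitively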

### NEW#

In Lutz and Klien (2006), application ontologies are introduced for describing concepts that represent a
geographic feature type. This feature type concept is defined by referring to and further restricting existing
concepts and roles from the domain ontology. The purpose of the application ontology is to represent the
feature type’s semantics rather than to capture its application schema. Application concepts are thus derived
(as subconcepts) from existing domain concepts. The strength of this strategy lies in the possibility that not
only explicit information (i.e. what is represented in the schema) but also implicit information (e.g. the unit
of measurement values) is included in the concept definition. For ontology-based discovery of geographic
information, users also derive their search concepts from existing domain concepts. Thus both the application concepts and the search concepts become machine-comparable; subsumption reasoning will return a new taxonomy in which all subconcepts of the injected search concept satisfy the user’s requirements. This
approach allows for expressive query formulation and processing. However, the strategy does not separate
strictly between ontologies that capture real-world entities and ontologies that describe information objects.
By deriving application concepts (as subconcepts) from domain concepts, taxonomic relations are
established where – from an ontological point of view – no taxonomy exists. For example, the feature type
exploitationpunctual denotes an information object that models real-world quarry entities. Thus, it makes
sense to model the application concept exploitationpunctual as subconcept of INFORMATIONOBJECT (a
concept in the domain ontology that represents information objects); a feature is an information object. It
does not make sense though to model the application concept exploitationpunctual as subconcept of QUARRY
(a concept in the domain ontology that represents real-world quarries); the feature is not a quarry, but it
conveys the semantics of representing real-world quarries. Hence, it makes more sense to express this in a
non-taxonomic hasSemanticAnnotation relation (more detail on this follows in Section 0):

                     hasSemanticAnnotation (fto#exploitationpunctual, domain#Quarry)

While subsumption reasoning in Lutz and Klien (2006) allows expressive query processing, an explicit link between the data source’s schema and its application ontology is still required to support users in formulating filters for data retrieval. In order to establish this link, they use registration mappings. The main idea of registration mappings is to have separate descriptions of the application concept and of the structural details of the feature type it describes. This has the advantage that the semantics of the feature type can be specified more accurately in application concepts, because the specification does not try to mirror the feature type’s structure. This is especially true for feature types that do not reflect the conceptual model of the domain well. However, registration mappings easily become quite complex, especially if a 1-to-1 mapping is not possible. Creating application ontologies together with registration mappings is a manual process for which automation is not feasible. One of the main goals of the SWING project is to support the annotation of geographic information by automating the annotation process. Using feature type ontologies as introduced in Section 0 simplifies matters considerably. A feature type ontology directly represents the feature type schema, and by establishing an explicit link to the domain ontology through the hasSemanticAnnotation relation, expressing a feature type’s semantics is also possible. In Section 0, we show how this relation can be formalized. This strategy will probably not support annotations and query formulations as expressive as the approach in Lutz and Klien (2006), but it leaves the tasks of implementing automated annotation support (please refer to Deliverable 4.1 for more details) and automated retrieval formulation quite feasible.

We have already cited Lutz (2006) in Section #...# for listing the requirements of functional descriptions of geoprocessing services. For the functional description, a language is needed that allows putting constraints on input and output (e.g. a service that calculates a distance between two spatial objects requires both objects to be in the same coordinate reference system) as well as on the relation from input to output. In DL, there is no way of expressing dependencies between inputs and outputs; one can only represent their respective types. For this reason, Lutz (2006) proposes to combine DL with first-order logic (FOL); this provides the expressivity needed, but comes at the cost of complex and inefficient FOL theorem proving.

For the above reasons, neither the application ontologies in combination with registration mappings as
introduced in Lutz and Klien (2006), nor DL and subsumption reasoning, nor the expansion with FOL and
FOL theorem proving seem to fit well with the SWING requirements. Consequently, we have developed our
own strategy for generating and representing formal descriptions for web services and user queries. In the
following, we explain how the hasSemanticAnnotation relation is used to provide the link between
application (i.e. user queries and webservices) and domain level. In addition, the advantages of using a logic
programming formalism like WSML-Flight compared to DL are discussed.

Design Decisions for SWING

User queries in SWING will be posed in the form C.I where the user wants to obtain information I about
entities of concept C. For example, the user might want to obtain the production rate (I) of quarries (C). In
our framework laid out above, this discovery query requires to:

         Find all web services whose output concept Cw has a semantic annotation with concept Ca
        where Ca is subsumed by (is a sub-concept of) C, and where Cw has an attribute of a concept
              Iw so that Iw has a semantic annotation with concept Ia, and Ia is subsumed by I.
                                                                                                        [Query 1]

Put more intuitively, we want to discover all web services that output instances annotated with the desired
concept C, and that provide an attribute annotated with the desired information I. In fact, to achieve this, we
can generalize the above requirement as follows:
            Find all web services whose output concept Cw is subsumed by a concept Cs that has a
           semantic annotation with concept Ca where Ca is subsumed by C, and where Cw has an
       attribute of a concept Iw so that Iw is subsumed by a concept Is that has a semantic annotation
                                   with concept Ia, and Ia is subsumed by I.
                                                                                                   [Query 2]

While this statement might be confusing at first glance, its meaning is quite simple. As we have discussed in
Section #...#, there are two “sides” in our framework, the domain ontology side and the web service ontology
side. The connection between those two sides is the semantic annotation. The user poses the query on the
domain ontology side. What we want to find, then, is a semantic annotation “bridge” between the two sides.
This is formed by the concepts Cs - Ca, and Is - Ia in the above statement. On both sides of that bridge, we
can allow subsumption, i.e., the bridge does not need to be directly between the web service output and the
goal.

There are several options for how to formulate and address the above discovery requirement in the WSML context. One obvious solution is to base the discovery process on description logics (DL) reasoning. In DL, the semantic annotation can be realized by introducing a new relation “hasSemanticAnnotation” and the following axioms. For each pair Cw, C where Cw has semantic annotation C, one introduces the axiom:

                           Cw subsumedBy (exists hasSemanticAnnotation.C)

The meaning of this DL axiom is that, for every instance x of Cw, there exists an instance y of C such that x has semantic annotation y. This is exactly the intended meaning of the semantic annotation. Our discovery requirement can then be formulated as follows:

           Find all web services whose output concept Cw satisfies Cw subsumedBy (exists
      hasSemanticAnnotation.C), and where Cw has an attribute of a concept Iw so that Iw satisfies
                         Iw subsumedBy (exists hasSemanticAnnotation.I).
                                                                                                   [Query 3]

That is, the requirements on each of Cw and Iw can be posed as a single reasoning task to a standard DL subsumption reasoning engine. The discovery process can then be implemented simply as a loop over all web services in the repository: for each web service, one call to DL reasoning checks whether the output concept is suitable, i.e. whether Cw subsumedBy (exists hasSemanticAnnotation.C) holds; if so, a loop over the output attributes checks whether there is one such that Iw satisfies Iw subsumedBy (exists hasSemanticAnnotation.I). Note here that the subsumption on “both sides of the semantic annotation bridge” is naturally taken care of by the DL operator subsumedBy, which allows intermediate concepts between the two compared concepts.
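The loop just described can be sketched as follows. The reasoner interface and the service attributes are hypothetical; a real implementation would call an actual DL reasoner and read the concepts from the service descriptions.

    # Sketch of the DL-based discovery loop (hypothetical interfaces).
    def discover(web_services, reasoner, C, I):
        """Return all services whose output is annotated with (a sub-concept
        of) C and that carry an attribute annotated likewise for I."""
        matches = []
        for service in web_services:
            Cw = service.output_concept
            # One reasoner call per service for the output concept.
            if not reasoner.subsumed_by(Cw, ("exists", "hasSemanticAnnotation", C)):
                continue
            # Inner loop over the output attributes.
            for Iw in service.output_attributes:
                if reasoner.subsumed_by(Iw, ("exists", "hasSemanticAnnotation", I)):
                    matches.append(service)
                    break
        return matches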

While the formulation as DL is quite natural, it comes with two severe disadvantages, which have already
been discussed in Section 0. First, DL reasoning is known to be expensive and can become a bottleneck in
the envisioned practical use of our techniques. Second, the approach lacks extendibility to Web Processing
Services (WPS). Instead, we have decided to use WSML-Flight as the underlying formalism in SWING. This
WSML dialect is based on the logic programming paradigm. It allows sharing variables between inputs and
outputs, and hence formulating dependencies between the two. Further, logic programming has been
intensely researched for a long time, and the required reasoning can be implemented very efficiently. The
drawback of the formalism, at least in the current stage of WSML-Flight, is that the possible subsumption
tests are quite limited. In particular, complex DL operators such as the existential restriction (“exists”) used in the above formulation are not allowed. Future research, possibly within the duration of SWING, will show if the
subsumption reasoning within WSML-Flight can be extended; for now, our conclusion is that the available
functionalities are sufficient for our purposes in SWING. Another possible option for more complex
subsumption testing is outlined at the end of this section.

In the WSML-Flight context, the semantic annotation bridge is realized simply by a binary predicate
connecting concept names. That is, for each pair Cw, C where Cw has semantic annotation C, we introduce
the statement:

                                     hasSemanticAnnotation(Cw,C)
Further, WSML-Flight allows direct reference to all attributes listed for a concept, by a statement of the form:

                                                                ?Cw[?Iw]

Here, the question marks indicate that ?Cw and ?Iw are variables. Namely, WSML-Flight formulas are
queries as in databases, and the answer to a query is the set of all tuples of instances in the database that
satisfy the query. E.g. the answer to the query ?Cw[?Iw] consists of all pairs Cw, Iw so that Iw is an
attribute of Cw. In this way, we can formally express the semantic annotation relation between two concepts
as introduced in Section 0.

###

Based on the requirements analysis of the first year (SWING D3.1), we decided to formalize semantic annotations within the SWING project as a binary relation between a concept specification, which represents the information source, and elements from the domain ontology (representing conceptualisations of real
world entities). Using WSML-Flight, we have argued that the semantic annotation is realized by a binary
predicate connecting concept names. That is, for each pair (Cw, C) where Cw has semantic annotation C, we
introduced the statement:

           annotate(Cw, C)

Based on the experience gained from working with WSML-Flight in annotation and discovery experiments, we conclude that this approach is insufficient. In the following we give a simple example of why our original strategy for formalising annotations is not expressive enough for the purposes of the SWING project. From there, we continue by introducing the approach of “instance-based” annotation, which is then examined with more complex examples.

Challenges for Semantic Annotation

The following extract is taken from a Feature Type Ontology (FTO), which is the WSML representation of an application schema (Section 3.1.1 of SWING D3.1). The feature type “exploitation” and its attributes (Listing X+9) need to be annotated with elements from the QUARRY ONTOLOGY, which was developed to formalise knowledge from the domain of quarrying and mining. This FTO is a representation of one of the publicly available feature types that are used within the project27. The naming of feature types and attributes is kept original.

           concept exploitation subConceptOf gml#Feature
            /* name of the exploitation site */
            name impliesType (0 1) _string
            /* name of the owner of the exploitation site */
            owner impliesType (0 1) _string

                                              Listing X+9: Subset of the “exploitation” FTO.

The feature type “exploitation” represents quarry sites in France. Its attributes “name” and “owner” both refer to the name of something: “name” denotes the name of the exploitation site, whereas “owner” refers to the name of the owner of the exploitation site. This situation leads to a dilemma for the annotation. While the feature type itself can be annotated without problems, the annotation of its attributes, which is required to specify the meaning of all data elements, is challenging. If we want to apply our strategy of defining 1:1 mappings with existing domain concepts, we have only two possibilities:

       •    A very fine grained domain ontology that provides concepts for every possible situation, e.g.
            OwnerName, QuarryName, IndustrialSiteName, etc.



27
     The WFS that provides information about quarries is available from http://swing.brgm.fr/cgi-bin/carrieres.
      •    Annotate both attributes with the same (generic) domain concept Name.

The first approach is unrealistic and not desirable from the ontology engineering perspective. In the second approach, the implicit information about what kind of entity is named cannot be made explicit. Making implicit relations explicit is the core benefit of defining semantic annotations, so this is not a desirable solution either. For these reasons, we need a strategy that allows the formalization of annotate relations beyond a simple 1:1 mapping. 1:n mappings could be realised by annotating one concept of the FTO with many concepts of the domain ontology; however, the problems listed above would remain. We have explored various alternative strategies, aiming to formalise the annotation link in a way that provides an optimal trade-off between expressiveness and reasoning capabilities within the framework of WSMO and logic programming. The solution documented in the next section appears most promising.

An Approach of Instance-Based Annotation

Since, in the domain ontology, restrictions are defined at the instance level (by axioms), symbols denoting instances of domain concepts are necessary in order to use these descriptions for reasoning. When annotating at the concept level, the only specifications from the domain ontology that are used are the concept definitions and the explicitly defined sub-concept relationships, i.e. relationships that are defined via the subConceptOf declaration.

The challenge introduced above can be met by shifting the annotation from the concept to the instance level. In this approach, a mapping is proposed between symbols standing for instances of feature type concepts and symbols standing for instances of domain concepts. In this case, Iw denotes an instance of the concept Cw and I an instance of the concept C. We introduce the following statement for the annotation:

          annotate(Iw, I)

In a Web Service post-condition, the annotate relation is used to map the advertised feature type instance to an anonymous domain instance. Values of attributes of the feature type are mapped to values of attributes of the related anonymous domain instance. If an attribute of a feature type cannot be mapped directly to an attribute of the domain instance with which it is annotated, generic relations come into play. In this approach, relations are defined globally instead of being specific to a certain concept. Generic relations utilize axioms to specify implications on their arguments.

In the following we illustrate the benefits of the approach and refer to examples taken from the SWING use
cases. Each example is structured in the same way. A WSML Web Service description of one of the WFSs
from the use cases is introduced first. These descriptions include the annotation following the suggested
approach. Second, a WSML goal representing a user query is given. Here the annotation is given in the same
way as in the Web Service description. Finally, we show the matchmaking between the goal and the Web
Service description in order to highlight the resulting reasoning capabilities. The third example shows how
the above challenge is met.

The introduced approach implements and extends the annotation relation as proposed in SWING D3.1. The examples show the flexibility of the approach and illustrate that unrealistically fine-grained domain ontologies are not required for describing the semantics of data content. Furthermore, implicit relations, like the kinds of entities that are named or owned, can be made explicit when following the instance-based approach.

#Examples from the SWING use cases can be found in the demo section at the end of this chapter.#

###

6.2       Semantic Annotation and Discovery for WPS

#...#

WPS Description in Logic Programming

To cope with these requirements, an expressive language is needed that still enables efficient reasoning tasks
during discovery. Description Logics (DL, (Baader, Calvanese et al. 2003)) is the most prominent language
in the Semantic Web, but has some inherent restrictions. Since, in DL, it is "impossible to describe classes
whose instances are related to another anonymous individual via different property paths" (Grosof, Horrocks
et al. 2003), no dependencies between Web Service input and output can be formalised. Moreover, even
some (pre- or post-) conditions such as "both input polygons should adhere to a common (but anonymous)
coordinate reference system" cannot be expressed. Hence, we need a more expressive language. First-order
predicate logic (FOL, (Russell and Norvig 2003)) provides the required expressivity but this comes at a cost:
reasoning in FOL is no longer decidable and therefore, as described in (Bovykin and Zolin 2006), the
possibility to build an automated discovery engine is limited. Furthermore, FOL axioms are often complex
formulas (containing e.g. complex quantifications of variables), which makes annotation a complicated task.

It has been shown that the logic programming (LP, (Nilsson and Maluszynski 2000; Ullman 1989)) paradigm could be a solution. Basing semantic Web Service discovery on the LP paradigm seems a good trade-off between expressivity, reasoning complexity and feasibility of annotation. WSML-Flight can be used for specifying the semantics of processing services if the services are described as conjunctive LP queries. Furthermore, LP has recently gained increasing popularity in the Semantic Web. The Web Service Modelling
Language WSML (Roman, Lausen et al. 2005) provides the required infrastructure for LP-based Web
Service discovery.

LP transfers the declarative style of FOL to the realm of computer programming. In particular, LP focuses on
logical reasoning problems that are decidable and that are hence suitable as the basis of practical
applications. LP languages provide a formal syntax that includes: constant symbols for the representation of
individuals; predicate symbols for specifying relations between individuals; logical connectives such as “or”,
“and”; variables for making general statements about unknown individuals. In particular, herein we focus on
DATALOG (Dahr 1996), which features LP rules of the form:

                                           P :− Q1 ∧ Q2 ∧ ... ∧ Qn

The meaning of such a rule is that, whenever all the Qi are true, then P is also true. The left hand side of the
rule is its head, the right hand side is its body. Here, P and each Qi have the form p(t1, …, tm) where p is a
predicate of arity m, and each ti is either a variable or a constant. One special form of rules is formed by
those with an empty body and a head that is ground, i.e. whose arguments contain no variables. Such rules
are called facts. Facts can be used to specify the instances (represented as constant-symbols) on which a
certain predicate holds.

The most prominent reasoning task in LP is query answering, i.e. the process of deriving new facts from a
database of already established facts. Following the characterization of conjunctive DATALOG queries
(Ullman 1989), we define:

 A conjunctive query is a rule in which a predicate is defined in terms of one or more predicates other than
                                                    itself.

In other words, every non-recursive rule is a conjunctive query. The predicates appearing in the body of a
conjunctive query are referred to as the query's sub-goals. The variables appearing in the head of a
conjunctive query are called exported variables; this may be an arbitrary subset of the variables appearing in
the rule body.
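For illustration, consider the following hypothetical rule, formulated in the vocabulary of the use cases:

                                   gkPolygon(X) :− polygon(X) ∧ hasSRS(X, gk)

This is a conjunctive query with the sub-goals polygon and hasSRS and the exported variable X. Given the facts polygon(a) and hasSRS(a, gk), query answering derives the new fact gkPolygon(a).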

Functional Descriptions as Conjunctive Queries

Following the framework for semantic Web Services provided by WSMO (Fensel, Lausen et al. 2006), the
functional descriptions of WPSs and of service requests consist of pre-conditions, post-conditions and a set
of shared variables appearing in both. These pre- and post-condition definitions include a specification of
the signature as well as further constraints on the input and output. The terminology is derived from the
GEOGRAPHIC DATA TYPES ONTOLOGY. The post-condition specification additionally contains the
operation description derived from the GEOGRAPHIC OPERATIONS ONTOLOGY. Those variables that
are marked as shared appear in both pre- and post-conditions; they are used to formalise dependencies
between input and output. Both pre-conditions and post-conditions have the form of the body of a
conjunctive query.
For example, the functional description F below (either a request or a Web Service advertisement) annotates the
intersection operation on two polygons adhering to the Gauß-Krüger reference system. Fpre represents the
pre-condition, Fpost the post-condition and Fshare the shared variables.

                    Fpre : polygon(A) ∧ polygon(B) ∧ hasSRS(A, gk) ∧ hasSRS(B, gk)
                    Fpost : polygon(C) ∧ intersection(A, B, C)
                    Fshare : A, B

Web Service Discovery by Query Containment

The key notion in our discovery work is query containment, which is defined as follows:

A conjunctive query Q1 is contained in a conjunctive query Q2 if, whatever the established database of facts
   is, the set of additional facts provable from Q1 is a subset of those provable from Q2 (Ullman 1989).

In other words, Q1 ⊆ Q2 if and only if the set of facts on which the head of Q1 holds is a subset of the set
of facts on which the head of Q2 holds, independently of the actual database that is evaluated by the two
queries. Due to this independence from a particular database, query containment can be tested syntactically
based on the structure of the two queries and on the set of rules given in the DATALOG ontology. An
interesting characterization of when a containment holds is in the form of so-called containment mappings.
Containment mappings turn the containing query Q2 into the contained query Q1 by mapping each sub-goal
from Q2 to a corresponding sub-goal that can be derived – by applying rules – from the body of Q1 .
Namely, a function h from the set of symbols (predicates, constants, variables) used in Q2 into the set of
symbols used in Q1 is said to be a containment mapping from Q2 to Q1 if:

    •   h is the identity function on constants;

    •   h maps the head of Q2 to the head of Q1 , that is: h ( H ) = I ;

    •   and, for each sub-goal Gi(y1, …, yn) in the body of Q2, Gi(h(y1), …, h(yn)) is a fact in the
        deductive closure of the body of Q1.

In other words, a containment mapping from a query Q2 to a query Q1 exists if and only if the body of Q1
logically implies the body of Q2 : the deductive closure of the body of Q1 is the set of all facts that can be
derived from the body of Q1 by iteratively applying DATALOG rules until a fix-point is reached. It is hence
easy to see that a query containment relation Q1 ⊆ Q2 holds if and only if there exists a containment
mapping from Q2 to Q1 . In the context of service advertisements and request descriptions, we use the
following syntax:

    •   Rpre/post/share refers to the functional description of a request,

    •   Spre/post/share refers to the functional description of a service,

    •   im refers to the containment mapping between the inputs of the request and the service description

    •   im-1 refers to the containment mapping between the inputs of the service and the request description

    •   om refers to the containment mapping between the outputs of the request and the service description
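
Before turning to matchmaking, a small worked example may illustrate the notion of containment mapping (squarePolygon is a hypothetical predicate, introduced only here). Assume the ontology contains the rule polygon(X) :− squarePolygon(X), and consider the two queries

                                           Q1 : q1(X) :− squarePolygon(X) ∧ hasSRS(X, gk)
                                           Q2 : q2(Y) :− polygon(Y)

The function h with h(Y) = X, mapping the head of Q2 to the head of Q1, is a containment mapping from Q2 to Q1: the single sub-goal polygon(Y) is mapped to polygon(X), which lies in the deductive closure of the body of Q1 via the rule above. Hence Q1 ⊆ Q2 – every fact provable through Q1 is also provable through Q2.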

On this basis, we developed different notions of matchmaking between the functional descriptions introduced
in the previous section. All of them are closely related to the ones introduced in (Zaremski 1996). Predicated
matches, which compare the request's pre- and post-conditions with the pre- and post-conditions of advertised
services in one matchmaking step, are detailed in Appendix A.1. They may either deliver wrong results
(low precision) or fail even if the WPS provides the requested functionality (low recall). The
more sophisticated Plug-In Match is described in Appendix A.2. It compares the pre-conditions and post-
conditions separately, but is not capable of filtering out those permutations of input variables that possibly
deliver wrong results (low precision). For a non-relaxed match, i.e. a match with high precision and high
recall, we developed the Extended Plug-In Match, as described in the following.

Extended Plug-In Match

In order to avoid the drawbacks described in the previous section, we formulate the following conditions on
the input and output mappings. The desired match should only succeed if:

                          (1) im is injective...
                                 (a) ... over all of Spre; that is: ∀x, x′ ∈ Spre : im(x) = im(x′) ⇒ x = x′
                                 (b) ... with respect to Rshare; that is:
                                         ∀x, x′ ∈ Spre : im(x) = im(x′) ∈ Rshare ⇒ x = x′
                          (2) ∀a ∈ Rshare there exists x ∈ Spre with im(x) = a
                          (3) ∀a ∈ Rshare, ∀x ∈ Spre : im(x) = a ⇒ x ∈ Sshare
                          (4) ∀a ∈ Rshare : im⁻¹(a) = om(a)

To give a short summary of the conditions: Condition (1a) ensures that the Web Service's distinct input
variables are instantiated with distinct user input when executing the Web Service. This is quite a strong
requirement for matchmaking and could be relaxed to (1b); at least condition (1b) is required for
conditions (2) to (4). Condition (2) ensures that each of the request's shared variables is used as input when
executing the WPS. Condition (3) goes one step further, since it requires that each of the request's shared
variables is not only used as input but also appears in the output when executing the service (i.e. that the
shared variables are used as such). Together with condition (4), this ensures that the Web Service
"behaves" in the way requested. From conditions (1) to (3) it directly follows that a Web Service must have
at least as many shared variables as a request. Conditions (1), (2) and (3) also imply that im is a bijection
between a subset of Sshare and Rshare. Hence we can assume that the inverse mapping im⁻¹ exists for all
of the request's shared variables Rshare. In condition (4), we require this inverse input mapping to be equal
to the output mapping. This ensures that request and Web Service agree on the way the output is computed
from the input; hence, condition (4) allows the formalisation of dependencies between Web Service input
and output. Condition (4) is obviously the strongest one and requires im to meet conditions
(1a)/(1b) (injectivity), (2) and (3) (surjectivity). In order to meet these conditions when matching functional
descriptions, we define the Extended Plug-In Match.
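
To make conditions (1) to (4) concrete, the following sketch checks them for a candidate pair of mappings. It is an illustration only, not the SWING implementation (which realizes the checks via the order, dif and shared predicates introduced below); all variable names are hypothetical.

     from itertools import permutations

     def meets_conditions(im, om, S_pre, S_share, R_share):
         # im: service pre-condition variable -> request variable
         # om: request shared variable -> service variable
         images = [im[x] for x in S_pre]
         if len(images) != len(set(images)):            # (1a) im is injective
             return False
         if any(a not in images for a in R_share):      # (2) every shared request
             return False                               #     variable is used as input
         for x in S_pre:                                # (3) only shared service variables
             if im[x] in R_share and x not in S_share:  #     map to shared request variables
                 return False
         im_inv = {im[x]: x for x in S_pre if im[x] in R_share}
         return all(im_inv[a] == om.get(a) for a in R_share)  # (4) im^-1 = om on R_share

     # For the intersection example above (Fshare: A, B) and a service with
     # S_pre = S_share = {x1, x2}, enumerate the candidate input mappings:
     for p in permutations(["A", "B"]):
         im = dict(zip(["x1", "x2"], p))
         om = {a: x for x, a in im.items()}  # an output mapping consistent with im
         print(im, meets_conditions(im, om, ["x1", "x2"], ["x1", "x2"], ["A", "B"]))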

Denoting with x1...xn (a1...am) the Web Service's (request's) pre-condition variables and with y1...yl (b1...bk)
the Web Service's (request's) post-condition variables, we define the following two queries28:

                          SQ ≡ Spost ∧ order(x1, …, xn) ∧ (⋀_{xi ∈ Sshare} shared(xi)) ∧ (⋀_{xi, xj ∈ Spre, i ≠ j} dif(xi, xj))        (5)




28
     We assume that the dif(..) predicate is symmetric.

With:

                          X = Rshare ∪ { ai ∈ Rpre | ∃ yj ∈ Spre : im(yj) = ai }

                          RQ ≡ Rpost ∧ order(im(x1), …, im(xn)) ∧ (⋀_{ai ∈ Rshare} shared(ai)) ∧ (⋀_{ai, aj ∈ X, i ≠ j} dif(ai, aj))

The order(...) predicate is used to filter out those input mappings im that cannot fulfil condition (1)
(injectivity). The dif(..) predicates ensure condition (2) (surjectivity with respect to Rshare). We have to
include the dif(..) predicate in RQ not only for all of the request's shared variables but also for all
variables that are mapped to by im (hence the definition of X), in order to ensure that im is still surjective
with respect to Rshare even if Web Service and request have only one shared variable each.

The shared(.) predicates guarantee that im maps only Web Service variables that are shared to the request's
shared variables and hence ensure condition (3). All of these predicates are required to ensure condition
(4). Denote with ai, …, aj ⊆ a1…am (xk, …, xl ⊆ x1…xn) the subset of the request's (Web Service's)
pre-condition variables that are part of the request's (Web Service's) input type signature, and with b (y)
the request's (Web Service's) output variable. We now define the Extended Plug-In Match as:

                   1. q(ai, …, aj) :− Rpre  ⊆  q(xk, …, xl) :− Spre                                           (6)
                   2. Test for every im whether: q(y) :− SQ  ⊆  q(b) :− RQ

Appendix A.4 shows that this match succeeds only if conditions (1a), (2), (3), and (4) hold.

6.3     Application Scenarios: SWING Ontologies for Annotation, Discovery

Brief examples are given for WFS and WPS annotation and discovery, taken from the real scenarios (from
deliverables and demo examples). This section will include pointers to Sections 6 and 7.

The execution part should describe the (hopefully three) working execution examples and detail the use of
the ontologies within the execution. This section will include pointers to Sections 2, 3 and 8.

###

First Example: Sub-Concept Relations

This example shows how the approach functions if a sub-concept relationship is defined via an axiom in the
domain ontology instead of using the subConceptOf construct.

WEB SERVICE

The following Web Service returns an instance (?z), which is annotated with an instance of
domain#QuarrySubConcept.
      sharedVariables ?z


      postcondition InstanceAnnotationServicePostcondition
        definedBy
         ?z memberOf fto#exploitation and
         annot#annotate(?z,?quarry) and
         ?quarry memberOf domain#QuarrySubConcept

         Listing X+10: Shared variables and post-condition of a WSML description of a WFS offering exploitation data.

GOAL

The goal requests an instance (?x) that is annotated with ?Quarry. ?Quarry is an instance of
domain#Quarry (Listing X+11).



      postcondition InstanceAnnotationGoalPostcondition
        definedBy
         ?x memberOf ?C and
         annot#annotate(?x, ?Quarry) and
         ?Quarry memberOf domain#Quarry.

                         Listing X+11: Post-condition of a WSML goal asking for data about quarries.

MATCH

If we now have either the following axiom in the domain ontology...:


      ?x memberOf QuarrySubConcept implies ?x memberOf Quarry.
                                    Listing X+12: Axiom of the QUARRY ONTOLOGY.

...or the following concept declaration...:


      concept QuarrySubConcept subConceptOf Quarry
                            Listing X+13: Sub-concept definition of the QUARRY ONTOLOGY.

...we get a match.

Although both of these definitions are logically equivalent, the WSML-Flight subsumption reasoning at the
concept level only succeeds if sub-concept relationships are explicitly declared via subConceptOf, i.e. the
implication is not used for reasoning in WSML-Flight. Considering the example, with the initially proposed
approach only the declaration in Listing X+13 leads to a match; the axiom presented in Listing X+12 does not.

Second Example: Use of Attribute Definitions

This example shows how the instance-based discovery approach makes use of attribute definitions specified
at the domain level. Attribute values of feature type attributes are mapped to attribute values representing
domain concepts. We have the following domain concept domain#Quarry:
      concept Quarry
       hasLocation impliesType QuarryLocation

                       Listing X+14: The concept Quarry definition of the QUARRY ONTOLOGY.

WEBSERVICE

The following Web Service description advertises a feature of type exploitation that offers information
about quarries (Listing X+15). The feature additionally provides information about the quarry location. In
WSML: the Web Service offers an object (?z), which has an attribute msGeometry with the value
?geometry. ?z is annotated with ?quarry, an instance of domain#Quarry. ?geometry is annotated with
some instance ?location, which is the value of the attribute hasLocation of ?quarry.


      postcondition InstanceAnnotationServicePostcondition
       definedBy
        ?z[msGeometry hasValue ?geometry] memberOf fto#exploitation and
        annot#annotate(?z, ?quarry) and
        annot#annotate(?geometry, ?location) and
        ?quarry[hasLocation hasValue ?location] memberOf domain#Quarry.

                    Listing X+15: Post-condition of a WSML description of a WFS offering exploitation
                                    data including information about quarry location.

GOAL

The following goal requests data about quarries that also provide information about the quarry location
(Listing X+16). In WSML: the goal requests some object (?x) that is annotated with some ?y, which is a member
of domain#Quarry. ?x needs to have some attribute (?att) with the value ?attValue. ?attValue in turn
is annotated with ?location, an instance of domain#QuarryLocation.


      postcondition UserGoalPostcondition
       definedBy
        ?x[?att hasValue ?attValue] and
        annot#annotate(?x, ?y) and
        annot#annotate(?attValue, ?location) and
        ?y memberOf domain#Quarry and
        ?location memberOf domain#QuarryLocation.

            Listing X+16: Post-condition of a WSML goal asking for data about quarries including their location.

MATCH

The attribute hasLocation of domain#Quarry was defined using the statement impliesType
(hasLocation impliesType QuarryLocation). This statement defines that if the attribute
hasLocation is present, then the range of this attribute is always a member of the concept
domain#QuarryLocation. Hence, the reasoner infers that the Web Service value ?location is a member
of domain#QuarryLocation. Therefore, we get a match.

Third Example: Use of Generic Relations

The following example shows how the instance-based approach can be used with generic relations, i.e.
relations that are defined globally instead of being specific to a certain concept. Consider the following
feature type in the Feature Type Ontology:
      concept exploitation
        name impliesType _string
        owner impliesType _string

                                    Listing X+17: Exploitation feature type of the FTO.

The attribute name refers to the name of the exploitation site. The attribute owner refers to the name of the
owner of the exploitation site. To be able to cope with such cases, we introduced the generic relations
owns(..) and names(..) at the domain level (Listing X+18).

The first line states that if the names relation holds between two instances, the first instance is automatically
categorised as being a member of the concept Name and the second instance as being a member of the
concept NamedObject. The axiom namesDefinition states that if the names relation holds between
two instances, then the hasName relation exists as its inverse. Similar statements are used to define the
relations owns and ownedBy.


      relation names(impliesType Name, impliesType NamedObject)


      axiom namesDefinition
        definedBy
         names(?x, ?y) equivalent hasName(?y, ?x).


      relation owns(impliesType Owner, impliesType OwnedThing)


      axiom ownsDefinition
        definedBy
         owns(?x, ?y) equivalent ownedBy(?y, ?x).

              Listing X+18: Defining the generic names and owns relations using the instance-based approach.

WEB SERVICE

The following Web Service advertises some feature (?x) that has an attribute (name) with the value ?name.
?x is annotated with an instance (?Quarry) of domain#Quarry. ?name is annotated with some instance
(?quarryName) that names ?Quarry.


      postcondition InstanceAnnotationRequestPostcondition
        definedBy
         ?x[name hasValue ?name] memberOf fto#exploitation and
         domain#annotate(?x, ?Quarry) and
         ?Quarry memberOf domain#Quarry and
         annot#annotate(?name, ?quarryName) and
         domain#names(?quarryName, ?Quarry).

                  Listing X+19: Post-condition of a WSML description of a WFS offering exploitation data
                                   including information about the name of the quarry.
GOAL

This goal searches for quarries that also provide their name.
      postcondition GoalPostcondition
       definedBy
         ?x[?att hasValue ?attValue] and
         annot#annotate(?x, ?y) and
         ?y memberOf domain#Quarry and
         annot#annotate(?attValue, ?Name) and
         domain#hasName(?y, ?Name).

              Listing X+20: Post-condition of a WSML goal asking for data about quarries including their name.

MATCH

Since hasName(..) is defined as the inverse of names(..) (Listing X+18), we get a match. This works
similarly for the owns(..) relation and indicates how the approach of instance-based annotation meets the
annotation challenge (Section 2.1.1).

Fourth Example: Use of Axioms

This example shows how the instance-based approach makes use of logical expressions at the domain level,
which may further restrict attribute values of instances of a specific concept. Consider the following example
axiom from the domain ontology. It specifies that every production rate is measured in tons per year.


      ?x[isQuantifiedBy hasValue ?quantity] memberOf ProductionRate
       implies
         ?quantity memberOf DerivedQuantity and
         ?quantity[hasUnitOfMeasure hasValue TonsPerYear].

       Listing X+21: Axiom of the QUARRY ONTOLOGY that specified the unit used for quantifying production rates.

WEB SERVICE

The following Web Service post-condition (Listing X+22) advertises some feature that provides information
about quarries and their allowed production rate (the annotation with the quarry is omitted here for
brevity). In WSML: the Web Service advertises an object (?z) that has an attribute
(fto#AllowedProduction) with the value ?production. ?production is annotated with some
?ProductionRate, an instance of domain#ProductionRate. The expression
[domain#isQuantifiedBy hasValue ?quantity] is required so that the reasoner knows that there is
some value assigned to the isQuantifiedBy property. Omitting this expression would mean that no value
is assigned to isQuantifiedBy or, in other words, that the production rate denoted by
?ProductionRate is not quantified. The goal in Listing X+23 would be unable to discover such a service, as it
(i.e. the goal) directly constrains the quantification of a production rate. Services that provide un-
quantified production rates clearly do not satisfy the goal's constraint that the quantity needs to be measured
in tons per year.
      postcondition InstanceAnnotationServicePostcondition
       definedBy
         ?z memberOf fto#exploitation and
         ?z[fto#AllowedProduction hasValue ?production] and
         annot#annotate(?production, ?ProductionRate) and
         ?ProductionRate memberOf domain#ProductionRate and
         ?ProductionRate[domain#isQuantifiedBy hasValue ?quantity].

                 Listing X+22: Post-condition of a WSML description of a WFS offering exploitation data
                               including information about the quantity of production rate.

GOAL

The following goal requests data about quarries and their production rate (in Listing X+23, the quarry-related
annotation is again omitted for brevity). The production rate must be measured in tons per year. In
WSML: the goal requests an object (?x) that has some attribute (?att) with the value ?attValue.
?attValue needs to be annotated with some ?ProductionRate, an instance of
domain#ProductionRate. ?ProductionRate is required to be quantified by some ?quantity that
has the unit of measure domain#TonsPerYear.


      postcondition UserGoalPostcondition
       definedBy
         ?x[?att hasValue ?attValue] and
         annot#annotate(?attValue, ?ProductionRate) and
         ?ProductionRate memberOf domain#ProductionRate and
         ?ProductionRate[domain#isQuantifiedBy hasValue ?quantity] and
         ?quantity[domain#hasUnitOfMeasure hasValue domain#TonsPerYear].

                 Listing X+23: Post-condition of a WSML goal asking for data about quarries including a
                                    quantity of their production rate in tons per year.

MATCH

Due to the axiom above, we get a match between the goal and the service. The reasoner infers that the
variable ?quantity in the Web Service post-condition has the attribute domain#hasUnitOfMeasure
with the value domain#TonsPerYear.

###

Functional Descriptions in WSMO

In the following, we give examples of functional descriptions in the WSMO framework. Details about the
implementation of the corresponding processing-service discovery can be found in SWING D2.4.

In our application, the domain ontologies are formalised as WSMO-Ontologies restricted to WSML-Flight.
The functional descriptions are formalised as WSMO-Web Service capabilities in case of WPS
advertisements and WSMO-Goal capabilities in case of requests. The pre- and post-conditions of the WSMO-
capabilities are limited to WSML-Flight expressions.

Listing X+24 shows an example of a WSMO-Goal description that represents a request for a WPS that inputs two
polygons and outputs a single polygon. The request does not specify the requested operation.
          goal Goal1
             capability Goal1Capability
                sharedVariables {?x, ?y}
                precondition Goal1Precondition definedBy
                    ?x memberOf iso19107#Polygon and
                    ?y memberOf iso19107#Polygon.
                postcondition Goal1Postcondition definedBy
                    ?z memberOf iso19107#Polygon and
                    ?operation(?x, ?y, ?z).

            Listing X+24: WSML goal requesting a processing service with two input polygons and a single output polygon.

The corresponding representation in standard LP syntax is shown in the following29.

                                              Goal1 pre : polygon(X) ∧ polygon(Y)
                                              Goal1 post : polygon ( Z ) ∧ OP(X , Y , Z)
                                              Goal1share : X , Y

Listing X+25 gives an example of a functional description of a WPS that advertises the union operation on
polygons. It requires both input polygons (and the output polygon) to be defined with respect to a common
projected spatial reference system.

          webService UnionWPS
             capability UnionWPSCapability
                sharedVariables { ?a, ?b, ?srs }
                precondition UnionWPSPrecondition definedBy
                    ?a[iso19107#hasSRS hasValue ?srs] memberOf iso19107#Polygon and
                    ?b[iso19107#hasSRS hasValue ?srs] memberOf iso19107#Polygon and
                    ?srs memberOf iso19107#projSRS.
                postcondition UnionWPSPostcondition definedBy
                    ?c[iso19107#hasSRS hasValue ?srs] memberOf iso19107#Polygon and
                    iso19107#union(?a, ?b, ?c).

                                    Listing X+25: WSML Web Service Description of a union WPS.

The following statement shows the corresponding LP representation.

                  UnionWPSpre : polygon(A) ∧ polygon(B) ∧ hasSRS(A, SRS) ∧ hasSRS(B, SRS) ∧ projSRS(SRS)
                  UnionWPSpost : polygon(C) ∧ union(A, B, C) ∧ hasSRS(C, SRS)
                  UnionWPSshare : A, B, SRS

6.3.1       Mediation

#...#




29
     The statement OP(...) contains the variable OP in place of a predicate name and is not allowed in standard LP languages.
7       Semantic Annotation Engine - Miha Grcar, JSI
###

a) Translating the service documents into WSML

The WFS capabilities document contains general metadata on the service provider, a list of feature types that
are served and a list of filters implemented by the WFS. The specific WFS WEBSERVICE, which results from
translating the WFS Capabilities into WSML, constrains the output of the GetFeature operation to the
features the service is actually serving, along with a specification of the optional filters implemented by the
WFS.

Also, based on its feature type schema, every feature type is translated into a corresponding WSML
representation. Those elements of the schemas that point to GML encodings, e.g. the geometry attribute,
are mapped to corresponding concepts in the GML ONTOLOGY. The result of this translation step is a Feature
Type Ontology (FTO) providing structural information about all feature types served by the specific WFS. By
referencing the output constraint in the WFS WEBSERVICE to the FTO concepts, it is possible to access more
detailed information on the feature type schema during discovery.

Figure 27 depicts an example of how a feature type can be represented in WSML. The example is taken
from one of the WFS offered by BRGM30 that serves features about quarries.


     <element name="exploitationsponctualsproduction" type="qua:exploitationsponctualsproductionType"
                                                                          substitutionGroup="gml:_Feature" />
      <complexType name="exploitationsponctualsproductionType">
        <complexContent>
        <extension base="gml:AbstractFeatureType">
         <sequence>
           <element name="msGeometry" type="gml:GeometryPropertyType“ />
           <element name="ExploitationName" type="string" />
           <element name="Communities" type="string" />
           <element name="Substance" type="string" />
           <element name="Year" type="string" />
           <element name="AllowedProduction" type="string" />
           <element name="SiteName" type="string" />
           <element name="SiteType" type="string" />
         </sequence>
         </extension>
        </complexContent>
      </complexType>                                                                                    a)


      concept exploitationsponctualsproduction subConceptOf gml#Feature

         msgeometry impliesType (1 1) gml#GeometryPropertyType
         exploitationname impliesType (1 1) _string
         communities impliesType (1 1) _string
         substance impliesType (1 1) _string
         year impliesType (1 1) _string
         allowedproduction impliesType (1 1) _string
         sitename impliesType (1 1) _string
         sitetype impliesType (1 1) _string                                                                 b)


                   Figure 27: Transforming the feature type schema (a) into the FTO written in WSML (b).



b) Establishing Links to Domain Ontologies

30
     All services, data, schemas, and ontologies for Use Case I can be accessed at http://swing.brgm.fr/.

In order to ensure that we not only use the correct encoding, but also capture the semantics of the
registered data sets, the feature types served by the WFS still have to be semantically annotated. This can be
done by mapping elements of the feature type schema to concepts in domain ontologies. Domain ontologies
are developed to capture the conceptualization of a specific view on the world and to formalize it in concept
definitions. It is assumed that all members of the geographic information community will interpret the
terminology used in their domain ontology in the same way and that, at the same time, people from outside the
community are able to explore the intended meaning with the help of the concept definitions.

Figure 22 depicts another example from the SWING Use Case. The ellipses represent a schematic
extract of the domain ontology (DO) on quarries (sites where mineral resources are
produced or mined). This domain ontology has been developed in cooperation with BRGM as a first result of
the Knowledge Acquisition Strategy presented in Section 5.2.2.

Quarry is the central concept of the Quarry Ontology. It is defined as a sub-concept of IndustrialSite, which
means that Quarry inherits all relationships that have already been defined for IndustrialSite. Some of the
relationships require further restrictions: the range of hasLocation points to QuarryLocation, which is a sub-
concept of Location, and Production produces not just any Product, but QuarryProduct. Again, these
concepts are further defined by adding or constraining their non-taxonomic relationships.

The feature type “exploitationsponctualsproduction” (which was already presented in Figure 27) denotes
point objects that model quarries (real-world geographic entities). In turn, real-world quarries are described
in the concept definition of domain:Quarry. The feature type’s attributes refer either to the information
object (e.g. “msgeometry”) or to the geographic entity (e.g. “allowedproduction”). While those attributes that
describe the information object have been referenced to the GML ONTOLOGY (cf. the translation in Figure 27),
attributes that refer to the geographic entity have to be semantically annotated with concepts from the
QUARRY ONTOLOGY. In our example, “allowedproduction” is mapped to the domain concept
domain:ProductionRate.


[Figure content: a schematic extract of the Quarry domain ontology – the concepts Location, IndustrialSite, Production and Quantity, with Quarry as a sub-concept of IndustrialSite; range constraints on hasLocation and produces point to QuarryLocation and QuarryProduct, and hasProductionRate points to ProductionRate – shown above the exploitationsponctualsproduction feature type of Figure 27.]

Figure 22: Two elements of the feature type schema are exemplarily mapped to domain concepts.

###

This chapter presents an approach for automating semantic annotation within service-oriented architectures
that provide interfaces to databases of spatial-information objects. The automation of the annotation process
facilitates the transition from the current state-of-the-art architectures towards semantically-enabled
architectures to support discovery, composition, and execution of geo-services.

In SWING, semantic annotation is understood as the process of explicitly establishing links between
geographic information that is served via OGC services and the vocabulary defined in the domain ontology
(i.e. the vocabulary of a specific GI community). Once the bridge between the two sides is established, the
domain ontology can be employed to support all sorts of user tasks. The annotation is a non-trivial process
facilitated by data mining techniques and supervised by the domain expert. The main purpose of this chapter
is to present data-mining techniques that facilitate the annotation process.

7.1     Employing Text Mining for Semantic Annotation

Section 3 revealed the need to semi-automatically map W*S4 concepts to the domain ontology concepts and
exposed the issue of missing data for this task. In SWING we thus plan to use alternative data
sources from the Web to compensate for the missing data. Several such approaches were already presented in
Section 1.2, but no direct connection to SWING was given there. In this section we present some of those
approaches in the light of SWING and give concrete examples with respect to the preliminary domain
ontology developed in WP 2.

Let us first define the problem of mapping one concept to another in more technical terms. We are given a
feature-type-ontology (FTO) concept as a single textual string (e.g. OpenPitMine) and a domain ontology,
which is basically a directed graph in which nodes represent concepts and edges represent relations between
concepts. Each concept in the domain ontology is again given as a single textual string (e.g. D:Quarry5). The
task is now to discover that OpenPitMine is much more closely related to D:Quarry than to, for instance,
D:Legislation or D:Transportation. It is also important to mention that every FTO concept has a set of
attributes. Each attribute is given as a single textual string (e.g. OpenPitMine.SiteName) and has a
corresponding data type (the data type is not expected to provide much guidance in the annotation process
since it is usually simply string). Concepts in the domain ontology, on the other hand, can similarly be
described with the surrounding concepts, e.g. D:Quarry(hasLocation)QuarryLocation6. The only difference
is that we do not need to limit ourselves to the immediate neighbourhood of the concept – we can take
concepts that are more than one step away from the observed concept into account (e.g.
D:Quarry(hasLocation)QuarryLocation(constrainedBy)......Topography).

A straightforward approach would be to compare the strings themselves. Even taking attribute strings
into account, coupled with some heuristics, we cannot hope for good results – this can serve merely as a
baseline. By using string matching we could, for instance, map the attributes OpenPitMine.AllowedProduction,
OpenPitMine.SiteName, and OpenPitMine.SiteType to some attributes of D:Quarry, namely
D:Quarry(hasIndustrialActivity)Production and D:Quarry(is_a)IndustrialSite, and thus conclude that the concepts
OpenPitMine and D:Quarry are related (we would find no such matches between OpenPitMine and
D:Legislation or D:Transportation). In some cases even parts of concept names would match, but we
cannot rely on that.

In the following we present several promising approaches that use alternative data sources (mostly the Web)
to discover mappings between concepts. We limit ourselves to a scenario where attributes are not available
(i.e. we are given a FTO concept and a set of domain ontology concepts). The task is to arrange the domain
ontology concepts according to their relatedness to the FTO concept. In the examples we will use
OpenPitMine as the observed FTO concept and the domain ontology concepts D:Quarry, D:Legislation, and
D:Transportation. In Section 4.1 we first introduce the idea of comparing concepts by populating them
with (textual) documents that reflect the semantics of these concepts. To enable the realization of these
ideas in the context of SWING, we first need to resolve the issue of the missing documents that could be used
to populate the concepts. Section 4.1.3 presents two promising techniques of using a Web search engine (in our
particular case: Google) to acquire the “missing” documents. The sections that follow (4.2 and 4.3) present two
alternative ways of using the Web for concept annotation. Rather than dealing with documents, these
approaches deal with term co-occurrences and linguistic patterns, respectively.

7.1.1    Concept Similarity by Comparison of Documents

Suppose we have a set of documents assigned to a concept and that these documents “reflect” the semantics
of the concept. This means that the documents talk about the concept, or that a domain expert would
use the concept to annotate (categorize) these documents. One such scenario is, for instance, the freely
available on-line directory DMoz <http://www.dmoz.org> (another example would be the well-known
Yahoo! directory <http://dir.yahoo.com>). Over 4 million sites have been manually categorized into over
590,000 DMoz categories. The directory is maintained by over 70,000 editors in an “open” fashion, which
means that everybody can contribute. The categories are arranged into a taxonomy, which can be seen as a
very simple ontology (also termed a “topic ontology”). Each7 category contains a set of Web pages (i.e. a set
of documents) that were put there manually by the editors. Note that the categories we are talking about are
actually concepts in the corresponding topic ontology. The semantics of each of these concepts is thus
“described” by the set of associated documents. In such cases we can compute the similarity between two
concepts. We are given a FTO concept (in our case OpenPitMine) and several domain ontology concepts (in
our case D:Quarry, D:Transportation, and D:Legislation) with their corresponding document sets8. We first
convert every document into its bag-of-words representation, i.e. into the tfidf representation explained in
Section 1.1.2. A tfidf representation is a sparse vector of word frequencies, compensated for the
commonality of words (this is achieved by the idf component) and normalized. Every component of a tfidf
vector corresponds to a particular word in the dictionary; with “dictionary” we refer to the different words
extracted from the entire set of documents. If a word does not occur in the document, the corresponding tfidf
value is missing – hence the term “sparse vector”. Each of these vectors belongs to a certain concept – we
say that it is labelled with the corresponding concept. This gives us a typical supervised machine learning
scenario (see Section 1.1.3). In the following subsections we present three different approaches to concept-
to-concept similarity computation using different machine learning approaches.


Concept similarity by comparing centroids

A centroid is a (sparse) vector representing an (artificial) “prototype” document of a document set. Such a
prototype document should summarize all the documents of a given concept. There are several ways to
compute a centroid (given the tfidf vectors of all documents in the corresponding set). Some of the well-known
methods are the Rocchio formula, the average of the vector components, and the (normalized) sum of the vector
components. Of the listed methods, the normalized sum of vector components has been shown to perform best in
the classification scenario [4].

In the following we limit ourselves to the method of normalized sum. We first represent the documents of a
particular concept C as normalized tfidf vectors vd. Now we compute the centroid as follows:

                          c(C) = ( Σ_{d ∈ C} vd ) / ‖ Σ_{d ∈ C} vd ‖

Having computed the centroids for all concepts, we can now measure the similarity between centroids and
interpret it as the similarity between the concepts themselves (we are able to do this because a centroid summarizes
the concept it belongs to). This is illustrated in Figure 16. Usually the cosine similarity measure
(presented in Section 1.1.2) is used to measure the similarity between two centroid vectors. Section 4.1.3
provides an illustration of this approach on our example concepts.
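
As a minimal sketch of the centroid approach (using scikit-learn; the tiny document sets are hypothetical stand-ins for the Web-derived documents of Section 4.1.3):

     import numpy as np
     from sklearn.feature_extraction.text import TfidfVectorizer

     docs = {  # hypothetical documents "populating" the concepts
         "OpenPitMine": ["surface mine excavating ore and rock",
                         "open pit extraction site for minerals"],
         "D:Quarry": ["site where stone and minerals are extracted",
                      "rock excavation and extraction pit"],
         "D:Legislation": ["laws and regulations enacted by parliament"],
         "D:Transportation": ["movement of goods by road and rail"],
     }
     vectorizer = TfidfVectorizer().fit([d for ds in docs.values() for d in ds])

     def centroid(doc_set):
         # normalized sum of the tfidf vectors of the document set
         c = np.asarray(vectorizer.transform(doc_set).sum(axis=0)).ravel()
         return c / np.linalg.norm(c)

     fto = centroid(docs["OpenPitMine"])
     for name in ("D:Quarry", "D:Legislation", "D:Transportation"):
         print(name, float(fto @ centroid(docs[name])))  # cosine similarity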

Concept similarity by classification

We already mentioned that every tfidf vector is labelled with the corresponding concept and that this gives us
a typical supervised machine learning scenario. In a typical supervised machine learning scenario we are
given a set of training examples. A training example is actually a labelled (sparse) vector of numbers. We
feed the training examples to a classifier which builds a model. This model
summarizes the knowledge required to automatically assign a label (i.e. a concept) to a new yet unlabelled
example (we term such unlabelled examples “test examples”). This in effect means that we can assign a new
document to one of the concepts. We call such assignment (of a document to a concept) classification. How
do we use the classification to compare two concepts? The approach is quite straightforward. We take the
documents belonging to a FTO concept (in our case the documents of OpenPitMine) and strip them of their
label thus forming a test set. Now we assign each of these documents to one of the domain ontology concepts
(i.e. we classify each of the documents to one of the domain ontology concepts). The similarity between a
FTO concept and a DO concept is simply the number of FTO-concept documents that were assigned to that
particular DO concept. We can also normalize these scores by dividing them with the number of
classifications to get values between (and including) 0 and 1.
                                                     #...#
                Figure 16: Three sets of documents (red, blue, orange) with their corresponding centroids in a
2-dimensional example. Imagine that the red set of documents corresponds to a FTO concept and the other two
sets to some domain ontology concepts – dotted lines represent distances from the FTO concept to the domain
ontology concepts (the red concept is more similar to the orange concept than to the blue concept). Note that
in this illustration neither the vectors nor the centroids are normalized (the centroids are simply the averages
of the vector components), and Euclidean distance is shown instead of cosine similarity. The intuition however
remains the same.

We mentioned that we use a classifier to assign labels to documents. There are many different classifiers at
hand in the machine learning domain. Herein we employ a very popular classifier – Support Vector Machine
(SVM). In addition we also demonstrate how the same task is performed with the k-nearest neighbours (k-
NN) algorithm which has the property of a “lazy learner”. The latter means that k-NN does not build a model
out of training examples – instead it uses the training examples directly to perform classification.

Classification via SVM

SVM is a very popular classifier. In its basic form it is able to classify test examples into only two classes:
positive and negative. We say that SVM is a binary classifier. This means that the training examples must also
be of only two kinds: positive and negative. Since examples are vectors, we can see them as points in a
multi-dimensional space. The task of SVM is to find a hyper-plane such that most of the positive training
examples lie on one side of the hyper-plane while most of the negative training examples lie on the other side
(a simple 2-dimensional example is shown in Figure 17). Formally, SVM is an optimization problem that can
be solved optimally. Recently it has been shown that this can actually be done in linear time9, which is quite
a breakthrough regarding the usefulness and quality of SVM [50].

Even though SVM is binary, we can combine several such classifiers to form a multi-class variant of SVM.
Let us explain one of the multi-class SVM variants on our example. We build one model to distinguish
between D:Quarry and D:Transportation, another to distinguish between D:Quarry and D:Legislation, and
finally one to distinguish between D:Transportation and D:Legislation. Now we query each of these models
with a particular document (from the FTO concept). Each model decides for one of the two corresponding
concepts and consequently the concepts are getting votes. In the end we assign the document to the concept
that was given the most votes. This kind of classification is known as one-vs-one. Other multi-class variants
are discussed and evaluated in [12].

                                                  #...#
Figure 17: Example SVM scenario: positive examples (red), negative examples (blue), and the computed hyper-
                      plane (purple). This image is courtesy of Blaž Fortuna (JSI).
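
A sketch of this voting scheme with scikit-learn, whose SVC classifier applies one-vs-one voting internally for multi-class problems (the documents and labels below are hypothetical):

     from collections import Counter
     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.svm import SVC

     train_texts = ["stone extraction site", "rock quarry pit",   # D:Quarry
                    "mining law and regulation",                  # D:Legislation
                    "transport of aggregates by truck"]           # D:Transportation
     train_labels = ["D:Quarry", "D:Quarry", "D:Legislation", "D:Transportation"]
     test_texts = ["open pit mine for extracting rock",           # OpenPitMine documents
                   "surface mining excavation"]

     vec = TfidfVectorizer()
     clf = SVC(kernel="linear")  # one-vs-one voting over all pairs of concepts
     clf.fit(vec.fit_transform(train_texts), train_labels)

     votes = Counter(clf.predict(vec.transform(test_texts)))
     similarity = {c: n / len(test_texts) for c, n in votes.items()}  # normalized scores
     print(similarity)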

Classification via k-NN

We already mentioned that k-NN is one of the “lazy learners”, which means that it does not build a model out of
the training examples. It performs the classification of a document by finding the k most similar documents among
all the documents that belong to the domain ontology concepts (see Figure 18). The similarity between the
document and a DO concept can be computed as the number of documents (from the set of k most similar
documents) that belong to that DO concept. This score can be normalized by dividing it by k. k-NN uses
tfidf representations and cosine similarity to perform the comparisons needed to find similar documents. Each of
the k most similar documents is similar to the reference document to a certain degree (reflected in the cosine
similarity score). We can use these similarity scores for a somewhat more sophisticated determination of the
target concept (i.e. by weighting documents with the corresponding similarity scores).

                                                     #...#
                                    Figure 18: k-nearest neighbours (k = 7).

The figure shows three sets (i.e. clusters) of documents (red, blue, orange) and a new document (yellow)
which we want to classify (i.e. categorize) into one of the sets. The neighbourhood of 7 most similar
documents suggests the classification of the new document into the orange cluster because there are three
orange documents in the neighbourhood compared to only two red and two blue documents. In this
illustration vectors are not normalized and Euclidean distance is shown instead of cosine similarity. The
intuition however remains the same.
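
The corresponding k-NN sketch (scikit-learn's KNeighborsClassifier with the cosine metric performs exactly the neighbour counting described above, and predict_proba already divides the counts by k; the data are hypothetical, with k reduced to fit the tiny example):

     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.neighbors import KNeighborsClassifier

     train_texts = ["stone extraction site", "rock quarry pit",
                    "mining law and regulation", "transport of aggregates by truck"]
     train_labels = ["D:Quarry", "D:Quarry", "D:Legislation", "D:Transportation"]
     test_texts = ["open pit mine for extracting rock"]

     vec = TfidfVectorizer()
     knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
     knn.fit(vec.fit_transform(train_texts), train_labels)

     # per-concept fraction of the k nearest neighbours, averaged over all test documents
     scores = knn.predict_proba(vec.transform(test_texts)).mean(axis=0)
     print(dict(zip(knn.classes_, scores)))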


7.1.2   Google Definitions and (Contextualized) Search Results

“If Google has seen a definition for the word or phrase on the Web, it will retrieve that information and
display it at the top of your search results. You can also get a list of definitions by including the special
operator ‘define:’ with no space between it and the term you want defined. For example, the search
‘define:World Wide Web’ will show you a list of definitions for ‘World Wide Web’ gathered from various
online sources.” (excerpt from the Google Help Centre
<http://www.google.com/help/features.html#definitions>). Googlebots are crawling the Web all the time. In
their crusades they gather terabytes of data, which is then processed in order to discover information that is
potentially of particular interest to Google users (such as products for sale on-line, weather forecasts, travel
information, and images). One such separately maintained information repository holds the definitions of
words and phrases as found on the Web.

Google definitions can be used to compensate for the missing document instances – each definition (known
by Google) can be seen as one document. In this way we can “populate” the concepts with documents and then
perform the mapping (i.e. the annotation) as already explained in Section 4.1. To get back to our example: if
we populate the concepts OpenPitMine, D:Quarry, D:Transportation, and D:Legislation with document instances
and then compare OpenPitMine (which is a FTO concept) to the domain ontology concepts (i.e. the other
three concepts), we get the centroid-to-centroid similarities shown in Table 1. Since it is hard to find
definitions for n-grams such as “open pit mine” (i.e. compositions of 3 or more words), we additionally
query Google for the definitions of “pit mine” and “mine”, weighting the contribution of these definitions
less than that of the initial composed term (if the “complete” definition exists, that is).

This approach already shows promising results. However, there are still some issues that need to be
considered when using it. For one, a word can have several meanings, i.e. its semantics depends
on the context (or the domain). Google does not know in which context we are searching for a particular
definition – it thus returns all definitions of a particular word or phrase it keeps in its database. “Mine”, for
instance, can be defined either as “excavation in the earth from which ores and minerals are extracted” or as
“explosive that explodes on contact”. It is important to somehow detect the documents that do not talk about
the geospatial domain and exclude them from the annotation process (one of the tasks of D4.2). Another
issue to consider is multilinguality in concept names. Google does provide definitions in other
languages as well, but that does not mean that we are able to compare such “foreign” definitions with the English
documents that populate the domain ontology. To resolve this, we can populate the domain ontology
with alternative document sets (in languages other than English) or resort to automatic translation of non-
English concept names and/or definitions to English (D4.3 will tackle this issue).

                                                       #...#
                      Figure 19: Google screenshot – searching for definitions of “quarry”.
                                  D:Quarry        D:Legislation      D:Transportation
                 OpenPitMine       0.281              0.011               0.041

Table 1: Centroid-to-centroid cosine similarities for Google definitions. Only English definitions are considered.


Note that we can also populate concepts with Google search results (in contrast to, or even in addition to,
populating them with definitions). In this case we can put the search term into a context by extending it with
words or phrases describing the context. For example: to populate the concept OpenPitMine in the context of
“extracting building materials” with documents, we would query Google for “open pit mine extracting
building materials” and consider the first 50 search results. Centroid-to-centroid similarities for this approach
are shown in Table 2.

                                  D:Quarry        D:Legislation      D:Transportation
                 OpenPitMine    0.087 / 0.360     0.040 / 0.147       0.062 / 0.121

Table 2: Centroid-to-centroid cosine similarities for Google search results. Only English pages are considered;
the top 50 search results are considered each time. The first number denotes the similarity when the search term
is not placed into a context; the second number denotes the similarity when the search term is placed into the
context of “extracting building materials”.

7.1.3   Hypotheses Checking by Using Linguistic Patterns

We can use a Web search engine to estimate the truthfulness of a hypothesis given as a statement in natural
language. If we query Google for “quarry is an open pit mine”, it returns 13 hits (at the time of writing this
report). We also get 3 hits for the query “mine is a quarry”. In contrast, we do not get any hits for the queries
“quarry is a transportation” or vice versa, and “quarry is a legislation” or vice versa. We can check for
synonymy between any two words (or even n-grams) w1 and w2 with this same pattern expressed as a
template:

                “w1 is a10 w2” or “w2 is a w1”.
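
A minimal sketch of how such hypothesis checking could be scripted; hit_count is a hypothetical helper wrapping a search-engine API and is not part of the SWING code base:

     def synonymy_score(w1, w2, hit_count):
         # Estimate the "truthfulness of synonymy" as the total number of
         # hits for both directions of the "is a" pattern.
         return hit_count(f'"{w1} is a {w2}"') + hit_count(f'"{w2} is a {w1}"')

     # e.g. synonymy_score("open pit mine", "quarry", hit_count) would roughly
     # correspond to the 13 + 3 hits reported above.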

Hearst [24] presented several such patterns for the acquisition of hyponyms. In some works these patterns are
thus called Hearst patterns (see Section 1.2.3). Hearst also provided a recipe for finding new patterns
(potentially expressing some other relations and in domain-specific environments):

       1. Decide on a lexical relation of interest (such as synonymy).

       2. Gather a list of term pairs for which this relation is known to hold, e.g. (“quarry”, “mine”).

       3. Find parts of text in which the terms from a term pair appear syntactically near one another.

       4. Find commonalities among these parts of text – common patterns should indicate the relation of
          interest.

We plan to follow these steps to potentially discover new, more SWING-specific patterns. Intuitively,
synonymy seems to be the relation most suitable for the annotation task, because we can infer the similarity
between two concepts from the “truthfulness of synonymy” between them (expressed, for instance, as the number
of Google search results when “checking” the synonymy hypotheses). However,
hyponymy can be used to extend the set of synonymy hypotheses. The idea is to actually populate the concepts
with instances and then try to find synonymies between these instances. This is particularly useful in cases
where the hypothesis checking on the concepts (their string representations, more accurately) fails or yields
inconsistent results. We do not plan to pursue this possibility in SWING if other techniques prove sufficient.
The system called KnowItAll [13] uses Hearst patterns and a set of Web search engines to populate concepts
with instances.

7.1.4      Google Distance

Word similarity or word association can be determined from the frequencies of word (co-)occurrences in text
corpora. Google distance [6] uses Google to obtain these frequencies. Based on the idea that if the
probability of word w1 co-occurring with word w2 is high, then the two words are “near” each other, and
vice versa, the authors came up with the following distance measure:

     D(w1, w2) = max{ log 1/p(w1 | w2), log 1/p(w2 | w1) }.

Furthermore, they normalized this measure so that if either of the two words is not very common in the corpus,
the distance is made smaller. The formula for the normalized Google distance (NGD) is thus:

     NGD(w1, w2) = max{ log 1/p(w1 | w2), log 1/p(w2 | w1) } / max{ log 1/p(w1), log 1/p(w2) }

                 = ( max{ log f(w1), log f(w2) } − log f(w1, w2) ) / ( log M − min{ log f(w1), log f(w2) } ),
where p(w1) = f(w1) / M, p(w1 | w2) = f(w1,w2) / f(w2). Here f(w) is the number of search results returned by
Google when searching for w (similarly f(w1,w2) is the number of search results returned by Google when
searching for pages containing both terms), and M is the maximum number of pages that can potentially be
retrieved (posing no constraints on language, domain, file type, and other search parameters, Google can
potentially retrieve around 10 billion pages).
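
A direct transcription of the formula (a sketch; f is a hypothetical helper returning the number of search results for a query, and M is the total number of retrievable pages):

     import math

     def ngd(w1, w2, f, M):
         # f("w")     -> number of pages containing w
         # f("w1 w2") -> number of pages containing both terms
         fx, fy, fxy = f(w1), f(w2), f(w1 + " " + w2)
         return (max(math.log(fx), math.log(fy)) - math.log(fxy)) / \
                (math.log(M) - min(math.log(fx), math.log(fy)))

     # A context is added by appending the context words to every query passed
     # to f and re-computing M as the hit count of the context alone.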

Let us first take a look at NGDs between our example FTO concept OpenPitMine and, on the other hand,
domain ontology concepts D:Quarry, D:Legislation, and D:Transportation:

            NGD(“open pit mine”, “quarry”) = 0.12,

            NGD(“open pit mine”, “legislation”) = 0.41, and

            NGD(“open pit mine”, “transportation”) = 0.48.

Note that we searched only pages written in English, but posed no other search constraints. When
searching for “open pit mine” we did not demand that the term appear in the page literally; rather, we
searched for pages containing all three words.

It is also possible to put the NGD computation into a context. This can be done simply by extending the
Google queries (the ones used to obtain the frequencies) with the words that form the context. Note that in
this case M must be determined as the number of search results returned when searching for the words that
form the context alone. Suppose we put NGD into the context of "minerals and rocks". We obtain the
following results:

            NGD“minerals and rocks”(“open pit mine”, “quarry”) = 2.35,

            NGD“minerals and rocks”(“open pit mine”, “legislation”) = 7.40, and

            NGD“minerals and rocks”(“open pit mine”, “transportation”) = 3.90.

From this simple experiment we can see that “transportation” is more related to “open pit mine” than
“legislation” when “minerals and rocks” are concerned – that was not the case in the non-contextualized
setting. With this we merely demonstrate the context-sensitivity of the distance measure.

We believe that NGD is not really the best way to search for synonymy because synonyms generally do not
co-occur; it is rather a measure of relatedness or association. Nevertheless, it can be tried out in the SWING
annotation task. Also note that any other search engine that reports the total number of search results can be
used instead of Google.

7.2       Term Matching: A Building Block for Automating the Annotation Process

AutoBridge31 is an implementation of the term matching techniques that will serve as building blocks for
semi-automating the annotation process in SWING. Term matching must not be confused with string
matching: instead of comparing the terms' strings, we first attach a set of documents to each term, and the
similarity between two terms is then inferred from the similarity between the two corresponding sets of
documents. The documents are retrieved from the Web by querying a Web search engine, the query being
the term in question. These techniques are explained in greater detail in D4.1.
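
As a minimal sketch of this idea (not the actual AutoBridge code): each term is mapped to the documents
returned by a search engine, the documents are embedded as TF-IDF vectors, and the two terms are compared
through the cosine similarity of the centroids of their document sets. The search helper below is a hypothetical
stand-in; scikit-learn is used for the vector space model.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def search(term, n_results=30):
    """Hypothetical stand-in: return the text of the top n_results search
    hits (snippets or pages) for the given term."""
    raise NotImplementedError

def term_similarity(term_a, term_b):
    """Term matching, not string matching: the two terms are compared
    through the document sets retrieved for them."""
    docs_a, docs_b = search(term_a), search(term_b)
    vectors = TfidfVectorizer(stop_words="english").fit_transform(docs_a + docs_b).toarray()
    # Centroid of each term's document set, normalised to unit length,
    # so that their dot product equals the cosine similarity.
    centroid_a = vectors[:len(docs_a)].mean(axis=0)
    centroid_b = vectors[len(docs_a):].mean(axis=0)
    centroid_a /= np.linalg.norm(centroid_a)
    centroid_b /= np.linalg.norm(centroid_b)
    return float(centroid_a @ centroid_b)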

The current implementation of AutoBridge provides the term matching functionality in the form of a
command-line utility. As input parameters we need to specify the file containing the source terms and the
file containing the target terms32. The source terms are seen as the “common vocabulary” with which the
target resource is to be annotated (e.g. the concepts from the domain ontology). The target terms, on the
other hand, represent the target resource that is to be annotated and thus integrated into the domain (e.g. the
names of Web service attributes). Apart from the source and target terms, AutoBridge allows us to specify:

•   Search context (-sc): A term can have several meanings depending on the context (or the domain). It is
    possible to explicitly provide the context in which the term should be searched for. The context is
    appended to the search term when querying the search engine. For example, when querying for “open pit
    mine” in the context of “excavating material”, the search term becomes either »open pit mine excavating
    material« or »“open pit mine” “excavating material”«, depending on whether we requested to put terms
    into quotes or not (see the -qt option). Here are some examples of contexts that we can specify:

            o   “minerals”: Search in the context of minerals.

            o   “site:wikipedia.org”: Search only through Wikipedia pages.

            o   “minerals site:wikipedia.org”: Search only Wikipedia pages in the context of minerals (the -
                qt flag must be off for this to work properly).

            o   “definition”: Search mainly for on-line definitions.

31
     In verbal presentations we use the name AutoMap to avoid confusion caused by the phonetic similarity between “AutoBridge” and
       “OntoBridge”.
32
     These two files have a very simple textual format: every line contains one term.

•   Put terms into quotes (-qt): Specifies whether to put terms (and the context, if provided) into quotes
    when querying the search engine (-qt:yes or –qt:no). If a term that contains more than one word is
    not put into quotes, the search engine will look for pages in which these words co-occur and will thus
    retrieve pages in which these words do not necessarily represent the given term. In contrast to this, when
    the -qt option is set, the search engine will only retrieve pages in which the given term occurs.

•   Classification algorithm (-ca): Specifies the classification algorithm. Currently the following
    classification algorithms are supported:

            o   Centroid classifier (-ca:Cen). This classifier is the fastest one. It works by pre-computing
                the centroid for each set of documents. The similarity between two terms is inferred from the
                similarity between the two corresponding centroids. For a more detailed description see D4.1.

            o   k-Nearest Neighbors (-ca:kNN). This is a well-known machine learning classifier. Its
                characteristic is that nothing is pre-computed (i.e. no model is built). To sort the list of
                source terms, each document corresponding to the target term (let us call it a “target
                document”) is compared to all the documents that correspond to source terms (let us call
                them “source documents”). The k source documents that are most similar to the target
                document are then examined to determine the source term to which the majority of these k
                documents corresponds. This process is carried out for each target document, and the
                aggregated evidence that the target term corresponds to a particular source term is used to
                sort the list of source terms in the end. See D4.1 for more details.

            o   Normalised Google Distance (-ca:NGD). This is not a typical machine learning classifier. It
                is based on the work of Cilibrasi et al. [6]. The idea is to use a Web search engine (Google is
                used in the original work, hence the name of this method) to determine the first term
                frequency, the second term frequency, and their co-occurrence frequency in order to compute
                the distance between the two terms. The NGD equation is given and explained in Section
                7.1.4 and in D4.1.

•   Data source (i.e. search engine) (-ds): Specifies the search engine to be used as the source of
    documents. Currently it is possible to choose between Google search engine (-ds:GooSe) and Google
    definitions (-ds:GoDef).

•   Evaluation (-ev): Specifies whether to perform the evaluation (-ev:yes or –ev:no). For this option
    to work the set of source terms and the set of target terms need to be aligned. This means that the number
    of source terms must match the number of target terms and the k-th source term must correspond to the
    k-th target term. An example of such alignment of terms is shown in Table 3.
                           Target term                   Source term
                         squirrel                        animal
                         cow                             farm animal
                         flower                          flora
                         rose                            flora
                         engine                          automobile
                         fly                             insect
                         butterfly                       insect
                         beetle                          insect
                         fiat                            automobile
                         hamster                         animal
                         bug                             insect
                         horse                           farm animal
                         owl                             animal
                         tulip                           flora
                         pig                             farm animal
                         car                             automobile

                                    Table 3: An example of aligned terms.
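
For illustration, and following the one-term-per-line file format described in footnote 32, the aligned sets
from Table 3 would be stored as two parallel files in which the k-th lines correspond to each other (the file
names match the command-line example given below):

TargetTerms.txt          SourceTerms.txt
---------------          ---------------
squirrel                 animal
cow                      farm animal
flower                   flora
rose                     flora
engine                   automobile
(remaining lines follow Table 3)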


The aligned terms can be synonyms, hyper-/hyponyms, or terms related in any other way. In the evaluation
process each target term is annotated with a sorted list of source terms. The list of source terms is sorted
according to the relatedness (or association) to the target term. We can measure the accuracy of the algorithm
as the number of correct mappings divided by the number of all mappings. The annotation is correct for the
given target term if the sorted list of source terms contains the term initially aligned with the target term
within the top N items. The optimal annotation algorithm would achieve a 100 % accuracy at N = 1 and
would enable the (full) automation of the annotation process. In reality we cannot expect (nor achieve) such
performance. Therefore we need to present a sorted list to the user and let him/her choose (or confirm) the
correct match, which is expected to be near the top of the list. From this perspective it is also interesting to
measure the accuracy at several other values of N, e.g. at N = 3, N = 5, and N = 10 (in Sections 7.3.1 and
7.3.2 these are termed accuracy at top 3, accuracy at top 5, and accuracy at top 10, respectively). If we
achieve, for example, a 90 % accuracy at N = 5, this means that 90 % of the time the correct source term will
appear in the top 5 items of the sorted list. In other words, in 90 % of the cases the user will only need to
consider the top 5 items to identify the source term that corresponds to the given target term.
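
The accuracy-at-N measure just described is straightforward to state in code. The sketch below assumes
each target term already has a ranked list of source terms (best first) and a gold alignment; these data
structures are illustrative, not AutoBridge internals.

def accuracy_at_n(ranked_lists, gold, n):
    """Fraction of target terms whose correct (aligned) source term
    appears among the top n suggestions.

    ranked_lists : dict mapping a target term to its sorted list of source terms
    gold         : dict mapping a target term to the correct source term
    """
    hits = sum(1 for term, ranking in ranked_lists.items() if gold[term] in ranking[:n])
    return hits / len(ranked_lists)

# accuracy_at_n(results, gold, 5) == 0.9 would mean that for 90 % of the
# target terms the correct source term is among the top 5 suggestions.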

To explain the output of the current command-line implementation of AutoBridge, we will use the terms
given in Table 3 and execute the following command:

AutoBridge.exe -ltf:TargetTerms.txt -rtf:SourceTerms.txt -qt:yes -ca:kNN
-ds:GooSe -ev:yes

This command annotates the target terms (contained in the text file TargetTerms.txt) with the source terms
(contained in the text file SourceTerms.txt) and performs the evaluation; the Google search engine is queried
for documents, no search context is provided, terms are put into quotes, and the k-Nearest Neighbors
classifier is used.

After executing the command, the results are displayed to the user as shown in Listing 1.

Matching the term "squirrel" (1/16)...
farm animal .............. 1.2884282123782933
animal .............. 1.24368345645288
flora .............. 1.22127300700062
insect .............. 1.20325424315968
automobile .............. 0.647209043724221

Matching the term "cow" (2/16)...
farm animal .............. 0.245852006788214

Matching the term "flower" (3/16)...
flora .............. 0.226738334103892

Matching the term "rose" (4/16)...
flora .............. 1.65500040794049
farm animal .............. 1.13179803048404
animal .............. 1.01476375759338
automobile .............. 0.812777010474462
insect .............. 0.734361813815835

Matching the term "engine" (5/16)...
automobile .............. 1.29175329003664
flora .............. 0.0717279385834579
farm animal .............. 0.0650156240184754
insect .............. 0.0623527002803854

Matching the term "fly" (6/16)...
animal .............. 1.31011166264108
insect .............. 1.1656433422567
automobile .............. 0.595757172122159
farm animal .............. 0.411790052644674
flora .............. 0.353748480146631

(omitting the results for the target terms butterfly, beetle, fiat, ..., car)

Final results:
squirrel <------------> farm animal (2. animal)
cow <------------> farm animal (OK)
flower <------------> flora (OK)
rose <------------> flora (OK)
engine <------------> automobile (OK)
fly <------------> animal (2. insect)
butterfly <------------> insect (OK)
beetle <------------> insect (OK)
fiat <------------> automobile (OK)
hamster <------------> animal (OK)
bug <------------> insect (OK)
horse <------------> farm animal (OK)
owl <------------> insect (4. animal)
tulip <------------> flora (OK)
pig <------------> farm animal (OK)
car <------------> automobile (OK)

Accuracy of this algorithm on top 1:  13/16 = 0.8125
Accuracy of this algorithm on top 3:  15/16 = 0.9375
Accuracy of this algorithm on top 5:  16/16 = 1
Accuracy of this algorithm on top 10: 16/16 = 1




33
     These numbers are the relevance scores. They represent the strength of the association between the target and the source term.
       The way they are computed depends on the algorithm being used. If the centroid classifier is used, for example, a score is
       computed as the cosine similarity between the two corresponding centroids.
                                          Listing 1: AutoBridge output.

From these results we can see that the target term squirrel is annotated with the source terms farm animal,
animal, and flora, in this order. This annotation is incorrect when only the topmost source term is considered;
the correct source term is in second place. This is also evident from the final results, where each target term
is associated with the topmost source term. The additional information in brackets tells us whether the
annotation with the topmost source term is correct. If it is not, the correct source term and its rank in the
corresponding sorted list are given in the brackets. We can see that the term owl was associated with the term
insect. The reason is that there exist owl butterflies and owl flies, which are of course insects. The domain
expert is required to detect and correct such false annotations.

At the bottom, the accuracy of the selected algorithm is presented for the given experimental setting. The
accuracy is measured for top 1, top 3, top 5, and top 10 items in a sorted list of source terms. In our particular
case we can see that the topmost source term is correctly identified in 81.25 % of the cases. If we take the top
3 items into account, the accuracy is improved to 93.75 %. We can also see that the correct source term will
always be one of the top 5 items in the sorted list (i.e. the accuracy at top 5 is already 100 %).

7.3     Evaluation of Term Matching Techniques

#...#

7.3.1    Preliminary evaluation

To obtain some preliminary results, we tested some of the presented methods on a dataset from the domain
of minerals. We obtained 150 mineral names together with their synonyms. To list just a few: acmite is a
synonym for aegirite, diopside is a synonym for alalite, orthite is a
synonym for allanite, and so on. The mineral names were perceived as our domain ontology concepts while
the synonyms were perceived as the feature-type ontology concepts. For each of the synonyms, the selected
algorithms were used to sort the mineral names according to the strength of the association with the synonym
in question. We measured the percentage of cases in which the correct mineral name was in the top 1, 3, 5,
and 10 names in the sorted list. In other words, we measured the accuracy of each of the algorithms
according to the top 1, 3, 5, and 10 suggested mineral names.

We employed 16 algorithms altogether: 7 variants of k-NN, 5 variants of the centroid classifier, and 4
variants of NGD. We varied the context and the data source (either Google definitions or Google search
engine). We also varied whether the order of words in a term matters or not (if the order was set to matter
then the term was passed to the search engine in quotes). Top 30 search results were considered when
querying the search engine. Only English definitions and English search results were considered. In the case
of k-NN, k was set dynamically by taking all the documents within the cosine similarity range of more than
(or equal to) 0.06 into account. The final outcome of a k-NN algorithm was computed as the sum of all the
cosine similarities measured between a synonym document and a mineral name document from the
corresponding neighbourhood. Table 4 summarizes the results of the evaluation. The experimental settings
are ordered according to the accuracy at top 1. The best performing experimental setting is listed first.

According to the preliminary evaluation results, the presented techniques show very good potential. From
the results it is evident that – at least for the dataset used in the evaluation – it is not beneficial to limit the
search to Wikipedia or Google definitions. However, it proved useful to perform the search in the context of
“minerals”.

Also noticeable from Table 4, as we suspected, NGD is not very successful in detecting synonymy.
Furthermore, it is much slower than the other presented algorithms as it queries the Web to determine term
co-occurrence frequencies. Last but not least, we can see that the best performing algorithms perform
slightly better if the search query is put into quotes (i.e. if the order of words in the term that represents the
query is set to matter).
7.3.2    Large-scale evaluation

In the following section we present a large-scale evaluation of AutoBridge. In this evaluation we are not
using real-life OGC services but rather lexical databases and thesauri found on the Web. This is because we
need the datasets to be aligned (i.e. each source term must be associated with the corresponding target term).
Furthermore, the idea is to evaluate these algorithms in a somewhat broader scope – not necessarily limited
to the service annotation task.

Datasets

STINET thesaurus. The Defense Technical Information Center’s Scientific and Technical Information
Network (STINET) thesaurus provides a hierarchically organised multidisciplinary vocabulary of technical
terms. The terms are related to each other with the following relations: “broader term”, “narrower term”,
“used alone for”, “use”, “used in combination for”, and “use in combination”. We included “narrower term”
and “used alone for” in our experiments. There are about
16,000 terms in the thesaurus, linked with about 15,000 “narrower” and 3,000 “used alone for” links, which
we subsampled to 1,000 pairs of terms for each of the two relations. The “narrower” relation indicates that
the category described by one term is a subset of the category described by the other term, and “used alone
for” indicates a term that can be used to replace the original term. The terms in this thesaurus are mostly
phrases pertaining to natural sciences, technology, and military. The fact that the dataset contains a large
number of phrases results in an improvement of the accuracy of the evaluated algorithms if quotes are used
when querying the search engine.


                                                                        Accuracy [%]
  Algorithm   Data source          Context                 Quotes   Top 1   Top 3   Top 5   Top 10
  Centroid    Google search        “minerals”              yes      93.33   98.67   99.33   100
  Centroid    Google search        “minerals”              no       91.33   96.67   98.67   99.33
  k-NN        Google search        “minerals”              yes      88      98      99.33   100
  k-NN        Google search        “minerals”              no       86.67   97.33   98.67   100
  k-NN        Google search        general                 no       82.67   90      92      93.33
  k-NN        Google search        general                 yes      80.67   89.33   91.33   93.33
  Centroid    Google search        general                 no       78      91.33   92      94
  Centroid    Google search        general                 yes      78      91.33   93.33   94.67
  Centroid    Google definitions   general                 no       76      77.33   78.67   79.33
  k-NN        Google definitions   general                 no       70      76      77.33   79.33
  k-NN        Google search        “site:wikipedia.org”    no       43.33   60.67   70      76
  k-NN        Google search        “site:wikipedia.org”    yes      42      57.33   60      66.67
  NGD         Google search        general                 yes      16      26      36.67   54.67
  NGD         Google search        “minerals”              no       12.67   21.33   29.33   42.67
  NGD         Google search        “minerals”              yes      11.33   22.67   36.67   58
  NGD         Google search        general                 no       8.67    18      21.33   30.67

                                 Table 4: The preliminary evaluation results.


GEMET. GEMET, or the “General Multilingual Environmental Thesaurus”,
was developed for the European Topic Centre on Catalogue of Data Sources (ETC/CDS) and the European
Environment Agency (EEA) as a general multilingual thesaurus on environmental subjects. It is organised
into 30 thematic groups that include topics on natural and human environment, human activities, effects on
the environment, and other social aspects related to the environment. It contains about 6,000 terms related
among each other with over 5,000 “broader than” and over 2,000 “related to” links for each of the languages.
The “related to” relation represents a weaker link than synonymy or “used alone for”, which is nicely
reflected in the results. We subsampled the dataset to 1,000 English terms for each of the two relations.

Tourism ontology. The tourism ontology describes various aspects of tourism and commercial services
related to it. It consists of 710 concepts – linked together with the “is-a” relation – and a large corpus of
tourism-related Web pages annotated with the concepts. Almost all of the annotated objects are named
entities such as place names, people, etc. As can be seen from the results, the inclusion of named entities
significantly improves the matching accuracy, while the accuracy of matching just concept descriptors
between themselves lies below average. The ontological part of the dataset was used in full in the
experiments, while the annotation data was subsampled to 1,000 named entities annotated with about 60
concepts.

WordNet. WordNet is a lexical database for the English language. The smallest entities in the database are
the so-called “synsets”, which are sets of synonymous words. Currently it
contains about 115,000 synsets which form over 200,000 word-sense pairs; a word-sense pair represents a
word with the corresponding meaning. Synsets are tagged as nouns, verbs, adjectives, or adverbs. Nouns are
linked together with hypernymy, hyponymy, holonymy, and meronymy (distinguishing between substance,
part-of, and member meronymy); verbs are linked together with hypernymy, troponymy, and entailment;
adjectives are linked to related nouns. This is by far the largest dataset we used. We performed over half of
the experiments on this data. Almost all of the relations were included in the evaluation and some of them
were included two times in order to test the effects of exchanging the left and right-hand side inputs to the
algorithms. Each experiment was performed on a sample of 1,000 related words which were selected
independently for every test.

Table 5 gives the overview of all the experiments. For each of the experiments it lists the corresponding
dataset, relationship, direction of the relationship, and an example. The experiment “gemet-bt”, for instance,
includes the “broader than” relation from the GEMET dataset. The example given in the table is “‘traffic
infrastructure’ is a broader term than ‘road network’”. The experiment “wordnet-hypn”, on the other hand,
includes the hypernymy relation from WordNet. The corresponding example given in the table is “the term
‘imaginary creature’ is a hypernym of the term ‘monster’”.

Experimental setting

Following the preliminary experiments presented in Section 7.3.1, we did not limit the search to Wikipedia
or Google definitions. We also excluded NGD from the evaluation. The main reason is its lack of scalability:
it queries the search engine for term co-occurrence frequencies, which would result in approximately
1,000,000 (i.e. 1,000 × 1,000) queries for each of the experiments. Text-based classifiers, on the other hand,
issue only about 2,000 queries per classifier per experiment.

We avoided specifying search contexts because choosing a context requires domain-specific knowledge and
is therefore human-dependent. We wanted to see what accuracy can be achieved without human
involvement.

We therefore varied the classification algorithm (either k-NN or the centroid classifier) and whether or not to
put terms into quotes when querying the search engine. The top 30 search results were considered when
querying the search engine. Only English search results were considered. In the case of k-NN, k was set
dynamically by taking into account all the documents within a cosine similarity of at least 0.06. The final
outcome of a k-NN algorithm was computed as the sum of all the cosine similarities measured between a
target document and a document from the corresponding neighbourhood; a sketch of this scoring scheme is
given below. The results are presented in the following section.
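
The dynamic-k scheme just described can be sketched as follows, assuming unit-length TF-IDF document
vectors (so that a dot product equals cosine similarity); the function name and data layout are illustrative
rather than AutoBridge's actual interface.

import numpy as np

def knn_scores(target_vecs, source_vecs, source_terms, threshold=0.06):
    """Score each source term against one target term.

    target_vecs  : vectors of the documents retrieved for the target term
    source_vecs  : vectors of all documents retrieved for the source terms
    source_terms : source_terms[i] names the term source_vecs[i] belongs to
    """
    scores = {term: 0.0 for term in source_terms}
    for target in target_vecs:
        for vec, term in zip(source_vecs, source_terms):
            sim = float(np.dot(target, vec))
            if sim >= threshold:      # the document falls within the neighbourhood
                scores[term] += sim   # evidence accumulates as a sum of similarities
    return scores

# Sorting scores.items() by value in descending order yields the ranked
# list of source terms that is presented to the user.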

Results

The results of the evaluation are given in Figure 8. Each bar in the chart represents one experiment. An
experiment is determined by the corresponding experimental setting (see Table 5), the corresponding
classification algorithm, and the flag indicating whether the search terms were put into quotes. The first bar,
for instance, corresponds to the “gemet-bt” experimental setting, the centroid classifier was used (indicated
with “cen”), and the search terms were not put into quotes (indicated with “f” as in “false”). The fourth bar,
on the other hand, corresponds to that same experimental setting, but k-NN was used (indicated with
“kNN”), and the search terms were put into quotes (indicated with “t” as in “true”).

The chart bars are sorted according to the experimental setting so that four consecutive bars correspond to
one experimental setting. The first two bars in the quadruple represent the accuracy of the centroid classifier
while the second two represent the accuracy of k-NN. The first bar in each of the two pairs corresponds to
not putting terms into quotes while the second corresponds to putting terms into quotes.


 Experiment      Dataset            Relationship                          Direction   Left-hand side example   Right-hand side example
 gemet-bt        GEMET              broader than                          ←           road network             traffic infrastructure
 gemet-rel       GEMET              related to                            ↔           mineral deposit          mineral resource
 stinet-nt       STINET             narrower than                         ←           alkali metals            potassium
 stinet-uaf      STINET             used alone for                        ←           gauss-seidel method      numerical methods and procedures
 tourism-onto    Tourism ontology   is-a                                  →           gliding field            sports institution
 tourism-annot   Tourism ontology   instance-of                           →           Maastricht               city
 wordnet-csv     WordNet            cause (verbs)                         →           do drugs                 trip out
 wordnet-entv    WordNet            entails (verbs)                       →           snore                    sleep
 wordnet-hypn    WordNet            hypernym (nouns)                      ←           monster                  imaginary creature
 wordnet-hypv    WordNet            hypernym (verbs)                      ←           Europeanize              modify
 wordnet-insn    WordNet            instance-of (nouns)                   →           Cretaceous period        geological period
 wordnet-mmn     WordNet            member meronym (nouns)                →           Neptune                  solar system
 wordnet-mpn     WordNet            part meronym (nouns)                  →           shuffling                card game
 wordnet-msn     WordNet            substance meronym (nouns)             →           water                    frost snow
 wordnet-mmni    WordNet            member meronym (nouns, inverted)      ←           Girl Scouts              Girl Scout
 wordnet-mpni    WordNet            part meronym (nouns, inverted)        ←           pressure feed            oil pump
 wordnet-msni    WordNet            substance meronym (nouns, inverted)   ←           rum cocktail             rum
 wordnet-syn     WordNet            synonym                               ↔           homemaker                housewife

                                 Table 5: The overview of the experiments.


Each bar consists of 6 sections – differentiated by different colours – corresponding to different accuracy
metrics (see the legend in Figure 8). Since the accuracy of a wider-range metric is always higher than that of
a narrower-range metric (e.g. the accuracy at top 10 is always higher than the accuracy at top 5), it is
convenient to present the results according to all 6 different metrics in a single bar. Note that the top of the
bar section indicates the accuracy according to the corresponding evaluation metric. If we consider the first
bar, for example, we can see that when employing the centroid classifier without putting terms into quotes in
the “gemet-bt” experimental setting, the correct annotation is automatically determined in roughly 28 % of
all the cases. Furthermore, the correct annotation is one of the top 3 suggestions in the sorted list in roughly
45 % of all the cases. If we are willing to consider top 40 suggestions each time, the correct annotation can
be determined in roughly 78 % of the cases.

From the evaluation results we can conclude that the term’s lexical category (i.e. noun, verb, …) has the
largest impact on the accuracy of the evaluated algorithms. It can be seen that the datasets containing verbs
yield by far the poorest performance. This is because the algorithms induce similarities from the contexts
defined by the documents corresponding to the terms, and these contexts are very heterogeneous in the case
of a verb.

It is also possible to conclude that the selection of the dataset has a much larger influence on the accuracy of
the model than the approach used for constructing it. Within each dataset, the performance of the k-NN
classifier is usually slightly worse than that of the centroid-based classifier, which is also significantly faster.
The use of quotes is beneficial mostly with datasets that contain a large number of expressions, such as the
STINET thesaurus, which contains mostly technical expressions. On the other hand, quotes can be
detrimental to the accuracy when a term is already fully determined by a single word and is thus ordinarily
referred to by only that word, e.g. “genus Zygnematales” is often referred to simply as “Zygnematales”.

Swapping the left and right-hand side inputs had no significant impact on the accuracy, as can be seen in the
WordNet’s meronymy experiments.

We expected high performance on the datasets where the contexts of matching terms have large overlaps.
This is especially true for synonyms and STINET’s “used alone for” relation. The experiments confirmed
our expectations, especially in the case of the “used alone for” relation, which yielded a 95 % accuracy at top
40. Named entities also have very well defined contexts, especially compared to ordinary nouns; therefore
datasets that contain many named entities exhibit a better accuracy, as is the case with WordNet’s member
meronymy relationship.

To sum up, the centroid classifier outperforms k-NN in most of the cases. The accuracy is high for
synonymous named entities and nouns, and lower for the other kinds of relations (meronymy, hypernymy,
…). The accuracy is especially low when dealing with verbs instead of nouns (regardless of the relation).
[Figure 8: The accuracy of the evaluated algorithms in the large-scale evaluation. Bar chart: the y-axis shows
the accuracy (0–1); the legend distinguishes the accuracy at TOP 1, TOP 3, TOP 5, TOP 10, TOP 20, and
TOP 40; the x-axis enumerates the experiments from Table 5, each in four variants: centroid (“cen”) or
k-NN, without (“f”) or with (“t”) quotes.]
                                                                                                                                                                                                                                                                                                                                                                                                                        dn r dn
                                                                                                                                                                                                                                                                                                                                                                                                                                o     w      w       or                                                              or                            or        o    w       w     w      w
                                                                                                                             tu         t     tu        tu                                                                                     w         w                   w        w                     w wo w o                                   w       w
                                                                                                                                                                                                                                                                                                                                                                 or w or wor
                                                                                                                                                                                                                                                                                                                                                                                         w          w
                                                                                                                                                                                                                                                                                                                                                                                                      o w     w                                    w       w
                                                                                                                                                                                                                                                                                                                                                                                                                                                             or w or wor ord or d w o wo
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                w       w                          w         w
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               or wor w or
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 w        w
                                                                                                                                                                                                                                                                                                                                                                                                                   w        w




                                                                                                                                                                                                                    Figure 8: The results of the large-scale evaluation
7.4     Cross-language Term Matching

When dealing with ontology annotation, mediation, and/or mapping, the ontologies participating in the
process may be in different languages. This presents a difficulty for the ontology engineer who is responsible
for establishing and/or verifying the annotations or the mappings. In SWING we need to annotate Feature
Type Ontologies (FTO) [ref] with "systems" of concepts from the domain ontologies (i.e. annotations).
One of the problems occurs when an FTO is in a language other than the language of the domain ontologies.
If this is not handled properly, it prevents the text-mining module from providing useful support to the user.
Another problem occurs when the ontology engineer does not understand one of the languages involved,
which makes it very hard for him/her to make sense of the ontology.

In this report we discuss two different techniques for dealing with multi-lingual ontologies. We show how
these techniques can be used to employ the text-mining algorithms – discussed and implemented as part of
SWING D4.2 [ref] – even in a multi-lingual scenario.

The first technique we present is so-called statistical machine translation (SMT) (see Section [ref]).
Translators based on SMT are trained on a corpus of aligned sentences (each pair consisting of a sentence in
the source language and its translation in the target language; e.g. an English sentence and the corresponding
French translation) and a corpus of documents in the target language. The former is used to train the
so-called translation model and the latter is used to train the so-called language model.

Combining these two models leads to an optimization problem where the task is to find the translation which
is (a) most probable given the source sentence (determined by the translation model) and (b) most probable
given the nature of the target language (determined by the language model).
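Formally, this is the standard noisy-channel formulation of SMT (stated here in the usual textbook notation,
not notation taken from the SWING deliverables): for a source sentence f, the chosen translation is

    \hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model}} \cdot \underbrace{P(e)}_{\text{language model}}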

The other approach is canonical correlation analysis (CCA) (see Section [ref]). Here, the idea is to first
represent each document with the corresponding TF-IDF feature vector.
Having two languages, this results in two vector spaces. By using CCA we are then able to compute a
common space and two mapping functions (one for each language) that enable us to project a document
from its native space into the common space. One of the characteristics of the common space is that
documents that are similar in content, even if they are in different languages, lie close together. In order to be
able to compute the common space and the mapping functions, the (training) documents need to be aligned,
i.e. each document in one language needs to have its corresponding translation in the other language.
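To illustrate the idea (a minimal sketch, not the SWING implementation; the toy corpora, the query, and all
variable names are invented for the example), scikit-learn's CCA can learn the two mapping functions from a
small aligned corpus and then rank English documents against a French query in the common space:

    from sklearn.cross_decomposition import CCA
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy aligned corpus: document i in English corresponds to document i in French.
    en_docs = ["the quarry produces sand and gravel",
               "protected regions restrict quarrying",
               "population density in the department"]
    fr_docs = ["la carriere produit du sable et du gravier",
               "les zones protegees limitent l'exploitation des carrieres",
               "la densite de population dans le departement"]

    # One TF-IDF vector space per language.
    en_vec = TfidfVectorizer().fit(en_docs)
    fr_vec = TfidfVectorizer().fit(fr_docs)
    X_en = en_vec.transform(en_docs).toarray()
    X_fr = fr_vec.transform(fr_docs).toarray()

    # CCA learns the mappings (wEn, wFr) into a common space in which aligned,
    # i.e. content-wise similar, documents lie close together.
    cca = CCA(n_components=2).fit(X_en, X_fr)

    # Project a French query and the English documents into the common space and
    # rank the English documents by cosine similarity to the query.
    query_fr = fr_vec.transform(["production de sable"]).toarray()
    en_common, query_common = cca.transform(X_en, query_fr)
    print(cosine_similarity(query_common, en_common))  # similarity to each English doc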

7.4.1    Different Applications of Machine Translation in SWING

The core idea of the machine-aided annotation in SWING is based on term matching through grounding
terms by querying a Web search engine. Grounding a term means that we collect a set of documents and
assign them to the term. In our case, the terms are the concept and relation labels in the domain ontology.
With the ontology grounded, it is possible to compare a user query, being either a feature type/property name
or a query in natural language, to the grounded concepts and relations by using the cosine similarity
measure (see [ref] and [ref] for more details). It is also possible to define a sub-ontology that corresponds
to the query by employing a PageRank-like algorithm, interpreting the ontology as a graph and drawing the
cross-relations between the query and the ontology by using the cosine similarity measure. The latter will be
explored in more depth in the third project year as part of SWING D4.5: Software Module for Semantic
Annotation.
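To make the grounding idea concrete, here is a minimal sketch (the search-engine call is mocked, since the
actual retrieval component and its API are not part of this text): each term is grounded by a handful of
documents, and two terms are compared through the cosine similarity of their groundings.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def ground(term):
        """Mock of the Web-search grounding step: in the real system this would
        query a search engine and return the top-k documents for the term."""
        fake_index = {
            "quarry":  "a quarry is a site for extracting stone, sand or gravel",
            "mine":    "a mine is a site for extracting minerals from the earth",
            "tourism": "tourism is travel for recreation and leisure",
        }
        return fake_index.get(term, term)

    terms = ["quarry", "mine", "tourism"]
    groundings = [ground(t) for t in terms]

    # Represent each grounding as a TF-IDF vector and compare the terms pairwise.
    tfidf = TfidfVectorizer().fit_transform(groundings)
    print(cosine_similarity(tfidf))  # "quarry" and "mine" end up far closer than "tourism"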

The problem occurs when the query is in a different language than the ontology. It is not possible to compare
two terms (their groundings, to be exact) that are not in the same language, as the cosine similarity between
two documents in different languages is always [close to] zero. This is because the cosine similarity is based
on the mutual term occurrence and if the two documents are in different languages, almost none of the terms
occur in both documents. This situation occurs when the service the user is about to annotate is in a different
language than the ontology, and also when the user writes a natural language query in his/her native
language that is different from that of the ontology. In this section we show how MCCA (the multi-lingual
extension of CCA; see Section [ref]) and SMT (see Section [ref]) can be applied to deal with these issues in
SWING.

We will first show how MCCA can be employed for multi-lingual querying of an ontology. For simplicity let
us assume that we need to post a French query into an English ontology. For this we need to have an aligned
corpus of French and English documents where every English document has its corresponding French
translation (and vice versa, obviously). MCCA uses this corpus to train two mapping functions wFr and wEn
that project French and English documents, respectively, into a common space. One of the characteristics of
the common space is that documents that are similar in content lie close together even if they are in different
languages; therefore we are able to find English documents that are similar to the French query. Figure 9 and
Figure 10 illustrate the training of the mapping functions with MCCA, and the application of MCCA in the
SWING scenario, respectively.




         Figure 9: Training mapping functions with             Figure 10: Employing MCCA in SWING.
         MCCA.



In SWING we plan to employ SMT instead of MCCA. The main reason for this decision is the lack of an
aligned multilingual corpus in SWING. To train MCCA, we would need a fairly large corpus of documents
where each document is translated into all the supported languages. The documents in the corpus should talk
about the domain (i.e. quarrying and protected regions) rather than discuss other topics. In addition to this,
SMT has another advantage. If applied properly, it enables the user to view the ontology in a language other
than English. This is a desirable property if the user is a non-English speaker.

There are at least two ways to apply SMT. In the first we take the grounded English ontology and translate
all the groundings into French (we assume here that we are dealing with the same setting as before –
querying the English ontology with a French query). We therefore get a French-grounded ontology which is
compatible with the query. With this approach – illustrated in Figure 11 – the concept and relation labels
are not translated, and thus the ability to view the ontology in French does not come as a side effect.
 Figure 11: The first SMT-based approach: translating groundings. The translated ontology is in both
 languages (hence "En/Fr"): the labels remain in English while the groundings are in French.

 Figure 12: The second SMT-based approach: translating concept and relation labels, and grounding the
 translations through a French Web search engine.


We could use SMT to translate the labels in addition to translating groundings; however, it is possible to
follow another SMT-based approach that produces French concept and relation labels in the process.

Figure 12 illustrates the second SMT approach. Here, SMT is only used for translating the concept and
relation labels into French, and the French groundings are obtained by querying a French search engine (e.g.
we limit Google to French pages). Having the labels translated, the user can view the ontology in French.
This approach has two further advantages over the first SMT-based one. First, SMT is computationally more
demanding than the inverted index employed in the search engine, and each concept/relation has many (e.g.
100) documents in its grounding; it is therefore preferable to send only the concept and relation labels (i.e. N
words or phrases) through SMT rather than translating all the groundings (i.e. N times 100 documents).
Second, we do not even need a full-blown SMT system if we are only translating words and phrases; a
dictionary coupled with several heuristics would suffice. However, this approach has one weakness: if the
translation of a label is incorrect, the grounding is false as well. Luckily, we can mitigate this risk by
allowing the user to edit the translations prior to grounding.

In SWING we will employ the second SMT-based approach as it has several advantages over the other two
approaches discussed in this section. The reader is encouraged to try out a simple demo available at
http://ropot.ijs.si/SwingTranslationTool/. The translator [ref] employed in this demo will also be used in the
SWING software prototype.

7.5   Automating the Annotation Process

In D4.2 (Chapters 3 and 4) [ref], we developed and evaluated several term-matching techniques that will
serve as building blocks for automating the annotation process in SWING. The term matching is envisioned
as follows:

         •   Each term is assigned a set of documents. We say that the term is grounded with a set of
             documents.

         •   The documents are converted into their bag-of-words representations (i.e. TF-IDF vectors) and
             thus the term similarity assessment is transitioned into the bag-of-words space (a.k.a. vector
             space, semantic space).

         •   A classifier is trained to distinguish between the terms. A new term can then be grounded,
             represented as a bag of words, and fed to the classifier. The classifier has the ability to tell how
             similar the new term is to each of the terms on which the classifier was trained.

This process is illustrated in Figure 13.
[Figure: the three steps of the term matching process – (1) grounding, (2) bag-of-words representation, (3) classification]
                                  Figure 13: The term matching process.
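A minimal sketch of this pipeline (assuming a simple centroid classifier; the training terms and documents
are invented for the example, whereas the real system grounds terms via a Web search engine):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestCentroid

    # Step 1: each training term is grounded with a set of documents.
    groundings = {
        "quarry": ["site for extracting stone", "open pit for sand and gravel"],
        "river":  ["natural watercourse", "stream flowing to the sea"],
    }
    docs, labels = [], []
    for term, term_docs in groundings.items():
        docs.extend(term_docs)
        labels.extend([term] * len(term_docs))

    # Step 2: move from documents to the bag-of-words (TF-IDF) space.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    # Step 3: train a classifier to distinguish between the terms.
    clf = NearestCentroid().fit(X.toarray(), labels)

    # A new term is grounded, represented as a bag of words, and classified.
    new_grounding = ["an excavation from which rock is extracted"]
    print(clf.predict(vectorizer.transform(new_grounding).toarray()))  # -> 'quarry'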

In D4.5 (this report and the accompanying software), we employ and evaluate the term matching techniques
for the annotation in SWING. In addition, we explore how the ontology structure can be taken into account
to improve the accuracy of the annotation recommender system. But first, we discuss how the documents for
groundings can be automatically collected through a Web search engine and included in the ontology
description.

7.5.1    Labels and Groundings in Resource Description Framework (RDF)

In the project, we recently decided to "communicate" ontologies to and from Visual OntoBridge (VOB) by
describing them in the Resource Description Framework (RDF) language [ref]. This decision was made
mainly due to the greater availability of software libraries for handling RDF compared with those for
handling WSML. The SWING Ontology Repository [ref] is expected to serve RDF in addition to/rather than
WSML. There exists a standardized way to encode a WSML ontology description as a set of RDF triples
and vice versa [ref]; thus, this decision does not represent a conceptual change.

When an RDF document or a set of RDF documents is loaded into VOB, the following procedure is executed:

    1. Each concept/relation is ensured to have an English label.

    2. Each concept/relation is ensured to have a label in every supported language.

    3. Each concept/relation is ensured to be grounded in the selected language.

In the following we look at this process in more detail.

Labels

#...#
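For illustration (a hypothetical sketch using standard RDFS conventions; the exact encoding adopted in
SWING may differ), multilingual labels can be attached to a concept with language-tagged rdfs:label
properties:

    <rdf:Description rdf:about="http://swing.brgm.fr/ontologies/quarryFTO#Quarry">
        <rdfs:label xml:lang="en">quarry</rdfs:label>
        <rdfs:label xml:lang="fr">carrière</rdfs:label>
    </rdf:Description>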

Groundings

#...#

7.5.2    Training a Classifier

•   Preprocessing of documents

•   Classifiers: centroid, k-NN, SVM
7.5.3      Incorporating Ontology Structure

      •    Page Rank in general (the Web)

      •    Ontologies as networks (concept networks)

              o   Concepts only

              o   Concepts and relations

              o   Concepts and relations plus direct connections between concepts

              o   Which one to choose depends on what works best (a minimal ranking sketch follows below)
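As a sketch of the general idea (a plain power-iteration PageRank over a toy concept network; the graph,
damping factor, and iteration count are illustrative assumptions, not the SWING configuration):

    def pagerank(graph, damping=0.85, iterations=50):
        """Power-iteration PageRank over a concept network given as an
        adjacency dict {node: [neighbour, ...]}."""
        nodes = list(graph)
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
            for n, neighbours in graph.items():
                targets = neighbours if neighbours else nodes  # dangling node: spread uniformly
                share = damping * rank[n] / len(targets)
                for m in targets:
                    new_rank[m] += share
            rank = new_rank
        return rank

    # Toy concept network: concepts linked by subConceptOf / relation edges.
    concept_net = {
        "IndustrialSite": ["Quarry"],
        "Quarry": ["QuarryProduct", "QuarryLocation"],
        "QuarryProduct": [],
        "QuarryLocation": [],
    }
    print(pagerank(concept_net))  # ranks concepts by their centrality in the network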

7.6       Evaluation of Automatic Annotation Methods

      •    Under evaluation:

              o   Grounding method (heuristics, grounding set size)

              o   Text preprocessing method (best practices)

              o   Classification algorithm (centroid, k-NN, SVM)

              o   Network type and Page Rank parameters

7.6.1      Golden Standard

      •    UC1 annotations

      •    UC2 annotations (?)

7.6.2      Evaluation Metric

#Glue#

We evaluated the quality of the annotation recommender system by computing the area under the ROC curve
with respect to the provided golden standard. Given the top n items of a ranked list of recommendations, the
ROC curve plots the true positive rate TPR (the percentage of all golden-standard items that appear among
the top n items) against the false positive rate FPR (the percentage of all non-golden-standard items that
appear among the top n items). The ROC curve is defined as ROC(n) = (TPR, FPR). Obviously ROC(0) = (0%, 0%)
and ROC(N) = (100%, 100%), where N is the number of all items. If the list is randomly shuffled, TPR is
close to (or equals) FPR at each n; in that case, the area under the curve is 50% of the optimal area. The
optimal area is achieved if all golden-standard items are at the top of the list; in that case, there exists an m,
0 < m < N, such that ROC(m) = (100%, 0%). These properties of the ROC curve are illustrated in Figure 14.
                                 Figure 14: Basic properties of the ROC curve.
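A minimal sketch of this evaluation (the ranked lists and golden-standard sets are invented; the code simply
accumulates the area of the step curve defined above):

    def roc_auc(ranked, golden):
        """Area under the ROC curve for a ranked recommendation list,
        where `golden` is the set of golden-standard items."""
        pos = sum(1 for item in ranked if item in golden)
        neg = len(ranked) - pos
        tpr, area = 0.0, 0.0
        for item in ranked:            # one ROC point per prefix length n
            if item in golden:
                tpr += 1.0 / pos       # golden item: TPR rises, FPR unchanged
            else:
                area += tpr / neg      # non-golden item: FPR rises by 1/neg
        return area

    print(roc_auc(["a", "b", "c", "d"], golden={"a", "b"}))  # 1.0 (optimal ranking)
    print(roc_auc(["c", "a", "b", "d"], golden={"a", "b"}))  # 0.5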

7.6.3      Experimental Results

      •    Grounding methods and classifier

      •    Network type and Page Rank parameters

7.6.4      Conclusions

The conclusions will depend on the experimental results.

7.7       Visual OntoBridge

…

7.8       Geo-data Mining

7.9       Conclusions and Lessons Learned
8       Geospatial Catalogue, Semantic Discovery - Philippe Duchesne, ERDAS
This chapter explains why and how the geospatial domain can benefit from the use of semantic technologies.
It first describes the different use cases that justify the use of those technologies, particularly for the
discovery of services and other resources. Then it describes technically how to insert the ontologies devised
in the previous chapter into a CS/W Catalogue, and how to enhance the query language of this Catalogue so
that it can be queried semantically in an efficient way. Finally, the implementation of this new query
language using the WSMX platform is described.

8.1      The need for semantic metadata / Semantic tagging

#...#

8.2      Importing Ontologies

#...#

8.2.1     The needs

The web services annotations, as produced by the annotation engine, are expressed using domain ontologies
(Figure 15). Therefore, before being able to actually store and query the annotations in the catalogue, the
domain ontologies must be made available in the catalogue.


           [Figure: excerpt of the domain ontology – an IndustrialSite has a Location (hasLocation) and an
           industrial activity with a Production of a certain Quantity; the concept Quarry specializes
           IndustrialSite, with range constraints restricting hasLocation to QuarryLocation and produces to
           QuarryProduct, which has a ProductionRate]

                concept exploitationsponctualsproduction subConceptOf gml#Feature
                  msgeometry impliesType (1 1) gml#GeometryPropertyType
                  exploitationname impliesType (1 1) _string
                  communities impliesType (1 1) _string
                  substance impliesType (1 1) _string
                  year impliesType (1 1) _string
                  allowedproduction impliesType (1 1) _string
                  sitename impliesType (1 1) _string
                  sitetype impliesType (1 1) _string

           Figure 15: The annotations are related to the domain ontologies (source: D4.1).


The following paragraphs detail exactly which issues must be taken into account when integrating ontologies
in an ebRIM catalogue, before proposing solutions to achieve this.

ebRIM modelling

To reference ontology concepts when storing annotations, the ontologies must first be themselves stored in
the catalogue. The OGC catalogue specification [ref], built on the ebRIM metamodel, does not explicitly
provide a way to store such ontologies.

The ebRIM specification provides a set of object types (namely the ClassificationSchemes, Concepts and
Classifications) that are meant to model simple concept trees and to classify other ebRIM objects using those
concepts. Those ebRIM constructs are already used in the OGC specifications, for instance to classify
resources using the ISO taxonomies.

However, these ebRIM constructs allow only for very simple ontologies, because:
    •   they can only represent concept trees, as opposed to the graphs supported by most ontology
        representation languages, and
    •   they do not provide for the expression of concept equivalence, multiple subclassing, transitivity of
        properties, and other features that are usually found in languages like OWL.

Referencing external concepts

Most use cases involve the use of widely accepted domain ontologies, published online and identified by
URIs. This is the case in SWING as the ontologies produced by Work Package 3 will be published and
potentially referenced using online URIs. There is therefore a need to be able to classify metadata resources
(such as the annotations) using well known concepts’ URIs.

Although the ebRIM model provides a way to reference external resources, there is no standard way to
reference concepts and ontologies defined in external documents (OWL, WSML, …), recognizing the
specific semantic relations that bind them.

There should be a way to express such external semantic references, using the same naming scheme as in
OWL or WSML, i.e. a namespace identifying the ontology, augmented by a concept name to uniquely
identify a concept:


               http://swing.brgm.fr/ontologies/quarryFTO#Quarry
               i.e. the ontology namespace (http://swing.brgm.fr/ontologies/quarryFTO) followed by the concept ID (Quarry).

Performing basic semantic inference

Lastly, there is a need to perform some semantic inference within the catalogue.

Although in the SWING architecture the catalogue delegates most of the semantic discovery to the WSMX
platform, adding basic semantic inference capabilities to the catalogue is relevant because:

    •   For the same reason that the catalogue is responsible for performing spatial filtering, basic inference
        at the catalogue level can significantly pre-filter the results before delegating to WSMX, whose
        reasoning is known to be time-consuming.
    •   There is demand, in the GIS and OGC communities, for a semantic extension to the OGC catalogue;
        as standardization issues are one of the topics of the SWING project, exploring the possible semantic
        extensions of the OGC catalogue implementation and specification is part of the scope of this
        deliverable.

Before plunging into the technical details of the implementation of such semantic inference capabilities, one
must define what such 'basic' inference capabilities are, and what it is reasonable to expect from an OGC
catalogue.

Regardless of the SWING project specificities, there is a growing need in the GIS community to be able to tag
and then discover resources (services, datasets, styling information, …) using domain ontologies (new or
pre-existing) that are already widely accepted in the domain at stake. Furthermore, in domains such as
Decision Support, Emergency Response, Urban Planning, and Military Mission Planning (to name a few), there
is a need to use several domain-specific taxonomies together in the same use case, linking them
with equivalence, super-concept or sub-concept relationships.

Considering the needs of these real-life use cases of the OGC catalogue, the optimization needs in the
SWING project, and the fact that concept trees in the catalogue are used almost exclusively to classify and
then discover instances of metadata, one can list the following semantic capabilities:

    •   find by concept, or any subconcept
    •   find by concept, or any equivalent concept
    •    more generically, find by a concept satisfying a specific relationship to another concept, taking into
         account the transitivity or reciprocity of relationships where they apply

Therefore, the semantic inference requirements are limited to sets of concepts, and never involve reasoning
on instances of data. Performing inference only on the concept graphs, while ignoring the metadata instances
is thus sufficient.
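A minimal sketch of the first two capabilities listed above (the concept tree and the equivalence pairs are
invented for the example; in the catalogue these lookups would of course run against the registry content):

    def subconcepts(concept, children):
        """Transitive closure of the sub-concept relation:
        the concept itself plus all direct and indirect subconcepts."""
        result = {concept}
        for child in children.get(concept, []):
            result |= subconcepts(child, children)
        return result

    def with_equivalents(concepts, equivalences):
        """Extend a concept set with all declared equivalent concepts."""
        result = set(concepts)
        for a, b in equivalences:
            if a in result: result.add(b)
            if b in result: result.add(a)
        return result

    children = {"IndustrialSite": ["Quarry", "Mine"], "Quarry": ["SandQuarry"]}
    equivalences = [("Quarry", "Carriere")]

    # "find by concept or any subconcept", then "... or any equivalent concept"
    matches = with_equivalents(subconcepts("IndustrialSite", children), equivalences)
    print(matches)  # {'IndustrialSite', 'Quarry', 'Mine', 'SandQuarry', 'Carriere'}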

The following modelling proposals detail how such inference capabilities can be implemented in an ebRIM
registry.

8.2.2    Using the OASIS ebRIM/OWL profile

The first solution involves using an ebRIM profile for OWL representation that has recently been published
as an OASIS recommendation paper [ref].

This OASIS recommendation paper specifically addresses the mapping of OWL-Lite ontology structures into
ebRIM databases and proposes a standardized ebRIM/OWL mapping. It defines in detail how to store OWL
constructs in an ebRIM registry. It also defines a set of stored procedures at the database level, to implement
semantic operators within the ebRIM model.

This recommendation is therefore of great interest to the SWING project, since it directly addresses the
problem discussed here.

Object model

The object model proposed in the OASIS recommendation tries to represent every possible statement of
OWL-Lite using ebRIM constructs. In particular, it re-uses the built-in ebRIM ClassificationScheme and
Concept object types to represent the OWL Ontology and Class statements, respectively.

 <rdf:RDF
     xmlns="http://example.com/ontologies/gis_datatypes">

     <owl:Class rdf:ID="DataFormat"/>

     <owl:Class rdf:ID="Vector">
         <rdfs:subClassOf rdf:resource="#DataFormat"/>
     </owl:Class>
     <owl:Class rdf:ID="GML">
         <rdfs:subClassOf rdf:resource="#Vector"/>
     </owl:Class>
     <owl:Class rdf:ID="Shape">
         <rdfs:subClassOf rdf:resource="#Vector"/>
     </owl:Class>
     <owl:Class rdf:ID="GML3">
         <rdfs:subClassOf rdf:resource="#GML"/>
     </owl:Class>
     <owl:Class rdf:ID="GML2">
         <rdfs:subClassOf rdf:resource="#GML"/>
     </owl:Class>
 </rdf:RDF>

                   Figure 16: The mapping of a simple OWL-Lite construct into the ebRIM model.

The recommendation also defines a set of new ebRIM Association types, to properly store in ebRIM the
various OWL statements that contain the knowledge needed to later perform inference reasoning, such as
equivalence, transitivity, intersection, restriction (to name a few). With this set of new ebRIM constructs, the
full expressivity range of OWL-Lite can be mapped into ebRIM.

Since this mapping uses the standard ebRIM Concept objects, any OWL-Lite ontology mapped into an
ebRIM registry using this technique, can then be used to classify other ebRIM objects, such as annotation
objects for instance.
Stored procedures

The OASIS recommendation paper also proposes a set of stored procedures, defined at the database level, to
perform standard semantic operations, like finding subclasses, finding equivalences, taking transitivity
relations into account, and other capabilities typically found in an inference engine. These procedures are
sufficient to implement the semantic inference capabilities described above, and presuppose that the full
OWL-Lite structure is mapped into the ebRIM model.

Because they are defined at the database level, these operators can be applied to the whole database,
including the metadata instances (resulting from the insertion of annotations, service metadata or other forms
of metadata). However, inference-engine capabilities can be very time-consuming, and implementing them at
the database level can cause poor performance when dealing with large amounts of data, which is likely to
happen in a production catalogue. An alternative solution is described in section 8.2.3.

OWL-Lite

As stated, the OASIS recommendation is dedicated to the modelling of OWL-Lite constructs. Such
ontologies have far less expressivity than full description-logic ontology languages such as OWL-DL or
OWL-Full, not to mention languages such as WSML, which also supports logic programming.

But, as described earlier, the use of ontologies within the catalogue is limited to a very narrow and simple set
of inference operators, for which OWL-Lite ontologies are sufficient. So, even though the ontologies used
within SWING, and in particular within the WSMX platform, cannot be expressed using OWL-Lite
statements, they can be reduced to OWL-Lite ontologies as long as the concepts and ontology URIs are kept
to make the link across the various platforms (Figure 17).



          [Figure: the Catalogue stores the concept in OWL-Lite, while WSMX stores it in WSML; both
          representations carry the same URI (swing.brgm.fr#Quarry), which is the URI referenced by the
          annotations]

 Figure 17: An annotation document referencing a concept (swing.brgm.fr#Quarry), represented
 using different ontology languages in the two platforms, yet uniquely identified.

8.2.3   Using a third-party inference engine

Inspired by the previous technique, this section proposes a simpler model to store ontologies, using only the
ebRIM built-in constructs, i.e. simple concept trees, without any support for the OWL properties mentioned
above (equivalence, transitivity, …).

The main differences are:

    •   the OWL-specific properties are no longer stored in the database at all;
    •   for each ontology, only the concept tree is stored, i.e. the minimal set of knowledge needed to
        classify registry objects;
    •   to perform inference, a third-party inference engine (e.g. Jena) is bundled with the catalogue; and

    •   ontology documents (OWL, WSML) are stored as text files in the catalogue and linked to the
        corresponding concept trees, so that the full ontology is available when needed by the inference
        engine.

Using this technique, when a query involving semantic inference is sent to the catalogue, the bundled
inference engine is instantiated with all the ontology documents, and the semantic inference capabilities are
implemented by delegating them to this engine.

A catalogue query with semantic operators is therefore processed in three steps:

      1. first, the inference engine is initialized with the ontology documents stored in the catalogue;

      2. then the reasoning implied by the semantic operators in the query is carried out by the inference
         engine, and results in a set of discovered concepts;

      3. finally, those concepts are used when generating the actual database queries.
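Schematically (a sketch under stated assumptions: the InferenceEngine class is a toy stand-in for a bundled
reasoner such as Jena, and the SQL fragment is illustrative, not the catalogue's actual schema):

    class InferenceEngine:
        """Toy stand-in for a bundled third-party reasoner (e.g. Jena)."""
        def __init__(self, concept_tree):
            self.children = concept_tree      # parent URI -> list of child URIs

        def expand_concepts(self, concept):
            """Step 2: the reasoning implied by a 'concept or any subconcept' operator."""
            frontier, found = [concept], {concept}
            while frontier:
                node = frontier.pop()
                for child in self.children.get(node, []):
                    if child not in found:
                        found.add(child)
                        frontier.append(child)
            return found

    def process_semantic_query(concept_tree, concept_uri):
        engine = InferenceEngine(concept_tree)            # step 1: initialize the engine
        concepts = engine.expand_concepts(concept_uri)    # step 2: reason over concept trees
        placeholders = ", ".join("?" for _ in concepts)   # step 3: plain database query
        sql = "SELECT * FROM classifications WHERE concept_uri IN (%s)" % placeholders
        return sql, sorted(concepts)

    tree = {"ns#IndustrialSite": ["ns#Quarry"], "ns#Quarry": ["ns#SandQuarry"]}
    print(process_semantic_query(tree, "ns#IndustrialSite"))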

By decoupling the semantic inference from the database querying, the performance is greatly improved: the
database does not have to cope with complex inference procedures that do not fit into traditional database
query languages, and the inference engine does not have to take the whole set of metadata into account –
only the concept trees – thereby limiting the solution space. This improvement is possible only under the
assumption, expressed earlier, that the semantic inference is done only on the concept-tree level.

8.2.4      Extending to WSML

So far, the proposed solutions have not considered the ontology representation language specific to the
SWING project, namely WSML. However, as stated above, the solutions proposed here are sufficient
to model OWL-Lite or equivalent languages, and this is sufficient for the semantic processing done at the
catalogue level.

It must also be noted that those solutions are not bound to OWL-Lite per se, but only to the expressiveness of
OWL-Lite.

8.2.5      Standardization

To ensure interoperability between ‘semantic’ catalogues, the object model used to store ontologies should
be proposed as an extension to the OGC CS/W ebRIM specification.

The CS/W ebRIM specification provides a way to extend it, in the form of extension packages that can
define profiles of the ebRIM model tailored to certain use cases. This notion of extension packages was
introduced during 2007, and is now part of the newly voted CS/W specification (accepted in February
2008). Several extension packages are already under development, most notably for the ISO 19115 object
model and for the FeatureTypes object model.

Such an extension package should also be defined for ontologies, and should specify:

      •    How to reference external concepts in ebRIM objects, using URIs.

      •    Which subset of the OWL model to explicitly store in ebRIM (e.g. only concept trees if the second
           proposal – using an inference engine – is retained).

      •    Which operators or ad hoc procedures should be added to perform semantic inference (the need for
           such operators will be covered in a later section).

8.3       Storing Annotations in the catalogue

#...#

8.3.1      Annotation object model

Below is a sample of a service annotation output by the annotation engine. In this example, the annotation
describes a WFS offering basin features.
   wsmlVariant _"http://www.wsmo.org/wsml/wsml-syntax/wsml-flight"
   namespace {...}

   webService quarryWFS1
       nonFunctionalProperties
           dc#creator hasValue "s.schade (IFGI)"
           dc#title hasValue "TestWFS Basins"
           dc#type hasValue "WFS Webservice Description"
           dc#subject hasValue {"schemaca", "brgm", "quarries", "quarry", "france"}
           dc#source hasValue "http://swing.brgm.fr/cgi-bin/carrieres?service=wfs&version=1.0.0&request=DescribeFeatureType&typename=basins"
       endNonFunctionalProperties

   importsOntology {fto#quarryFTO, annot#Annotation, qu#Quarries}

   capability testWFS_capability

       sharedVariables {?feature}
       postcondition getFeature_postcondition
       definedBy
           ?x[msGeometry hasValue ?y, basinName hasValue ?z] memberOf fto#Basins and
           ?r[hasLocation hasValue ?p] memberOf qu#ConsumptionBasin and
           qu#names(?q, ?r) and [...]




In these annotations, one can recognize DublinCore properties (prefixed with dc#) and, most importantly,
the capability element containing the result of the annotation process, and referencing concepts in the
domain ontologies (such as fto#Basins or qu#ConsumptionBasin) to describe the WFS.

To be able to store annotations in the catalogue, new ebRIM object types are created, namely:

    •     WSML#WebService: will contain the DublinCore non-functional properties, and will be linked:
              o    on the one hand, to the regular WFS ebRIM object, resulting from the standard harvesting of
                   an OGC WFS in the catalogue, and
              o    on the other hand, to a Capability instance, as described below.
    •     WSML#Capability: will contain the original WSML document, and will be linked to the domain
          ontology concepts that describe the capabilities.

Also, two new ebRIM association types are created:

    •     WSML#HasAnnotation: describes associations that link a standard OGC Service object to a
          WebService object.
    •     WSML#HasCapability: describes associations that link a WebService object to a Capability
          object.
(In this notation, the WSML prefix references the namespace in which the new types are created.)

8.3.2     Harvesting annotations

Harvesting, in an OGC catalogue, is the name of the process that fetches a resource, parses it and stores it in
the ebRIM model using the appropriate mapping.

To harvest annotations, the catalogue deployed in the scope of the SWING project was enhanced with a
specific module that can parse WSML documents. When harvesting such a document, the catalogue not only
stores it using the mapping defined above, but also:

    •     reads the dc#source property, which is supposed to contain the service URL, and checks whether that
          service is already registered in the catalogue; if not, the service too is automatically harvested using
          the standard OGC harvester;
      •    (optionally) checks whether the domain ontologies are already known in the catalogue, and harvests
           them if needed;

      •    links the new WSML#WebService object to the OGC Service object (either created in the first step
           or already present in the catalogue).

Thus, after harvesting an annotation document, the following structure must be present in the catalogue:



    ogc#Service  --WSML#HasAnnotation-->  WSML#WebService  --WSML#HasCapability-->  WSML#Capability




8.3.3      Synchronizing with WSMX

In the global architecture of the SWING project, it is planned that the catalogue and the WSMX platform
share a common repository of resources (annotations, services, ontologies). However, the medium-term
solution still relies on distinct repositories. Hence, there is a need for a synchronisation mechanism that
guarantees the equivalence of the two repositories.

To tackle this issue, the catalogue annotation harvesting component also interacts with the WSMX platform
to check whether the annotation document is already registered within the WSMX platform, and registers it
if needed.

#add links to chapter 6#

8.4       Interfacing a CS/W Catalogue with the WSMX platform

Deliverable D5.3 (to be written – section regarding the semantic query extensions)

8.5       Query Language extensions

Deliverable D5.3
9     Geospatial Service Composition and Execution - Andreas Limyr, SINTEF
This chapter gives an introduction to the modelling of service compositions, with emphasis on how this can
also be applied to geospatial services. Some special considerations in modelling geospatial services are
discussed. The chapter also deals with how geospatial service compositions can be executed, based on the
execution of general service compositions.

9.1    Modelling Service Compositions

When developing software it is always necessary to have tools and methods; the more complex the
software, the more tools and methods are required. When it comes to developing software for Service
Oriented Architectures (SOA) (see appendix A.2), it is crucial to have tool support for web service discovery,
specification of web services, and composition of several web services. In this work we use the
COMET methodology [COMET 2006] to identify the use cases and roles involved in using the DE. By
analysing the use cases we are able to infer the components of the architecture, their interactions, and
subsequently their requirements.

The Eclipse platform [Gamma and Beck, 2004; des Rivieres and Beaton, 2006] is an integrated development
environment that is made to be extensible. Originally the platform was developed at IBM, but in 2001 it
was released into the open source domain. Since then, Eclipse has gained enormous momentum
and attracted numerous developers: commercial, academic, and even private. From the start Eclipse has been
a development environment for programming languages like Java and C, but recently the platform has been
extended with tools for graphical modelling, code generation, and web service development. The Eclipse
platform is well suited for integrating different tools in the SOA and web services domain, because its
extensible architecture reduces development time and because existing Eclipse extensions already address
the SOA and web service domain. Many of these extensions are themselves extendable.

Because there is a multitude of technologies and languages dealing with the different layers of the web
service architecture, development of web services often requires vast knowledge of many of these
technologies and languages. OMG's Model-Driven Architecture (MDA) [OMG, 2003a; Kleppe et al, 2003]
is a way of developing software with models at a platform-independent level; this gives the benefit of not
having to know all the details of the underlying technologies. The models on the platform-independent level
are then transformed, using model transformations, into something more platform-specific and technology-
dependent.

The Unified Modeling Language (UML) [OMG, 2003b], the de facto modelling language for software
development, can be used in the MDA context to create platform-independent models. It is therefore
possible to use both MDA and UML to help web service developers spend more time on the conceptual
parts of a web service than on the technical details.



SWING aims at developing tools and methods that can hide the complexity of the semantic web service
technologies. The SWING Development Environment will therefore use Eclipse, MDA and UML to realize
a development environment for semantic web services. This Development Environment will enable service
developers to annotate, discover, compose and execute web services. The idea is to bring together several
components with the Eclipse platform to get a seamless integration of different tools. Having a common
platform to operate from will hopefully reduce development time and costs as the developer only has to
consider a family of tools, instead of many separate and non-interoperable tools. It could also be interesting
to see if the SWING Development Environment can build on or even become part of existing Eclipse
projects. In this way the ideas and tools from SWING can reach a wide Eclipse audience. With the use of MDA
and UML, knowledge of the underlying semantic web service technologies is not required, because the
technological details are not exposed to the developer. This can also lower the entry barrier for future
web service developers unfamiliar with lower-level web service technologies.

A service composition is a specification of how existing (fine-grained) services can be combined into a more
coarse-grained service. A synonymous term used in SOA is orchestration, which indicates that there is some sort
of controller that decides the sequence in which the services are called and how their parameters and
results are handled and combined. A dedicated software tool, in this project called an execution
engine, can be configured to execute the service composition when needed. Service compositions may be
defined in a lexical or graphical notation. Graphical notations are often easier for humans, while machines
prefer textual ones; the textual descriptions can be used as scripts for process execution engines. There are several
proposed languages that can be used to model or specify different aspects of service compositions. These are
some of the most promising approaches:

•   UML 2.0 activity diagrams [OMG 2003b]. A graphical language which can be used to model control
    flow and data flow. It can be used quite freely, resulting in imprecise models with unclear semantics, e.g.
    by missing types for input and output parameter objects. However, it has been shown that UML activity
    diagrams can be used to model web service compositions [Skogan et al. 2004, Grønmo and Solheim
    2004] by providing guidelines and UML extensions. In [Hoff et al. 2005] we provide support for service
    composition modeling in UML 2.0.

•   WS-BPEL / BPEL4WS [OASIS 2004]. An XML notation that is the leading specification for executable
    web service compositions. It has been criticized for its lack of semantics [Wodman 2004] and for offering
    unnecessarily many syntactic ways to achieve the same logical constructions.

•   BPMN [BPMI 2003]. A graphical notation that has been shown to be at least as good as UML 2.0
    activity diagrams for modeling control flow patterns [White 2004]. Its graphical notation makes it
    intuitive to understand. However, in its current version it is unsuited for defining service compositions,
    due to the lack of precise data object and data flow modeling.

•   OWL-S [OWL Services Coalition 2004] and WSML [WSMO 2005]. These are the two leading semantic
    web service languages. Both are textual: OWL-S uses XML, and WSML uses a logical language. They
    are used to describe the semantics of Web services. In addition, they (at least OWL-S) also describe the
    composition itself and thus overlap considerably with BPEL4WS. A WSML orchestration describes which
    other Web services have to be used, or which other goals have to be fulfilled, to provide a higher-level
    service, while OWL-S does not model this aspect. WSML allows the definition of multiple interfaces
    and, therefore, choreographies for a Web service, while OWL-S only allows a single service model for a
    Web service, i.e. a unique way to interact with it. OWL-S uses a single modelling element for
    representing requests and services provided, while WSML explicitly separates them by defining goals
    and Web service capabilities. OWL-S does not explicitly consider the heterogeneity problem in the
    language itself, treating it as an architectural issue, i.e. mediators are not an element of the ontology but
    are part of the underlying Web service infrastructure.

Today, BPEL can be used to define service compositions, and many vendors provide BPEL design tools with
accompanying BPEL execution engines. Unfortunately, BPEL is an XML-based language, it only supports
static compositions, and it does not handle semantic web services.

The only tool that is available and under active development in the context of Semantic Web Service
composition is WSMX, which supports the WSML language. It is therefore a natural choice for our
execution engine (WP2). Since we have done previous work on using UML 2.0 activity diagrams, it is natural
to continue that development and extend the service development environment with semantic capabilities and
with the functionality to produce service compositions in WSML.

Figure 19 shows three different abstraction levels for modelling, describing, and executing compositions. At the
most abstract level we may use UML 2 activity diagrams for business process modelling. The high-level
processes involved are modelled without details about concrete services or data flow. The business process
model contains processes that can themselves be decomposed as service compositions, some of which may
be implemented manually and others automatically. At the service composition level we detail the processes
that can be automated and specify control and data flow and any necessary data mediation. This can be done
graphically using the Development Environment in UML 2.0, or textually in WSML using a text
editor. The idea is that we will implement support for translating the graphical composition into WSML
automatically. At the process execution level the WSML-based service composition is executed by the
WSMX execution engine.

The Visual Service Composition Studio (VSC) [Hoff et al, 2006] has been developed in the SODIUM
project (IST-FP6-004559) by SINTEF. It is a toolkit for creating, editing, storing, and loading service
compositions. It uses a variant of UML 2 activity diagrams, which is called the Visual Service Composition
Language (VSCL), to describe compositions. The main difference is that VSCL introduces a special
transformation node, which is not found in UML2. VSCL includes the following model elements:

     •   Task nodes, which are similar to action nodes in UML and represent a service operation.

     •   Initial nodes, which define the start of the execution path.

     •   Final nodes, which define the end of the execution path.

     •   Control flow, which represents a step in the execution path.

     •   Object flow, which represents a step in the data path, i.e. the path that a data object may take.

     •   Transformation nodes, which transform data objects from one type to another.

     •   Input and output pins, which represent the input and output parameters of a task.

     •   Decision nodes, which represent a choice in the execution path. A decision node may have
         several outgoing control flows, and the execution path is decided by guard conditions on these
         control flows.

     •   Merge nodes, which join the alternative execution paths.

     •   Fork nodes, which split a single execution path into several parallel paths.

     •   Join nodes, which combine and synchronize parallel execution paths.




[Figure: the graphical user interface of the Visual Service Composition Studio, with callouts for the
Composition Studio menu, the palette with available model element types, the visual editor with the
composition, the tree view of the composition, the Eclipse project view, the property view for the currently
selected model element, and the local dictionary with imported services, service operations and data types]
The figure depicts the graphical user interface of the Visual Service Composition Studio. The main
components are:

     •   The project view, which lists the projects and compositions that a user is working on. Eclipse is
         project-oriented and a composition is created within one project; hence, the user must create a
         project before creating a composition. The project view also offers standard functionality for
         deleting, renaming, and moving a project.

     •   The palette, which contains the model constructs from the Visual Service Composition Language.

     •   The visual editor, where the service composition is defined. The user creates a composition by
         selecting model constructs from the palette and placing them inside the composition.

     •   The local dictionary, which is a repository for storing service descriptions. The user may import
         service descriptions in WSDL and OWL formats. A service operation in the dictionary can be
         added to a composition by selecting the operation and dropping it onto the visual editor; a task
         will then be created with the selected service operation.

     •   The property view, which allows editing the properties of the currently selected model element.

9.2       Abstracting Geospatial Aspects

The Development Environment will be used to create web service compositions that include OGC web
services. This section gives a very brief introduction to OGC services and discusses the problems and
requirements that must be handled by the Development Environment and other parts of the SWING
framework in order to support OGC services.

9.2.1        A brief introduction to OGC web services

The Open Geospatial Consortium (OGC) has defined a set of specifications for geospatial web services. These
specify an infrastructure that provides the means to discover and visualize geospatial data. The specifications
define several kinds of web services. Some examples are:

•      Web Feature Services (WFS) which allow a client to retrieve and update geographical feature data. A
       geographical feature is an abstraction of a real world phenomenon, which is associated with a location
       relative to Earth [OGC, 2003].

•      Web Map Services (WMS) which allow a client to retrieve maps generated from geographical data.

The reader may find more information on OGC services in SWING deliverable D2.1 Geospatial Semantics
Web Services or in the OGC specifications34.
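For illustration, OGC services are typically invoked through plain HTTP requests. A minimal WFS
GetFeature request in key-value-pair form, assuming a hypothetical server address and feature type name,
might look as follows; the response is a GML document containing the matching feature instances:

    http://example.org/wfs?SERVICE=WFS&VERSION=1.1.0&REQUEST=GetFeature
        &TYPENAME=quarries&MAXFEATURES=10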

9.3       Modelling Geospatial Service Compositions

The SWING project will use OGC services, in particular WFSes, as parts of a composition. An example
taken from one of the pilot use cases for SWING is illustrated in Figure 18. This composition computes
aggregate production and consumption for departments (i.e., administrative regions in France) by using
existing WFSes and standard web services. The composition takes a bounding box as input, which defines a
geographical area, and looks up the set of departments inside the bounding box. For each department inside
the bounding box, it computes the production and consumption. The production is computed by looking up
the quarries inside the department and summing up the production of these quarries. The consumption is
estimated by finding the population density in the department and multiplying it with a constant. The
consumption and production estimates are then combined with the information about the department. The
output of the composition is a list of departments with their aggregate production and consumption.

The composition in Figure 18 selects data from two WFSes. One WFS contains information about
departments. The other contains information about quarries. In both cases a spatial operation is required in
order to determine which departments are inside the bounding box or which quarries are inside a department.
OGC has defined a standard for filter expressions that may be used for this purpose. A filter expression
defines a constraint similar to a WHERE clause in a SQL SELECT statement and is encoded in XML. It may
be sent as part of a request to a WFS, which will use the filter expression to select a particular set of feature
instances from its internal database.


34
     The OGC standards are available at the following URL http://www.opengeospatial.org/standards.

The filter expressions introduce several requirements for the SWING framework. Firstly, the Development
Environment must allow a user to associate a filter expression with a WFS service. A special dialogue
window where a filter expression can be entered will thus be required. Secondly, it must be possible to
represent a filter expression in a WSML orchestration so that a composition with filter expressions can be
translated into WSML. Thirdly, WSMX must be able to execute an orchestration with filter expressions.
When WSMX needs to invoke a WFS with a filter expression, it must create a request with an XML
representation for the filter expression and send the request to the WFS.
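To make this concrete, the following is a minimal sketch of a filter expression that selects the departments
whose geometry lies within a given bounding box. The msGeometry property name follows the example
composition; the coordinates are placeholders, and the exact encoding depends on the Filter Encoding
version supported by the WFS:

    <ogc:Filter xmlns:ogc="http://www.opengis.net/ogc"
                xmlns:gml="http://www.opengis.net/gml">
      <ogc:BBOX>
        <ogc:PropertyName>msGeometry</ogc:PropertyName>
        <gml:Envelope srsName="EPSG:4326">
          <gml:lowerCorner>46.5 0.5</gml:lowerCorner>
          <gml:upperCorner>48.5 3.0</gml:upperCorner>
        </gml:Envelope>
      </ogc:BBOX>
    </ogc:Filter>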

[Figure: composition diagram. A Bounding Box input is passed to a «WFS» Get Departments task, which
returns a list of msGeometry (boundary), department, and code values. A parallel region then computes
consumption and production for each department: a «WFS» Get Quarries task returns the list of quarries
with production, which a «WS» Add quarry production task sums up, while a «WS» Get Population task
looks up the population for the department's InseeCode and a «WS» Compute Consumption task derives the
consumption from it. A final Combine information/create instance node merges msGeometry, department,
code, production, and consumption into the Consumption Production Map output.]




                                       Figure 18 A composition with OGC web services

The response from a WFS also requires special treatment. A WFS response contains a set of feature
instances. These instances are represented in the Geography Markup Language (GML), which is an XML
encoding for geographical information. WSMX, on the other hand, represents the information space of a web
service by ontologies. The actual input and output parameters of a service are thus instances of concepts.
Hence, the WFS response cannot be handled directly by an orchestration. WSMX must first translate the
GML feature instances into instances of the ontology concepts used by the orchestration. In SWING this will
be achieved by using an intermediate application ontology. This application ontology will correspond to the
schema of the WFS. WSMX will translate the GML response from the WFS into instances of this application
ontology using an adapter. A mediator will then be used to map the instances of the application ontology into
instances of the ontology used by the orchestration.
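As an illustrative sketch of this two-step lifting, a GML quarry feature might be turned by the adapter into
an instance of the application ontology along the following lines (the concept and attribute names are
hypothetical):

    instance quarry_0042 memberOf app#Quarry
        app#code hasValue "45"
        app#allowedProduction hasValue 120000

The mediator would then map app#Quarry and its attributes onto the corresponding concepts of the domain
ontology used by the orchestration.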

OGC web services introduce other requirements as well. Traditionally, they have been REST-based. This
means that they only support plain HTTP GET and POST requests and not the SOAP protocol, which is used
by most web services. There are, however, efforts that consider SOAP and WSDL support for OGC web
services [OGC, 2003]. Still, these technologies are only covered by the WFS standard, and many WFS
implementations do not support SOAP. This means that WSMX must be able to invoke REST-based services
in order to execute compositions with OGC services.

The geospatial decision-making application will use web service compositions to create maps. This means
that the output of a composition must be visualized. This may be achieved in several ways. One alternative
would be to implement a visualization component in the application. The application would then invoke a
composition in WSMX and use the visualization component to portray the output.

Another alternative would be to create a WFS adapter for WSMX so that it could receive WFS requests from
an OGC-compliant client and invoke compositions based on such requests. The output of the composition
could then be visualized by an SLD-enabled WMS or some other OGC-compliant software. This adapter
could be realized in two different ways. One way would be to create a general adapter, which would create a
WSML goal from an incoming feature request. The goal could then be used to identify and execute an
appropriate composition. Alternatively, an adapter could be created for each composition. In this case, the
adapter would have a reference to its composition so that the composition could be invoked directly without
being discovered.
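As a rough sketch of the first variant, the general adapter could translate an incoming GetFeature request for
quarry data into a WSML goal such as the following (the goal IRI and the concept name are hypothetical):

    goal _"http://swing-project.org/goals#getQuarries"
        capability
            postcondition definedBy
                ?x memberOf brgm#Quarry .

The discovery component could then match this goal against the registered compositions.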

It is expected that a WFS adapter may be implemented with less effort than an internal visualization
component. Thus, this solution will probably be implemented in one of its two instantiations. However, the
SWING consortium has not made a final decision at the time of writing this document. The interface and
internal components of the geospatial decision-making application remain to be defined.

Unfortunately, the chosen solution may have an impact on the Development Environment, and some extra
functionality may have to be added. For example, a user of the DE may need to create a WFS adapter for a
composition and register it in the Catalogue. Since a final decision has not been made, it is not possible to
specify this functionality in this document.

9.4       Execution of Service Compositions

After modelling the composition in the Composition Studio, it should be possible for the user to test and
execute it. The composition is described using the Visual Service Composition Language (VSCL) [SWING
D6.1], which is a variant of UML 2 activity diagrams that, among other things, introduces a special
transformation node not found in UML 2.

The purpose of this module is to transform the composition represented in VSCL into an equivalent
composition represented as an Abstract State Machine (ASM) in the WSML format, which can be executed
by a WSMX server [Bussler et al, 2005]. This transformation was implemented using the MOFScript tool
[Oldevik and Olsen, 2006], which is an implementation of the MOFScript model-to-text transformation
language. MOFScript is an Eclipse-based text transformation tool and engine. It was developed as part of the
EU projects ModelWare35 and ModelPlex36. It includes an Eclipse lexical editor and a transformation engine
for generating text.



35
     ModelWare official website, http://www.modelware-ist.org/
36
     ModelPlex official website, http://www.modelplex-ist.org/
The following sub-sections describe details of the transformation for the different parts of the ASM. For
details about the representation of web services in WSML, please refer to [Roman et al., 2005].

9.4.1   Namespace References and Web Service Header

The namespace references are obtained from the semantic types of the input and output parameters of all the
tasks included in the composition. Figure 5 shows the dialogue Parameter Properties where the semantic
types of the parameters in a task are specified.




                         Figure 5 Parameter properties of a task in Composition Studio.

The composed web service name is obtained from the main task name in the composition and used to
complete the fields webService and capability. In the example in Figure 4 this is
“ConsumptionProductionMap”.

To define the precondition and postcondition one must check the semantic types of the first and last tasks in
the composition.
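Taken together, the generated header might look roughly as follows. The namespace IRIs and the pre- and
postcondition concepts are illustrative; only ConsumptionProductionMap is taken from the example:

    namespace { _"http://swing-project.org/composition#",
        adm  _"http://swing-project.org/adm#",
        brgm _"http://swing-project.org/brgm#" }

    webService ConsumptionProductionMap

    capability ConsumptionProductionMapCapability
        precondition definedBy
            ?in memberOf adm#BoundingBox .
        postcondition definedBy
            ?out memberOf brgm#ConsumptionProductionMap .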

9.4.2   Input and output parameters

All the input and output parameters of all the services in the composition should be listed in the state
signature as follows:

                                    (in|out) X withGrounding Y

where in or out declares whether the parameter is an input parameter or an output parameter, respectively.

•   X is a semantic type (e.g., INSEEgetPopulationByDepartmentRequest)

•   Y is the grounding address of a service (e.g., _http://swing.brgm.fr/cgi-bin/limitesadm?)
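Combining the pattern with these two examples, a fragment of the generated state signature might read as
follows (a sketch; whether the request and response share the same grounding depends on the service):

    stateSignature
        in  ins#INSEEgetPopulationByDepartmentRequest
            withGrounding _"http://swing.brgm.fr/cgi-bin/limitesadm?"
        out ins#INSEEgetPopulationByDepartmentResponse
            withGrounding _"http://swing.brgm.fr/cgi-bin/limitesadm?"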

The services in the composition can be obtained in the task properties (Service Operations) of each sub-task
in the composition (see Figure 6). The semantic types can also be obtained in the task properties (Parameter
Properties dialogue).
                           Figure 6 Service operations of a task in Composition Studio.

9.4.3   Transition Rules

To map the data flow in the composed service to transition rules in WSML one needs to loop over all the
sub-tasks in the composition until all of them are processed.




                      Figure 7 Iterations for the creation of the Transition Rules in use case 1.

For each iteration of this loop the following should be done:

1. Identify which sub-tasks can be processed in the particular iteration i.e. sub-tasks that only receive the
   input parameters of the whole composition or sub-tasks where all the input parameters are supplied by
   sub-tasks that were already processed. Figure 7 shows which sub-tasks are processed in each of the four
   iterations that are needed to create the transition rules for the exemplified composition.
2. Create a control state representing the parallel execution of the sub-tasks to be processed in the particular
   iteration.

3. Use the default values and the structure of the input and output parameters of the sub-tasks to be
   processed in the particular iteration to define the composition data flow in WSML.

As an example, for Iteration 3 the generated WSML code should be equivalent to the following:

forall {?controlstate} with (?controlstate[oasm#value hasValue oasm#State3]
        memberOf oasm#ControlState )
do

      /* Compute Consumption */

  forall {?constResp, ?const1, ?popResp, ?pop, ?code} with (
          ?constResp[brgm#return hasValue ?const1] memberOf
          brgm#SocioEconomicConstantsResponse and
          ?popResp[ins#INSEEgetPopulationByDepartmentReturn hasValue ?pop,
          adm#code hasValue ?code] memberOf
          ins#INSEEgetPopulationByDepartmentResponse)
  do
    add(_#[sso#Operand1 hasValue ?const1,sso#Operand2 hasValue ?pop] memberOf
        sso#MultiplyRequest)
    add(_#[adm#code hasValue ?code] memberOf sso#MultiplyResponse)

  endForall

  /* Add allowedproduction for all quarries of the department */

  forall {?collection, ?code} with (
          ?collection[adm#code hasValue ?code] memberOf wfs#FeatureCollection)
  do

      add(_#[soa#gml hasValue ?collection,
             soa#feature hasValue "exploitationsponctualsproduction",
             soa#property hasValue "AllowedProduction"] memberOf
             soa#AggregateRequest)
      add(_#[adm#code hasValue ?code] memberOf soa#AggregateResponse)

  endForall

  delete(?controlstate[oasm#value hasValue oasm#State3])
  add(?controlstate[oasm#value hasValue virtual#State4])

endForall


9.5    Execution of Geospatial Service Compositions

The idea of doing service composition on an abstract visual level is to lower the complexity for a developer.
To do this the Composition Studio creates high-level representations of all the different artefacts used in the
composition. This is performed by transforming an artefact, for instance a web service, into a UML
representation of the artefact. The UML representation contains enough information to create the service
composition along with a reference to where the original artefact is located.

The use cases in SWING [SWING D1.1] all rely upon Web Feature Services. A few functionalities had to be
added to the Composition Studio to be able to create compositions with Web Feature Services. Introducing
Web Feature Services into the Composition Studio means creating a UML representation of the Web Feature
Service. A regular web service typically has a set of operations that can be called by a client. Web Feature
Services do not have the same operation concept; they can almost be seen as a database with a web service
interface. Because Web Feature Services are in many ways different from regular web services, it is
necessary to accommodate the specification of some kind of filter expression or query in order to call them.

9.5.1   Import of Web Feature Service

The Composition Studio has a local repository for resources transformed into UML representations, called
the Local Dictionary. Using the import of service description menu, it is possible to import a Web Feature
Service. Figure 19 shows how to access this menu.




                                     Figure 19 Import of Service Description.

After activating the import service description menu item, a dialog window will appear, as seen in Figure 20.
To import a WFS, enter the base URL of the service. This will be used to parse the WFS description and
create a UML representation of the WFS in the Local Dictionary. The local name is used for better naming
of the UML resource.




                                    Figure 20 Enter a URL and a local name.

The transformation from WFS to UML is quite simple. Given the URL of the WFS, the WFS description
document is parsed and transformed into a UML package containing a UML class with a WFS stereotype.
Each of the features of the WFS is then transformed into a UML property contained in the UML class.
Table 6 shows how a Web Feature Service is represented in UML.


                        WFS                                                       UML
Web Feature Service                                       Class with a WFS stereotype
Feature                                                   Property with a Feature stereotype
                          Table 6 An overview of the mapping between WFS and UML.

9.5.2    Web Feature Service filter expressions

To be able to access the data inside a Web Feature Service it is necessary to specify some kind of filter
expression. The easiest way of specifying a filter is to select one feature type and retrieve all the information
contained in its features. It is, however, possible to add constraints to reduce the data set; these constraints
can for instance be spatial operators confining the geographical area.
                   Figure 21 WFS configuration inside a task with WFS as its service operation.

Composition Studio adds support for selecting one feature type and one spatial operator to create a filter
expression. This can be seen in Figure 21. The resulting filter expressions are not very complex, but they are
sufficient to create simple expressions that work.

Among all the geospatial services presented earlier, a distinction must be made in terms of how they are
involved in the SWING solution and how they interact with the WSMX platform.

Most of these services, namely WMS, WFS, WCS, WPS, can be considered as instances of web services that
are available to the WSMX platform to perform its tasks. From that point of view, the fact that those services
are geospatially enabled is not relevant; they are part of the whole set of web services that may be available
to the WSMX platform.

The catalogue service, however, has a special status in the sense that it constitutes a building block that
should interact very closely with the WSMX platform. It is not a resource available to WSMX to be part of
the service composition process, but it is the repository that will hold the metadata necessary for WSMX to
discover the services required for the composition process.

Indeed, the WSMX architecture relies on a repository that holds metadata about the various resources
(services, data, ontologies) needed for the WSMX platform to perform its tasks. This repository is queried,
according to the goals submitted to the WSMX platform, to find the appropriate resources that will be
composed to achieve the goals. Because of the specificities of geospatial service cataloguing, using the
OGC Catalogue as the metadata repository of the WSMX platform will enable WSMX to perform discovery
queries with specific OGC filters and to make better use of the metadata available in OGC services.

Finding the right interface between WSMX and the Catalogue is a complex task, and a final answer to this
question will be given in the last deliverable (D5.3: SWS Support – integration with the WSMX platform).
However, since many of the SWING processes rely on this WSMX-Catalogue interaction, an iterative
approach has been chosen, where a first implementation, with partial integration, will be delivered as part of
D5.1. This will enable other work packages to immediately interact with the Catalogue, while continuing
research towards a tighter integration between WSMX and the Catalogue.
9.6     The SWING execution life cycle




From the point of view of the catalogue, a typical SWING use case can be seen as a sequence of three main
phases: publication of service metadata, discovery of those services, and invocation of a composition that
involves those services.

9.6.1    Publication

During the publication phase, the annotation component (WP4) is responsible for creating annotated
metadata that will be stored in the catalogue. This is done by calling the catalogue operation responsible for
registering services, as defined in the OGC specification (namely, the Harvest or Transaction
operations).
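As a sketch, registering a WFS via the CSW Harvest operation could look like the following (the source
URL is illustrative):

    <csw:Harvest xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
                 service="CSW" version="2.0.2">
      <csw:Source>http://swing.brgm.fr/cgi-bin/limitesadm?SERVICE=WFS&amp;REQUEST=GetCapabilities</csw:Source>
      <csw:ResourceType>http://www.opengis.net/wfs</csw:ResourceType>
      <csw:ResourceFormat>application/xml</csw:ResourceFormat>
    </csw:Harvest>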

It has been agreed that the annotated metadata coming from the annotation engine will be encoded in a
WSML document. The current Ionic Catalogue supports the storage of ebXML metadata and can easily
store attached files. A first and easy-to-deploy solution is thus to keep the current mechanism of service
registration and store the WSML annotations as a whole-file attachment in the catalogue. As explained in the
next paragraph, this solves the main issue, i.e. providing the WSMX platform with the WSML service
descriptions needed to perform the reasoning.

However, it would be preferable to integrate those annotations more tightly into the catalogue. This implies
that the catalogue parses the WSML descriptions at insertion time, and that an internal data model is defined
for those WSML annotations that fits within the ebRIM data model of the catalogue, so as to have a finer-
grained representation of metadata in the catalogue and be able to exploit it better. Such a solution will be
studied in deliverable D5.2.

Short term: store WSML descriptions as-is in the catalogue.

Long term: WSML parsing and fine-grained storage of the semantic metadata in the catalogue.
9.6.2    Discovery

The discovery phase happens when some component (the MiMS environment from WP1, or the
Development Environment from WP6) needs to query the catalogue for a set of services that match certain
criteria. In this case, the catalogue is accessed through the OGC interface, using the GetRecords operation.

Again, as for publication, the SWING use case implies using the catalogue interface with added information,
in this case some semantic criteria in the query. The short-term solution is to slightly extend the catalogue
query language so that a WSML query can be attached to the OGC query. The catalogue implementation
also needs to be modified to extract the WSML query and forward it to the underlying WSMX platform,
while the catalogue uses the rest of the query to perform its standard search.
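A minimal sketch of such an extended query is given below; the swing:WSMLQuery extension element is
hypothetical and only illustrates the idea of attaching a WSML query to a standard GetRecords request:

    <csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
                    xmlns:dc="http://purl.org/dc/elements/1.1/"
                    service="CSW" version="2.0.2" resultType="results">
      <csw:Query typeNames="csw:Record">
        <csw:Constraint version="1.1.0">
          <ogc:Filter xmlns:ogc="http://www.opengis.net/ogc">
            <ogc:PropertyIsLike wildCard="%" singleChar="_" escapeChar="\">
              <ogc:PropertyName>dc:subject</ogc:PropertyName>
              <ogc:Literal>%quarries%</ogc:Literal>
            </ogc:PropertyIsLike>
          </ogc:Filter>
        </csw:Constraint>
      </csw:Query>
      <!-- hypothetical extension: semantic part of the query, forwarded to WSMX -->
      <swing:WSMLQuery xmlns:swing="http://swing-project.org/csw">
        ?service memberOf swing#QuarryDataService
      </swing:WSMLQuery>
    </csw:GetRecords>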

Again, although this short-term solution is a viable way to obtain results in the SWING project, it would be
much better practice to define a query language that coherently integrates the needs of the current catalogue
query language with the expressiveness needed to issue full semantic queries. This will be the subject of
deliverable D5.2, for which studies have already started, as described later in this document.

Short term: the WSML query is attached to the OGC catalogue query.

Long term: the OGC query language is extended to be able to express WSML queries, while keeping a
language syntax that fits within the OGC specification.

9.6.3    Invocation

Usually the discovery phase is meant to find the services needed to create a composition in the Development
Environment and store that composition in some dedicated repository.

Then comes the last phase involving OGC interfaces, i.e. when those compositions are invoked by the
mapping client that is contained in the MiMS environment. Since the mapping client is a regular OGC client,
it does not know about the specificities of a WSMX composition and knows only how to invoke OGC
services. Therefore, the WSMX composition must be hidden behind an OGC-compliant façade (the so-called
OGC adapter), which will achieve the mediation between OGC messages and WSMX calls.

Technically, this phase is not directly linked to the catalogue service (there is no interaction with the
catalogue); it is rather related to the fact that resulting WSMX compositions must expose an OGC interface,
and therefore belongs to WP5, which is responsible for dealing with the specificities of OGC interfaces.

Short term: only the WFS adapter is developed.

Long term: explore the feasibility of other adapters.

9.7     Metadata model

The metadata are stored in the current catalogue in a data model that is similar to ISO 19115.

As was stated above, the ebRIM model is flexible enough to represent any kind of data structure. At this
point in time, WSML annotations will be stored in the catalogue as a plain document attached to the
corresponding service instance metadata. We may, however, later need to store service metadata using some
other data model, for instance one that fits the expressivity needs of WSML, which shall be used to model
the annotations produced by the service annotation engine.
10 SWING Demonstration, Walkthrough the SWING Use Case - Arne J
   Berre, SINTEF, Marc Urvois, BRGM

In this chapter, we focus on the use case demonstration. After presenting the end-user context, we present the
MiMS application that allows the end-user to operate. Going into the use case, we explain the detailed
procedure. At the end, feedback is given from the end-user point of view.

10.1 The end-user context

From Description of Work

Major end-user groups:

        1/ Geospatial decision makers, citizens and providers => BRGM

        2/ Service developers for geospatial applications => SINTEF

        3/ Service developers for semantic services => SINTEF


10.2 The SWING end user application presentation

From DOW and D1.1 presentation => BRGM

10.3 The use case/show case

From D1.1 Use Case Definition => BRGM

10.4 Experience feedback

From D1.3 (experience report). => BRGM/SINTEF
11 Review and Outlook - Sven Schade, UOM
This chapter summarizes the assumptions and achievements and gives a critical view on the current state of
play. The main remaining areas for research (including semantic reference systems) are identified.
