									Conceptual modelling for domain specific
document description and retrieval
- An approach to semantic document modelling

                                     Terje Brasethvik
                                          IDI, NTNU
Organisations and individuals today are exposed to vast numbers of documents in their
daily work. Modern retrieval techniques, knowledge management systems, the
Semantic Web initiative and several related efforts all strive to improve information
sharing and to arrive at languages, methods and tools for semantic document retrieval.
Approaches from conceptual modelling have not been widely applied in these efforts.
This is somewhat disappointing, as these techniques possess properties and
mechanisms that appear intuitively useful in approaches to information sharing.
We propose a document retrieval system that brings together conceptual modelling,
controlled vocabularies and linguistic techniques in order to provide an approach to
semantic description and retrieval of documents. Particular to this system is the way
conceptual modelling and linguistic techniques are combined to create conceptual
models that serve as controlled vocabularies and are applied in the tasks of describing
and retrieving documents, within cooperative document management settings, where
the users themselves have to perform the document management tasks they need in
order to support their own activities. This is in contrast with traditional retrieval
systems, where users normally take on only the role of searchers. The cooperative
effort puts an extra demand on the users; in our approach, they have to participate
both in the definition of the domain vocabulary and in the classification of
documents. A fundamental assumption behind our approach is that this extra effort
can be demanded of users within a limited, controllable domain.
In our approach, a semantic modelling language is used to construct a domain model
of the subject domain referred to by the document collection. A natural language
analysis process is specified in order to support modelling over a domain specific
document collection. The constructed domain model is then actively used for the tasks
of document classification and search. Users may browse the domain model and
interactively classify documents by selecting model fragments that describe the
contents of the documents. Natural language tools are used to analyse the domain
document collection and propose relevant model fragments in terms of selected
domain model concepts and named relations. The proposed fragments are refined by
the users and stored as document descriptions in the system. For document retrieval,
lexical analysis is used to pre-process natural language search expressions and map
these to the domain model for model-based query-execution and subsequent
The approach is evaluated through a comparative perceived relevance evaluation,
where a set of test users is provided with a domain model and a set of query topics.
The domain model and query topics are defined from a real case example from the
domain of public Norwegian health services, as provided by KITH. Users were asked to
formulate both regular text-only queries and model-based queries for each query topic,
and then to rank the top 10 returned documents for relevance. Two publicly available
search engines were used for comparison. Query scores were calculated for each
search strategy (model-based and regular) for both search engines. The results show a
tendency for model-based retrieval to perform better for queries that are generic in
nature, and in general to produce more specific results than regular text-only search. However,
the performance of the approach is dependent on the quality of the domain model and
domain model lexicon.
To Marte, Oskar and Jesper
This thesis is submitted to the Norwegian University of Science and
Technology (NTNU) in partial fulfilment of the requirements for the
degree doktor ingeniør. The work has been carried out at the Information
Systems Group (IS-gruppen), within the department of Computer and
Information Sciences (IDI), under the supervision of professors Arne
Sølvberg and Jon Atle Gulla. Parts of the work were done while on a
shorter research stay at the Centre de Recherche en Informatique (CRI),
Université de Paris 1 - Sorbonne, with professor Colette Rolland.
I thank my supervisors for their patience with an atypical Dr.Ing student,
for introducing me to this varied and interesting topic, for support,
comments and discussions, and for opening doors. I would not have
been here without them.
In addition to the supervisors, several others have read, commented and
offered insightful discussions regarding this work. In particular, I would
like to thank Sari Hakkarainen and Csaba Veres for their efforts at a
crucial stage.
Thanks to Colette Rolland, Georges Grosz, Camille Salinesi and the rest
of the members of CRI for hosting my visit to Paris and providing the
environment I needed to produce my first publications and for
introducing me to the Parisian style of work and life.
The prototype realisation described in this thesis would not have
existed if it weren’t for the priceless help of colleagues and students.
Thanks to Audun Steinholm, Harald Kaada, Thomas Gundersen and
Hallvard Trætteberg. Special thanks to Arne Dag Fidjestøl for his
continuous contributions to the implementation and around-the-clock
problem solving.
Thanks to Iver Nordhuus and Bjørn Buan at KITH for their interest and
for providing the needed material to perform the experiments described
in this thesis.
Thanks to Torbjørn Nordgård and Arild Faxvåg for their interest in my
work, for introducing me to the worlds of computer linguistics and
health informatics respectively and for welcoming me in the Digimed
Thanks to all fellow members, past and present, of the IS-group for
providing the needed environment and atmosphere for this work.
         Dum Dum Boys
         “Nå er nå”
Table of Contents

1. Introduction
 1.1.   Background
 1.2.   Objective
 1.3.   Approach
 1.4.   Major contributions
 1.5.   Outline of the thesis

2. Theoretical background and State-of-the-art
 2.1.   Sharing of documents
   2.1.1.   Common information spaces
 2.2.   Semantic document descriptions
   2.2.1.   The language problem in document retrieval
   2.2.2.   Defining the domain – constructing the vocabulary
   2.2.3.   Applying the vocabulary
 2.3.   Principles for an approach
   2.3.1.   Support communication of meaning
 2.4.   Research issues in document retrieval
 2.5.   Information Retrieval techniques
   2.5.1.   Information Retrieval models
   2.5.2.   The query process
   2.5.3.   Query formulation (notation support)
   2.5.4.   Result presentation and query refinement
 2.6.   Natural Language Processing and IR
 2.7.   The Semantic Web
   2.7.1.   Resource Description Framework (RDF) and Schema (RDF-S)
   2.7.2.   Meta data statement construction
   2.7.3.   Examples of RDF applications and tools
   2.7.4.   RDF querying
 2.8.   Ontologies
   2.8.1.   Ontology languages
   2.8.2.   Ontology tools
   2.8.3.   Ontology based document retrieval
   2.8.4.   Summing up
 2.9.   Knowledge Management Systems
 2.10.  Summing up

3. Semantic modelling of documents – the approach
 3.1.   The Cooperative Semantic Document Retrieval Model
 3.2.   A conceptual modelling based approach
   3.2.1.   The use of conceptual modeling
   3.2.2.   The role of automation
   3.2.3.   Textually grounded model concepts
 3.3.   Overall specification of the approach
   3.3.1.   Constructing Domain Models from Document Collections
   3.3.2.   Classifying Documents with Conceptual Model Fragments
   3.3.3.   Retrieving Documents with NL Queries and Model Browsing
 3.4.   Possible modes of application
 3.5.   Overview of approach chapters

4. The Referent Model Language
 4.1.   The referent model language - introduction
   4.1.1.   RML foundations
   4.1.2.   Class concepts, individual concepts and attributes
   4.1.3.   Attributes
   4.1.4.   Relation concepts
   4.1.5.   Abstraction mechanisms – concept classification
   4.1.6.   Composition of concepts and relations
 4.2.   The RML - Meta Model
   4.2.1.   Construct definition
 4.3.   RML Models and model fragments
   4.3.1.   A model and a model fragment
   4.3.2.   Defining model views
 4.4.   Summing up

5. RML based document classification and retrieval
 5.1.   Referential semantics of model fragments
 5.2.   Referential semantics of RML constructs
   5.2.1.   Class concept
   5.2.2.   Generalisation operations
   5.2.3.   Binary Relations
   5.2.4.   Relation constraints
   5.2.5.   N-ary relations
   5.2.6.   Composed relations
 5.3.   Classification as instantiated model fragments
 5.4.   RML as query language
   5.4.1.   Query formulation
   5.4.2.   Model based query transformations
   5.4.3.   Result presentation and query refinement
 5.5.   Mapping from RML fragments to RDF statements
   5.5.1.   An RDF meta model
   5.5.2.   Using RDF in the RML retrieval system
   5.5.3.   Constructing a mapping
   5.5.4.   First trial: Direct mapping
   5.5.5.   Second trial: Constructing RML classes in RDF
   5.5.6.   The example revisited
   5.5.7.   Applying the mappings
 5.6.   Summing up

6. Linguistic analysis for semantic document modelling
 6.1.   Domain model lexicon
 6.2.   The document analysis process
   6.2.1.   Process results
   6.2.2.   Extraction
   6.2.3.   Language detection
   6.2.4.   Text cleaning
   6.2.5.   Part of speech tagging
   6.2.6.   Lemmatiser
   6.2.7.   Reduction (Filter stop words & word-classes)
   6.2.8.   Phrase detection
   6.2.9.   Frequency analysis (Counting and Selection)
   6.2.10.  Manual interactive analysis
 6.3.   Text to model matching
 6.4.   Evaluating the analysis

7. The Prototype Realisation
 7.1.   Components in the realisation
 7.2.   The modelling environment
 7.3.   Architecture
 7.4.   The CnS client
   7.4.1.   The example from Company N
   7.4.2.   Classification
   7.4.3.   Retrieval
 7.5.   Linguistic Workbench
   7.5.1.   Controller and component design
   7.5.2.   Result query interface
   7.5.3.   Interface for defining the domain model lexicon
 7.6.   Summing up

8. Evaluating the approach
 8.1.   Aspects of retrieval evaluation
 8.2.   Evaluation approach
   8.2.1.   Scope
   8.2.2.   Limitations
 8.3.   The Retrieval trial – set up and preparations
   8.3.1.   Comparative search-engines
   8.3.2.   CnS Client set-up, modifications
   8.3.3.   The domain model
   8.3.4.   User selection and preparation
   8.3.5.   The Query set
   8.3.6.   Scoring and calculations
 8.4.   Results
   8.4.1.   Overall results
   8.4.2.   Uncertainty factors
   8.4.3.   Results in detail
   8.4.4.   User feedback
 8.5.   Summing up

9. Concluding remarks

Appendix A.   Detecting relation names
Appendix B.   Evaluation: Query topic descriptions
Appendix C.   Evaluation: Scoring sheet
                                                       1. Introduction

Project groups, communities and organisations today use the Web to
distribute and exchange information. In their daily work, they create and
manage huge amounts of documents. Web technology - or other kinds of
intranet technologies - have enabled quick and easy publishing and
distribution of these documents. While the Web makes it easy to publish
documents, it is much more difficult to find an efficient way to organise,
describe, classify and present documents for the benefit of later retrieval
and use.

1.1.   Background
A document under production is guarded by its authors. Its meaning
and purpose are closely related to the process it is a part of. It is often
subject to “local” distribution, for example shared via e-mail among its
authors. When a document is published onto the company’s intranet,
however, it is usually removed from its local context and its
semantics may not be evident to a user searching for information.
Document descriptions are needed to capture and communicate the
meaning of documents and to enable efficient retrieval. Unfortunately,
basic web or intranet technologies do not provide any mechanisms for
describing and organising documents. Furthermore, this task is often
left as an exercise for the document author himself. In intranet settings,
there are no librarians or library tools available. Hence, the shared
fileservers and intranets of many organisations are hosting a mass of
poorly organised and described documents, not easily retrieved but
easily forgotten.
The Semantic Web initiative works towards inferring semantic
descriptions of web resources, in order to enable semantic applications
and services. Work is in progress for creating standards for meta-data
representations and corresponding semantic specifications based on
constructed and shared ontologies. The focus is to arrive at machine-
readable descriptions that enable integration of data and offer formal
semantics and reasoning support for intelligent applications.
Within traditional IR systems, there is no explicit notion of a document
description; the description of a document is performed through
automatic text analysis and computation of an index.

Semantic modelling of documents                                          1
Figure 1.1: Document descriptive meta-data enables sharing of documents

Retrieval is performed by calculating a similarity measure between the
query and document terms. In a general sense, such methods perform
well, but shortcomings exist. Consider for example a search for the
definition of computer supported cooperative work (“CSCW”). Some
possible outcomes of such a search are:

   No documents contain the term “CSCW”.

   More likely, in case of the Web, several thousand documents
    contain the term “CSCW” and many different definitions are

   Many documents contain the term “CSCW”, but no explicit
    definition is found, and the term is used in several different ways.
Several issues hamper the task of information retrieval, particularly the
immense variation inherent in natural language and the vast differences
in users’ backgrounds and in the motivation behind a search.
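
The term-matching behaviour described above can be made concrete with a small sketch. The following Python fragment is an illustration only and is not taken from any particular IR system: it assumes raw term-frequency vectors and cosine similarity as the similarity measure, and the two documents, their identifiers and the query are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(query, document):
    """Cosine of the angle between raw term-frequency vectors."""
    q, d = Counter(tokenize(query)), Counter(tokenize(document))
    dot = sum(q[term] * d[term] for term in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical two-document collection, invented for illustration.
documents = {
    "doc_a": "CSCW is used here without any definition",
    "doc_b": "Computer supported cooperative work studies how groups use computers",
}

# Score every document against a plain keyword query.
scores = {
    name: cosine_similarity("CSCW definition", text)
    for name, text in documents.items()
}
```

With this collection, doc_a receives a high score because it repeats the query terms, although it defines nothing, while doc_b scores zero because it spells the abbreviation out. Both effects stem from matching on surface terms rather than on meaning.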
Within a company or a community, sharing of information is a
prerequisite for cooperation. In such settings, a collaborative effort can
be applied in order to communicate semantics of information between
interested participants. As with the Semantic Web, the motivation
behind explicit document descriptions is to enable users to explicitly
define the semantics of their information for the benefit of retrieval and

2                                                                             Introduction
More specifically:

   Semantic document descriptions are intended to communicate
    the meaning of information between users. To some degree, this
    requires users to arrive at some kind of shared interpretation of
    the domain referred to by the information object. Within a limited
    domain, it is possible to engage users in a cooperative activity to
    explicitly define the semantics of the domain.

   The user publishing the document must in an efficient manner be
    able to express this interpretation of the document in a document
    description. The description must be able to capture and transfer
    the interpretation of the document with respect to the shared
    semantics of the domain.

   The shared semantics of the domain must be equally suitable for
    expressing the information need of the retriever. It should allow
    both for quick and easy browsing as well as for careful and
    precise query formulation and subsequent query refinement.

   To be practically useful, the document descriptions must be
    suitable for formalised treatment and reasoning within some kind
    of document management system or tool. In the settings
    described above, where there are no librarians available, this
    should be a tool targeted at the users.
The important aspect of semantic document descriptions is the ability
to represent document semantics and to carry these semantics between
stakeholders, from publisher to receiver. This puts a
strong requirement on the language in which we represent and visualise
these document descriptions. In cooperative settings, the meta-data
descriptions are provided by the users themselves for the benefit of
other users.
Conceptual models are widely used in information systems
development. The models serve a dual role, both as vehicles to facilitate
communication between users and as a (semi-)formal
representation of system requirements that can be subject to formal
reasoning and serve as direct input for system implementation. An
emerging trend is that these models are no longer only used for the
design of a system, but are also to an increasing extent used to actually
access and manage the information resident in the final information
Concept or semantic modelling languages are a particular kind of
conceptual modelling language intended to represent the semantics of
information by defining a conceptualisation of the domain that the
information refers to. These are visual languages often with a formal
basis stemming from set theory. Currently conceptual models are

successfully applied in information management tasks, in particular for
managing the structural data found in most information systems using
database technology. So far, they have not been successfully applied to
the kind of unstructured data found in web-documents, even though the
properties listed here indicate that they are intuitively suitable for the

1.2.   Objective
The research goal behind this thesis is to explore how semantic
modelling languages can be used in an approach to creating semantic
document descriptions. More specifically, the objectives are:

   To explore and understand the requirements for document
    management in such intranet settings as those outlined above.

   To investigate how a given semantic modelling language can be
    used to represent the semantics of documents.
and then:

   To investigate if the semantic modelling language can be used
    directly in a tool that will assist the users in their semantic
    classification and retrieval of documents.

Figure 1.2: Semantic modelling as basis for creating document descriptions

1.3. Approach
The main investigation is directed at better understanding the
requirements for situated semantic document descriptions. Our starting
point is previous research at the Information Systems group on
languages, tools and methods for conceptual modelling. We will apply
our knowledge about semantic modelling languages and explore the
idea that these languages intuitively exhibit qualities that make them well
suited for such settings. The main deliverable is a proposal of an
approach to document description and retrieval that is based on
semantic modelling (Figure 1.2) as well as an evaluation of this
We will have to answer the following basic questions:

   1. What are the requirements for semantic document descriptions?
   First, we need to understand and define the concept of semantic
   document descriptions. For this, we will follow a theoretical
   approach. We will conduct a survey of existing and alternative
   approaches to the task and explore some of the theoretical
   discussions from related areas. We are interested in the general
   principles behind such situations and the overall requirements that a
   general approach to the problem has to fulfil.

   2. Can we design an approach that exploits the capabilities of semantic
      modelling languages?
   Based on the above, we will then design an approach to semantic
   document descriptions. In defining the approach, we will exploit our
   knowledge about semantic modelling, as well as semantic modelling
   languages, tools and techniques. In addition to using requirements,
   we will define the approach by applying it to real-world examples.
   The approach will be defined in detail by using system models and
   by specifying a system architecture to support it and realising a
   prototype implementation.

   3. Is the approach suitable for actual use?
   We will evaluate the approach against a real-world example. We are
   not capable of implementing a full-scale system and empirically
   examining its performance within an organisation, but will perform a
   verification experiment where we apply our approach with a domain
   model and documents from a real case, and compare our approach
   to existing systems in the domain.

1.4.       Major contributions
The main achievement in this work is the development and specification
of an approach to semantic document classification and retrieval. In
more detail, the main contributions of this thesis are:

   An approach to using a semantic modelling language as the basis
    for creating document descriptions. The semantic modelling
    language is used to create a model of the domain referred to by
    the documents. This domain model is then used directly in the
    interface of a tool for document classification and retrieval.
    Documents are classified by way of - possibly instantiated -
    fragments of the domain model. Regular text-only queries are
    normalised against the domain model before being executed on
    the stored document classifications, and queries may be refined
    by interacting directly with the domain model.

   As a necessary and integral part of the approach, we formalise the
    use of a semantic modelling language and its properties for the
    tasks of document description and retrieval.

   In order to enhance our approach and to facilitate the users’
    interaction with the system, we specify a way of supporting our
    approach by using lexical analysis in several ways:

       -    The created domain model is supplied with information
            from a lexical dictionary in order to support automatic
            matching of the document text against the model.
            Based on this automatic matching, the system presents
            the user with an initial suggestion for the classification
            of the document at hand.

       -    When computing the suggestion for a document
            classification, the system selects sentences from the
            document that contain concepts found in the domain
            model. We propose an algorithm for suggesting names
            for relations in the domain model based on a linguistic
            analysis of these sentences.

   A product of our approach is the possibility to use the domain
    model for visualizing the semantic content of a document or a set
    of documents. This is naturally used in classification and retrieval
    of documents, but we also exploit this in what we denote an
    enhanced document reader, where a user is presented with
    semantic information from the model while reading the document
    in a standard web-browser.

   We propose an architecture for a system to support our approach.
    Parts of the system are realized in a prototype.

   We present the results from an evaluation experiment that compares
    retrieval relevance measures achieved by our approach against
    two existing retrieval systems. The experiment is carried out on a
    real-world domain model and document collection.
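
The lexical support listed in the contributions above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation described later in the thesis: the lexicon entries, the function names (match_concepts, suggest_classification, search) and the overlap-based ranking are all hypothetical, and the proposed algorithm for naming relations is not reproduced; the sketch only selects the sentences such an analysis would start from.

```python
import re

# Hypothetical domain model lexicon: concept name -> word forms that signal it.
LEXICON = {
    "Patient": {"patient", "patients"},
    "Referral": {"referral", "referrals", "refer", "referred"},
    "Hospital": {"hospital", "hospitals"},
}


def tokenize(text):
    """Lowercase the text and split it into words (Norwegian letters included)."""
    return re.findall(r"[a-zæøå]+", text.lower())


def match_concepts(text):
    """Return the domain model concepts whose word forms occur in the text."""
    words = set(tokenize(text))
    return {concept for concept, forms in LEXICON.items() if words & forms}


def suggest_classification(document_text):
    """Suggest concepts for a document and pick the sentences that mention any."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", document_text) if s.strip()]
    concept_sentences = [s for s in sentences if match_concepts(s)]
    return match_concepts(document_text), concept_sentences


def search(query, classifications):
    """Normalise a text query to concepts, then rank stored classifications.

    Ranking by the number of shared concepts is an assumption of this sketch.
    """
    wanted = match_concepts(query)
    hits = [
        (len(wanted & concepts), doc_id)
        for doc_id, concepts in classifications.items()
        if wanted & concepts
    ]
    return [doc_id for _, doc_id in sorted(hits, key=lambda h: (-h[0], h[1]))]


# Usage: the suggestion would be refined by the user before it is stored.
text = "Patients are referred to the hospital. The form is short."
concepts, concept_sentences = suggest_classification(text)
classifications = {"doc1": concepts, "doc2": {"Hospital"}}
result = search("referrals of a patient", classifications)
```

The same lexicon serves both directions here: it maps document text to a suggested classification and maps a free-text query to the concepts used for lookup, so inflected forms such as "referred" and "referrals" end up on the same model concept.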
Document management is a complex task involving several important
components. The work presented in this thesis is focused on semantic
document classification and retrieval. This implies that we have chosen
not to focus on several aspects of document management that are of
equal importance and which must be considered in a holistic approach.
In particular, our work is subject to the following limitations and

   Our work is based on manual classification of documents
    performed by a user. However, our search strategy assumes the
    existence of proper retrieval machinery that may be used to
    search the stored document descriptions and for ranking the
    found documents. The specification and development of such
    machinery is not a part of our work.

   We approach semantic classification within a limited domain. Our
    main assumption is that it is feasible to create a domain model
    that is somehow related to the documents from this domain. This
    implies that unlike traditional IR machinery, we are not aiming to
    handle the whole web in general, nor highly dynamic
    environments where both the domain and the corresponding set
    of documents are rapidly changing.

   Much work on document management is done in the areas of
    Office Information Systems or Workflow Management Systems.
    Such work relies on contextual meta-data about the document for
    the purposes of routing documents according to the workflow or
    for supporting document production. We recognise the strong
    need for contextual meta-data attributes for such and other
    document management tasks but contextual meta-data is not a
    part of the work presented here.

   Several issues from Digital Library research are not covered by
    this thesis, such as preservation of digital content, protection of
    copyrights and legal issues, standard document formats,
    persistent identification of online documents, etc. Again, such
    issues are considered beyond the scope of this thesis.

1.5. Outline of the thesis
This thesis is divided into four parts:

   Part I: Introduction & state of the art

     -   Chapter 1: Introduction

     -   Chapter 2: Theoretical background & state-of-the-art

   Part II: The approach

     -   Chapter 3: Semantic modelling of documents; the
         overall approach

     -   Chapter 4: The Referent Model Language

     -   Chapter 5: Referential semantics of the Referent Model
         language for document description and retrieval.

     -   Chapter 6: Lexical analysis for semantic document modelling

   Part III: Evaluation of the approach

     -   Chapter 7: The prototype realisation

     -   Chapter 8: Evaluation

   Part IV: Extensions and conclusions

     -   Chapter 9: Conclusions and further work

               2. Theoretical background and State-of-the-art

                  “The effectiveness of a system for accessing information is a direct function
                                                     of the intelligence put into organizing it.”
                                                                             (Svenonius, 2000)

The main goal of our approach is to support sharing of documents
within a cooperative setting. Our idea is to apply a conceptual modelling
language in a tool where users themselves may define the semantics of
the domain in a domain model and use this model directly in a
cooperative effort to describe and retrieve documents. The first part of
this chapter describes the theoretical background that motivates such
an approach (Sections 2.1-2.3).
Document retrieval has been a focus of research communities for
more than 50 years. Much work was done in the early years on
statistically based information retrieval; this work has been revitalised
and has gained renewed attention following the arrival of the World Wide Web.
Most commercial IR systems build on the principles from these early
approaches. For smaller and more limited settings than the whole web,
advanced IR techniques have entered into the area of corporate
knowledge management systems. Still, users are not presented with
adequate tools to actively participate in describing the semantics of
information for the benefit of information sharing in cooperative
settings. With statistical IR, indexing and classification take place
automatically behind the scenes and users are left with a relatively
simple keyword interface to retrieval. Methods and techniques from
library science, such as thesauri and controlled vocabularies, enable
such active participation in classification and retrieval, but in the general
case, these tools have been targeted towards skilled librarians and not
towards ordinary users.
The Semantic Web initiative is targeted towards introducing semantic
descriptions of data on the web in order to enable intelligent
applications, including information retrieval. Work is in progress on
creating standards for metadata representation and corresponding
semantic specifications based on constructed and shared ontologies.
The general idea is similar to ours, i.e. to enable users to describe the
semantics of their information themselves, but the realization of the
semantic web is focused towards machine-readable semantic
descriptions that enable intelligent applications and services. Sections
2.4 – 2.9 of this chapter describe a wide range of related work, from
traditional information retrieval approaches to knowledge management
systems and emerging semantic web technology.

2.1.    Sharing of documents
An organisation today is exposed to a vast number of documents. In
general, documents are produced or collected by users during everyday
work situations and are often closely related to “local tasks” such as the
daily activities within project work, as part of ongoing studies or
processes, or even as a result of the interests of workers. During these
everyday work situations, the documents are organised and guarded by
their authors or users – a situation we may call “local management” of
documents. The need for putting an effort into organising these
documents arises when the number of documents becomes large
enough to cause a problem even locally, or the moment these
documents are to be shared with actors outside the local management
To support sharing of documents across the organisation, the
documents will have to undergo some kind of publishing process where
they are removed from their local context and made available to the
outside audience. Harmze and Kirzc (1998) describe such a
publishing process for “the electronic age”:

    •   Modularization: Capture the information and convert it into a
     suitable storage format for presentation in the new setting.

    •   Description: Describe the information for the benefit of later
     retrieval and use. Document descriptions may contain a range of
     information about the document, depending on the situation and
     the functionality of the document management system.
     Document descriptions are often referred to as metadata (Weibel,
     1995; W3C-Metadata, 1997) – we will define metadata in more
     detail later.

    •   Reading functions: Provide advanced reading functions such as
     searching, browsing, automatic routing of documents, notification
     or awareness of document changes or the ability to provide
     comments or annotate documents. These reading functions give
     added value for the reader and are made available through the
     basic web technology.
Emphasised in this definition is that sharing of documents is not just a
matter of making documents available. True sharing of documents
implies some effort to facilitate later use and reuse of the document –
outside its original context. As we can see from the definition above,
such a publishing effort includes several steps. In many ways, this is
similar to a “library process”. However, in today’s intranet and web
settings, there are no librarians or library tools available. Both the
publishing process, and later also the retrieval process, have to be
performed by the users themselves. In most cases, users are presented
with tools that address only the first of these steps, i.e. the capture and
storage of a document onto the web.
A document description contains metadata. Metadata (Arms, 1995;
Aberer & Read, 1998; Weibel, 1995), literally data about data, is
defined in digital library work as:
         “(…) metadata include data associated with either an information
           system or an information object for purposes of description,
           administration, legal requirements, technical functionality, use
           and usage, and preservation.”
                                                        (Baca, 1998, p.37)
Metadata is a widely used term, and in general, metadata may refer to
any property of the document being described. Which properties of
the document need to be described depends on the setting
and the desired functionality. In the fusion of digital libraries and Web
settings, the most common standardisation initiative is the Dublin Core
proposal (Weibel et al., 1995)1 for a “core set of elements” (Borgmann,
2000) proposing 15 basic elements of a description. It is interesting to
note that even though the Dublin Core proposal specifies elements such as
“subject and keywords”, “description” and “author or creator”, the proposal
and accompanying guidelines still argue that any users or applications
of these have to define a “local scheme” that defines the semantics of
these elements and how they are to be used.
On the Web in general, the “Semantic Web” (Berners-Lee et al., 2001;
W3C-SW, 2003) and the Resource Description Framework (RDF)
(W3C-RDF, 1999) are the most potent initiatives at the moment. We will
describe these in Section 2.7.
In our work, we find it useful to divide a definition of document
descriptive metadata into “Contextual” and “Semantic” metadata:

     •   Contextual metadata: The context of an item is information
      related to but not necessarily contained in the item. In our
      definition, contextual metadata describes any contextual property
      of the document, such as its author, title, modified date, location,
      owner, etc., and should in general strive to link the document to
      any related aspect such as projects, people, groups, tasks etc.

     •   Semantic metadata: Information describing the subject or the
      “intellectual” content of the document. This can also be a
      multitude of information such as selected keywords, written
      abstracts and text indices. In cooperative settings or in
      organizational memory approaches, free-text descriptions,
      collectively created annotations or attached communications and
      discussions may also be used.
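
As an illustration only, the division into contextual and semantic metadata could be represented with a simple record structure. This is a minimal sketch: the class names (`ContextualMetadata`, `SemanticMetadata`, `DocumentDescription`), the field names and the sample values are assumptions made for the example and are not prescribed by the approach described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextualMetadata:
    """Properties related to, but not necessarily contained in, the document."""
    author: str
    title: str
    modified: str                                      # modified date as plain text
    location: str
    related: List[str] = field(default_factory=list)   # projects, people, tasks


@dataclass
class SemanticMetadata:
    """Information describing the subject ("aboutness") of the document."""
    keywords: List[str] = field(default_factory=list)
    abstract: str = ""
    annotations: List[str] = field(default_factory=list)


@dataclass
class DocumentDescription:
    contextual: ContextualMetadata
    semantic: SemanticMetadata


# Hypothetical sample description (all values invented for illustration).
description = DocumentDescription(
    contextual=ContextualMetadata(
        author="A. Author",
        title="Project status report",
        modified="2003-05-12",
        location="intranet/projects/status.doc",
        related=["Project X"],
    ),
    semantic=SemanticMetadata(keywords=["status", "milestones"]),
)
```

Keeping the two kinds of metadata in separate records mirrors the distinction drawn above: the contextual part can often be filled in automatically, while the semantic part involves human judgement.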


Our approach to semantic document retrieval is focused towards the
semantic metadata. Describing document semantics is part of what is
denoted the “intellectual access” functions of a document management
system. Still – the “aboutness” of a document is a subjective property
and intellectual access to documents often involves some degree of
human judgement (Borgmann, 2000).
The document publishing process we have described is part of a
cooperative effort. In essence, sharing of documents is about
cooperation. In fact, sharing of information is viewed as a necessary
prerequisite for cooperative activities (Grudin, 1994). Based on this, we
define the settings we are looking into as “Cooperative Document
Management”, a phrase specified in detail by Nakata et al. (1998) and
Voss et al. (1999a; 1999b). Cooperative document management
denotes the settings where cooperating users themselves perform the
document management tasks they need in order to support their own
activities. In other words, the users are the performing agents of both
the publishing and retrieval tasks of document management. Document
management in general is an integral part of a company’s Organisational
Memory (Ackerman, 1994; van Heijst et al., 1998) and Knowledge
Management (Spek and Spijkervet, 1997; Borghoff & Pareschi, 1998)
efforts (Sullivan, 2001). More specifically, we may see that the
publishing process defined above is a part of the knowledge acquisition
effort of a knowledge management scheme. Therefore, we extend the
audience of our cooperative document management setting to include
also users of organisational memory and knowledge management
systems.
Intranet systems are developed, functionally specified and designed
according to the specific information needs of their users. They normally
support a varied menu of news bulletins and information about ongoing
events, contain pointers to specific information in support of activities
and recurring tasks, and offer a number of information retrieval and
discovery facilities. An important aspect of intranets is that they are in
most cases tailored to the information needs of their users.
Intranets can be considered a kind of “organisational level groupware”
or knowledge management system. Within an intranet setting, as
opposed to the World Wide Web in general, we may expect to find:

    •   Slightly less heterogeneity in documents. Even if there is a
     huge number of documents on a company intranet, they will not be
     quite as varied in quality and content as those on the web itself.

    •   Domain-limited documents. In an intranet setting, assuming more
     control over the document collection, it will be possible to identify
     the documents limited to the specific domain of interest.

    •   More motivated users. While the average search effort from a user
     on the web in general is limited (the average query is about 4
     words, few users bother to use the advanced search interface, and
     most users expect to find relevant information within the first 10
     or 20 displayed documents), users that operate within a work
     setting may be more motivated to make an effort in their search,
     in particular if the information they are looking for is a necessity
     for their work activities.

    •   A limited and known user base. Within an intranet setting it is
     possible to construct systems with higher complexity, and introduce
     these in the same manner as any other information system. For
     example, users can be trained in order to cope with a more
     advanced search interface.

2.1.1.      Common information spaces
In the literature from the field of Computer Supported Cooperative
Work (CSCW), sharing of information is considered part of maintaining a
common information space:
         “Here, focus is on how people in a distributed setting can work
           cooperatively in a common information space - i.e. by
           maintaining a central archive of organizational information with
           some level of ‘shared’ agreement as to the meaning of this
           information (locally constructed), despite the marked differences
           concerning the origins and context of these information items.”
                                               (Schmidt and Bannon, 1992)
Following this definition, an approach to sharing of information has to
provide support at both the “system” and “human” levels: At the system
level, we have the “maintaining of a central archive of organisational
information”. At the human level, we need to provide means for arriving
at “some level of ‘shared’ agreement as to the meaning of this information”.
Such a shared agreement on the meaning of information is “locally
constructed”. Reaching a shared agreement in meaning is the result of a
process within the group. This is a process of interpretation, dialogue
and negotiation. Different actors will have their own interpretation with
respect to the meaning of information, which during the process must
be made explicit and communicated to the rest of the group.
Interpretations are negotiated and related to each other until the desired
level of shared agreement is reached (Bannon and Bødker, 1997).
In maintaining a common information space, the perceived
meaning of information manifests itself in the way information is
collected, organised, described and presented to the users. For a
document classification and description system to be applied in a
cooperative setting, this raises two fundamental requirements:

    •   The classification and description scheme must be tailorable to
     the domain and the usage situation in which it will be applied.

    •   The classification and description scheme must enable the users
     themselves – or the user community – to construct (define,
     communicate and negotiate) the semantics of the classification
     scheme.

2.2. Semantic document descriptions

         More often than not people are searching (digital libraries) to learn
           “about” something rather than trying to obtain a specific, known
                                                             (Borgmann, 2000)
         “When I use a word”, Humpty Dumpty said in a rather scornful
           tone, “it means just what I choose it to mean – neither more nor
           less.”
                                   Lewis Carroll, Through the Looking-Glass
Semantics of information is a difficult topic to handle in computers.
Semantics of information is subject to human interpretation. Documents
contain symbols and words. What is needed to represent the meaning of
a document is something that “can bridge the gap from linguistic inputs
to the kind of non-linguistic knowledge” that is needed to perform tasks
that deal with meaning (Jurafsky and Martin, 2000). In library systems,
this is the task of subject representation languages (Svenonius, 2000).
While semantics rarely can be uniformly or universally represented, the
important thing in cooperative settings is to reach some level of shared
agreement that enables precise communication about information. A
main aspect here is the ability to make representations of meaning
explicit within the community, in order to discuss and reason about
various interpretations. While complete agreement on semantics, even
within a limited domain, seems almost impossible, an explicit
representation at least enables some level of collective awareness of the
intended meaning. Again, in library systems the role of a thesaurus is to
define and agree on a set of terms and their relations that supports
clarification of a topic and assimilation of related information. In
information systems development, this is the role of conceptual
modelling languages. Our approach is based on a combination of the
two.

2.2.1.      The language problem in document retrieval
The general principle behind a retrieval and classification system based
on document descriptions is shown in figure 2.1. The purpose of a
document description is to communicate the semantics of the document
in order to facilitate its retrieval. The document description is provided
by a user and is an explicit representation of his view of the semantics of
the document in question. As with any kind of knowledge representation
(Gruber, 1995), this semantic description is subject to the user’s
perceived conceptualisation of the subject domain, i.e. the domain
referred to by the document. In the same way, a document query is an
explicit representation of the information need of the information-hungry
user. Again, the query is expressed based on his interpretation of the
subject domain.
Both the document description and the query expression are
represented in language. The language problem (Furnas et al., 1987) or
communication problem in a document retrieval system arises when the
descriptions and the queries do not match, i.e. when the descriptions
and the queries are expressed on the basis of different interpretations of
the subject domain – referring to different concepts, using different
words to denote the same concepts or using the same words to denote
different concepts – i.e. “they do not use the same language”. This
problem is illustrated with the question mark in figure 2.1.
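
The language problem can be made concrete with a small sketch. The function name `matches` and the sample terms are assumptions chosen for illustration; the point is only that exact term matching fails when the describing user and the searching user choose different words for the same concepts.

```python
def matches(description_terms, query_terms):
    """Return the terms shared by a description and a query (exact match only)."""
    return {t.lower() for t in description_terms} & {t.lower() for t in query_terms}


description = ["automobile", "maintenance"]   # terms chosen by the describing user
query = ["car", "repair"]                     # terms chosen by the searching user

print(matches(description, query))            # prints set(): same concepts, different words
print(matches(description, ["Automobile"]))   # prints {'automobile'}
```

The first call returns no shared terms even though both users have the same subject in mind, which is the mismatch that a controlled vocabulary is meant to reduce.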
Our approach to describing the semantic content of documents comes
from the use of controlled vocabularies and subject languages
(Svenonius, 2000) from library science.

       “[Controlled vocabularies] are constructs in an artificial language,
          their purpose is to map users’ vocabulary to a standardized
          vocabulary and to bring like information together. (…)
          Vocabulary control is the sine qua non of information
          organisation. (…) A natural language cannot be used to organize
          information effectively, because its synonymy and homonymy
          would cause scatter and clutter.”
                                                               (Svenonius, 2000, pp. 88-89)

  Figure 2.1: The language problem in retrieval. Queries will only find what the user intends if they are
  posed in the same language as the document description (or index). Even within limited domains, there is
  less than a 20% chance that two different users will use the same terms to denote the same thing.

Subject languages are “artificial languages”, designed for the special
purpose of describing what documents are about, so that information
about these subjects can be retrieved. The terms in a subject
language are controlled, i.e. they are carefully selected and defined in
order to distinguish and “remove” much of the ambiguity one finds in a
natural language, for example with respect to polysemy and homonymic
clashes. Terms in a subject language are often defined in thesauri
(rather than in the simple lexical dictionaries used for words in a natural
language) and are very often given a semantic definition in connection
with the document collection they describe. For example, the subject term
“Rock stars” will in a document collection refer to all indexed documents
about rock stars, and not necessarily all rock stars of the real world.
With the use of controlled vocabularies, the act of describing a
document amounts to translating the user’s interpretation of the
document into a proper description in the subject language. Likewise,
when formulating a query, a user must find a representation for his
information need in the words of the subject language – i.e. a query
expression. However, even though the words in a controlled vocabulary
are selected on the basis of their semantic content, this semantic
interpretation is normally not visible in the classification and retrieval
tasks. The interpretation of the words in the vocabulary is still left as “an
exercise for the user”. Clearly, the subject language must be constructed
and presented in such a way that it facilitates these user tasks.
As mentioned, preferred terms for vocabulary control are often defined in
a thesaurus:
         the·sau·rus n
         1. a book that lists words related to each other in meaning, usually
            giving synonyms and antonyms
         2. a dictionary of words relating to a particular subject
         3. a place in which valuable things are stored2
The distinction between a dictionary and a thesaurus is given by:
         A dictionary is a listing of words and phrases giving information
           such as spelling, morphology and part of speech, definitions,
           usage, origin and equivalents in other languages.

         A thesaurus is a structure that manages the complexity of
           terminology and provides conceptual relationships, ideally
           through an embedded classification. A thesaurus may specify
           descriptors authorized for indexing and searching. These
           descriptors form a controlled vocabulary.
                                                                               (Soergel, 2003)

    2 The word thesaurus itself derives from Greek and Latin words meaning “a treasury”, a place in which
    stores of wealth are kept (Merriam-Webster, 2003).

For retrieval, a thesaurus may be used in indexing, searching, or both.
When the thesaurus is used for indexing, it serves as the main
mechanism for vocabulary control, i.e. it defines the preferred terms to
be used for a given topic. When it is used for searching only, it is used
simply to suggest query terms to the user, especially synonyms and
narrower terms, and in general to help formulate the query:
      By providing the searcher with links and associations between
        terms that may be surprising, thus stimulating for further
        thought, the main purpose of the searching thesaurus is to
        increase the searcher’s perception and cognition and to
        elaborate and clarify the search formulation.
                                                   (Lykke Nielsen, 2002)
In library science, a thesaurus is considered a generic vehicle that
supports several functions in retrieval (Soergel, 2003):

   •   Providing guidance in the conceptual analysis of a search topic
   •   Browsing the classification structure in order to identify useful
    terms for search at the desired level of specificity
   •   Mapping from the user’s query terms to the authoritative or
    preferred terms used in indexing
   •   Supporting meaningful grouping and presentation of results
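
Two of these functions, mapping a user's term to the preferred term and finding terms at a narrower level of specificity, can be sketched as follows. This is a toy illustration under stated assumptions: the tables `USE` and `NARROWER`, the function names `preferred` and `expand`, and all sample terms are hypothetical and do not reproduce any particular thesaurus standard or tool.

```python
# Hypothetical thesaurus: entry terms (synonyms) point to a preferred term,
# and each preferred term may list narrower terms.
USE = {"car": "automobile", "auto": "automobile", "lorry": "truck"}
NARROWER = {"vehicle": ["automobile", "truck"], "automobile": ["sports car"]}


def preferred(term):
    """Map a user's term to the preferred (authorised) indexing term."""
    term = term.lower()
    return USE.get(term, term)


def expand(term):
    """Return the preferred term followed by all of its narrower terms."""
    result = [preferred(term)]
    for t in result:                      # the list grows while we walk it
        for narrower in NARROWER.get(t, []):
            if narrower not in result:
                result.append(narrower)
    return result


print(preferred("Car"))    # prints automobile
print(expand("vehicle"))   # prints ['vehicle', 'automobile', 'truck', 'sports car']
```

Used for indexing, `preferred` enforces vocabulary control; used for searching, `expand` suggests additional query terms to the user.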

Furthermore, a well-developed thesaurus may have several additional
purposes in a knowledge organization system (Aitchison, Gilchrist &
Bawden, 2000; Soergel, 1996):

   •   Aiding in the general understanding of a subject area
   •   Providing definitions of terms
   •   Acting as terminological standard in writing articles, instructions,
    standards, etc.
   •   Supporting content analysis and text data mining
   •   Supporting machine translation

2.2.2.    Defining the domain – constructing the vocabulary

      “The terminology of a domain is obtained by explicating the natural
        language phrases of the domain”
                                                          (Hudon, 1998)
We are interested in letting the users define and use their own
domain-specific vocabulary for this task. Thus, we are looking for
mechanisms to construct an explicit semantic representation of the
subject domain. Such a process of defining a vocabulary is depicted in
figure 2.2. In a cooperative setting such as the one we are looking to
support, this is a process resembling that of “social construction of
reality”, described for example in (Gjersvik, 1993).
Such explicit conceptualisations are expressed, created or “designed” as
needed by the collaborating group (Gruber, 1995), as illustrated in figure
2.2. Again, this is a process of interpretation, representation and
communication. We find it to be at this level that the local construction
of meaning referred to earlier takes place (Schmidt and Bannon,
1992; Bannon and Bødker, 1997).
To support such a process, we turn to conceptual modelling languages –
languages that can support the creation of an explicit representation in
terms of a domain model (Bubenko et al., 1997; Halpin, 1995; Hull &
King, 1986). Furthermore, to be useful in an actual approach to
document retrieval, the modelling language and the resulting domain
model must be directly useful in the document description and retrieval
system, i.e. it must support the two processes of classification and
retrieval.
To be able to classify a document semantically by way of concepts from
a controlled vocabulary, the users must be familiar with this vocabulary.
When a document is classified by a selection of concepts, the semantic
interpretation of these concepts should be apparent to the user, and
should reflect her view of the documents to be classified. Likewise, to
retrieve documents by using a controlled vocabulary, users must enter a
query string containing - or referring to - the concepts whose
interpretation reflects their information retrieval goals. Our claim is that
to aid the users in these tasks, the applied vocabulary should be
apparent and directly visible in the provided tools.

                       Figure 2.2: Domain model construction

2.2.3.        Applying the vocabulary
As mentioned, thesauri in information retrieval can be used to support
searching or indexing or both, as well as several other purposes.
Likewise, the existence of an explicit domain model may also serve several
purposes in information organisation. In this thesis, however, we are
interested in the use of this model to support classification and retrieval
of documents.
The process of classifying documents by way of words from the
vocabulary is a process of selecting the words from the vocabulary that
most accurately reflect the user’s interpretation of the document
semantics. The resulting description will have to be translated into an
appropriate storage format and stored as a library card in the system.
The process of document retrieval amounts to forming a query
expression, based on words from the vocabulary, that reflects the user’s
information need. The query expression is then matched with the stored
document descriptions, and matching documents are returned,
preferably in a ranked list. The user is then presented with the options
of browsing through the returned hits, refining the search expression
(rephrasing the search), searching again within the result set of
documents, or starting all over again by creating a new search
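
The classification and retrieval steps described above can be sketched in a few lines. This is a minimal illustration, not the prototype presented later in the thesis: the names `VOCABULARY`, `library_cards`, `classify` and `retrieve` are hypothetical, and ranking by the number of shared vocabulary terms is an assumption chosen for simplicity.

```python
VOCABULARY = {"thesaurus", "metadata", "ontology", "retrieval"}
library_cards = {}   # document id -> set of vocabulary terms


def classify(doc_id, terms):
    """Store a description; only terms from the vocabulary are accepted."""
    unknown = set(terms) - VOCABULARY
    if unknown:
        raise ValueError(f"not in vocabulary: {sorted(unknown)}")
    library_cards[doc_id] = set(terms)


def retrieve(query_terms):
    """Return ids of matching documents, ranked by number of shared terms."""
    query = set(query_terms)
    scored = [(len(query & terms), doc_id) for doc_id, terms in library_cards.items()]
    hits = [(score, doc_id) for score, doc_id in scored if score > 0]
    hits.sort(key=lambda h: (-h[0], h[1]))   # best score first, ties by id
    return [doc_id for _, doc_id in hits]


classify("doc1", ["thesaurus", "retrieval"])
classify("doc2", ["metadata"])
classify("doc3", ["retrieval", "ontology", "metadata"])

print(retrieve(["retrieval", "metadata"]))   # prints ['doc3', 'doc1', 'doc2']
```

Refining a search then amounts to calling `retrieve` again with a rephrased set of vocabulary terms.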

2.3. Principles for an approach
Library science has been a topic of interest for more than 3000 years.
Electronic document retrieval has been the focus of research for more
than 30 years. Internet search engines have been around almost as long
as the net itself, at least ever since “Archie” and “Veronica”, denoted the
grandparents of search engines (Sonnenreich, 1999)3, arrived in the early 1990s.
To enter an area that is covered by such a
vast amount of both research and commercial activities, it is important
to keep a clear mind about one’s principles and focus. If not, our quest
may very well become a never-ending story. In the following, we will
describe the principles behind our approach:

2.3.1.        Support communication of meaning
Our approach to semantic document descriptions must include support
for construction of a domain model. The domain model is intended to
capture the semantics of the domain, in a human readable manner. The
domain model does not represent some federated database schema of
the domain or simply some shared structure for storing metadata.


Ultimately, the domain model will approach knowledge management
aspects, as illustrated in the ideal goal for subject languages:

      “Using these to retrieve information provides a value-added quality,
        which, in the case of highly refined languages can transform
        information into knowledge.”
                                                  (Svenonius, 2000, p127)

In other words, this domain model will serve as a semantic map of
information from this domain.

This yields further requirements:

    •   Human understandable

     The chosen modeling language and representations must support
     the users in their semantic construction of the model. This
     requirement may seem like stating the obvious, but it is worth
     pointing out, since it has been the subject of some debate in the
     metadata community, and in particular in the ontology community,
     whether the descriptions should be human readable or tailored
     towards the internal reasoning and mechanics of the retrieval
     machinery only. The latter is seen in some approaches from
     ontology-based information management systems.

    •   Constitute an explicit representation

     The model must be made available and visible in the user tool for
     organising and retrieving documents, in order to serve as a
     semantic map of available information. The immediate availability
     and usability of the model is necessary in order to stimulate the
     actual use of the model. If the domain model is considered
     applicable and directly useful, this will increase users’ awareness
     of the model and thus potentially enhance the users’
     understanding of the model over time.

    •   Interface standard IR machinery
     IR machinery and search engines have been available for several
     years. With the arrival of the web and company intranets, most
     companies and organisations will have some kind of IR machinery
     available. Our approach is not intended to replace traditional IR
     machinery, but rather to build a user-centered model-based tool
     on top of existing machinery. We have to devise a model-based
     retrieval method that is compatible with the retrieval process of IR
     machinery.

   •   Interoperability
    As interoperability is one of the reasons for introducing metadata
    in the first place, this is a strong requirement. Each domain
    participating in our heterogeneous group may have its own
    interpretation of information. Hence, we must be able to relate
    different views and cope with their existence in the interface for
    expressing metadata and performing search. In order to export
    information also to “the outside world”, the metadata should be
    exportable along with the set of documents to be transferred.
    Means to accomplish this should be included in the modeling
    language and the supporting tool and will encompass the ability
    to relate concepts across several models, to “connect” models in
    a hierarchical modeling approach, and to export models into
    several interchange formats.
    As we do not want to limit our descriptions to a certain set of
    domains, the basic modeling constructs and abstraction
    mechanisms of our modeling language must be of a general kind.
    With a generic set of basic constructs, the modeling language
    should nevertheless be able to generate rich and detailed

2.4. Research issues in document retrieval
We will base our approach on the following constituents:

   •   Construction of a domain-specific vocabulary (i.e. the domain
    model) by way of a conceptual modeling language.

   •   User-based classification of documents, based on the domain
    model.

   •   Traditional IR machinery for the actual retrieval task

   •   A model-based interface for query refinement and result
Based on these principles, we will present the following research areas
in the remainder of this chapter - the state of the art:

   Information retrieval techniques (Section 2.5).

   Natural language processing in IR: NLP techniques have been
    applied in IR systems for several purposes: to improve the
    indexing process and the query evaluation or presentation, as well
    as to lift retrieval to a higher semantic level. NLP techniques in
    IR systems are presented in section 2.6.

Semantic modelling of documents                                            21
    The Semantic Web: The semantic web can be considered a large
     scale cooperative library effort, in the sense that users are asked
     to describe their documents or web-resources, in order to support
     more specific retrieval and application building over the web’s
     resources. Technologies for the semantic web are presented in
     section 2.7.

    Ontologies: Ontologies are today the main vehicles for defining the
     semantics of information in distributed, heterogeneous
     information systems. Ontology languages are presented as
     mechanisms to define a formal vocabulary of concepts for
     information discovery and exchange. Ontologies also play an
     important role in the semantic web initiative. Ontologies,
     languages and tools are presented in section 2.8.

    Semantic-based document management within an intranet setting
     may be considered a part of a knowledge management scheme.
     Many current knowledge management applications are based on
     advanced retrieval machinery. As we have pointed out, internal
     retrieval settings are after all more limited than the full web, and it
     may be feasible to apply more advanced but resource-intensive
     techniques. Knowledge management systems are presented in
     section 2.9.

2.5. Information Retrieval techniques
Classic Information Retrieval (IR) techniques are at the heart of most of
the retrieval and search facilities today, both on the internet and in
a variety of company intranets and document management solutions. IR
was defined as early as the 1950s as:

      Information retrieval is the name of the process or method whereby
         a prospective user of information is able to convert his need for
         information into an actual list of documents in storage
         containing information useful to him
                     (Mooers, 1950), as cited in (Savino & Sebastiani, 1998)
A more recent definition is given in (Baeza-Yates & Ribeiro-Neto, 1999):
      Information retrieval deals with the representation, storage,
         organisation of, and access to information items. (…) [IR] is
         more concerned with retrieving information about a subject than
         with retrieving data which satisfies a given query. (…) the
         retrieved objects might be inaccurate and small errors are likely
         to go unnoticed.
In its classic and still most common form IR deals with text-based
retrieval at a document level. Increased focus on retrieval from free-text
sources has led to a specialisation into areas such as Information
Extraction, Knowledge Extraction and Text Mining.

Information extraction is concerned with the task of extracting factual,
structured data from free-text documents:

         Information extraction may be described as the task of populating a
            structured information repository from an unstructured or free
            text repository (Gaizauskas, 2002)

Knowledge Extraction and Text Data mining techniques are concerned with
the generation of new knowledge based on data or information extracted
from text documents:

         Mining implies extracting precious nuggets of ore from otherwise
           worthless rock. (...) the goal of mining is to discover new
           information from data (Hearst, 1999).

(Hearst, 1999) further elaborates the distinction between text data
mining and IR, text categorisation and other approaches that apply
somewhat similar techniques. The important distinction is that text data
mining should discover knowledge from the document collection that
was not known in advance, and not simply respond to the need of an
information-hungry user.
Our interest in IR lies within the classic IR definition, i.e. text based
retrieval of documents, and thus the presentation in the sequel is kept
within this area.
The definitions of IR emphasize to some extent the vague nature
inherent in IR; the task is to retrieve information that is somewhat useful
with respect to the user's information need. There is vagueness both with
respect to what is considered "useful" and even to the "information need"
of the user, a need that may be unclear to the user himself at the start
of a retrieval process. Even so, classic IR mechanisms and machinery
are highly formalized.

2.5.1.       Information Retrieval models
An information retrieval model is defined as a quadruple (Baeza-Yates &
Ribeiro-Neto, 1999):
         [D, Q, F, R(qi,dj)] where
         –   D is a set composed of logical views for the documents in the
             collection
         –   Q is a set composed of logical views for the user information
             needs (queries)
         –   F is a framework for modeling document representations,
             queries, and their relationships
         –   R(qi,dj) is a ranking function which associates a real number
             with a query qi ∈ Q and a document representation dj ∈ D.
             Such ranking defines an ordering among the documents with
             regard to the query qi.

Within text retrieval, the logical views of both queries and documents are
extracted from or otherwise created on the basis of the words found in
the text, i.e. the logical view is created from index terms.
Figure 2.3 shows the three common IR models and their current
extensions:

    The Boolean Model is a simple model based on Boolean algebra
     where an index term is simply weighted as either present or
     absent (1, 0) in a document. Queries can be formulated using
     logical operators such as NOT, OR, and AND. The ranking function
     for the Boolean model simply determines a document as relevant
     (1) or non-relevant (0) for a given query; there is no internal
     ranking of relevant documents in the basic Boolean model.

    The Vector Model is an arithmetic model. Both documents and
     queries are viewed as vectors of weighted terms, and the ranking
     function uses arithmetic operations to compare the query and
     document vectors. Various weighting schemes and vector
     comparison calculations may apply. Most commonly though,
     index terms are weighted by the tf*idf scheme and
     the ranking function calculates the cosine of the
     angle between the document and query vectors. Such a ranking
     function assigns a number to the degree of similarity between the
     document and a query, so that the retrieved documents can be
     ranked for relevance.

     The tf*idf scheme calculates weights by multiplying the frequency
     of a term in the document (tf) by the inverse document
     frequency (idf). The idf factor adjusts the raw frequency of a term
     by the number of documents the term occurs in; a term that
     occurs with a high frequency in all documents is not necessarily a
     good index term, while a term that occurs in only one document
     can be used to distinguish that particular document from the rest
     of the collection. Since queries are by nature much shorter than
     documents, a slight modification of the tf*idf formula is used for
     weighting the query terms.

                                           Figure 2.3

        Common information retrieval models and extensions (Baeza-Yates & Ribeiro-Neto, 1999)

   The Probabilistic Model takes a somewhat different approach
    from the other two. The intuition is to be able to transfer ranking
    calculations into probabilistic calculations. The query process is
    then viewed as a process of specifying the nature of an ideal set of
    documents (the relevant ones) and the ranking function then
    becomes the task of calculating the probability that a given
    document is considered relevant for the user. The probability
    calculations in turn are based on the existence or non-existence of

In general, the vector model is the most widely used. The Boolean model
is intuitive and simple and provides a logical approach to query
formulation. However, the inability to rank retrieved documents for
relevance is a major drawback. The probabilistic model requires too
many initial assumptions that are difficult to estimate for the required
probability calculations, and is still mostly applied in research and
research prototypes.
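
The difference between the Boolean and the vector model can be illustrated with a minimal sketch. The toy collection, the whitespace tokenizer and the function names are our own assumptions for illustration, and the query is weighted with the same tf*idf formula as the documents, rather than with the slightly modified query weighting mentioned above.

```python
import math
from collections import Counter

# Toy collection; the texts are made up for illustration only.
DOCS = {
    "d1": "conceptual modelling of documents",
    "d2": "retrieval of documents with index terms",
    "d3": "conceptual modelling languages",
}

def tokenize(text):
    return text.lower().split()

def idf(term, docs):
    """Inverse document frequency log(N / n_t); 0.0 for unseen terms."""
    n_t = sum(1 for text in docs.values() if term in tokenize(text))
    return math.log(len(docs) / n_t) if n_t else 0.0

def tfidf_vector(tokens, docs):
    """Map each term to its raw term frequency multiplied by idf."""
    counts = Counter(tokens)
    return {t: tf * idf(t, docs) for t, tf in counts.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def boolean_and(query, docs):
    """Boolean model: relevant (1) only if every query term is present."""
    q = set(tokenize(query))
    return sorted(d for d, text in docs.items() if q <= set(tokenize(text)))

def vector_rank(query, docs):
    """Vector model: rank documents by cosine similarity to the query."""
    q_vec = tfidf_vector(tokenize(query), docs)
    scores = {d: cosine(q_vec, tfidf_vector(tokenize(t), docs))
              for d, t in docs.items()}
    return sorted((d for d in scores if scores[d] > 0),
                  key=lambda d: -scores[d])
```

For the query "conceptual modelling", boolean_and returns d1 and d3 as an unordered set of hits, whereas vector_rank places d1 before d3, since d3 also contains a rare term that the query does not mention.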

As shown in figure 2.3, all three basic models are extended into
several variants. A presentation of all of these is beyond the scope here,
but the Latent Semantic Indexing (LSI) extension to the Vector model
deserves attention, since it performs retrieval on automatically detected
concepts rather than terms.
Traditional indexing is based on the terms found in the text, and all the
basic IR models assume independence of these index terms. LSI
(Furnas et al., 1988), however, applies singular value decomposition
(SVD) to the document-term association matrix in order to
index on concepts latent in the document texts. The latent concepts are
discovered automatically, based on the mathematics and
statistics behind singular value decomposition. SVD reduces the
dimension of the document-term association matrix such that different
terms that seem to be used within the same context in the documents
are "grouped" to form a concept. All indexing and querying take place in
this concept space rather than on individual terms. SVD is a technique
from linear algebra that is also a central component in regular Data
Mining tasks.
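
In standard linear-algebra notation, the dimension reduction can be written as follows. The symbols are our own choice and do not appear in the cited sources: A is the term-document matrix with t terms (rows) and N documents (columns), k is the number of latent concepts retained, and q is a query expressed as a vector over the t terms.

```latex
% Full singular value decomposition of the t x N term-document matrix
A = U \Sigma V^{T}

% Rank-k approximation: keep only the k largest singular values
A_k = U_k \Sigma_k V_k^{T}, \qquad k \ll \min(t, N)

% A query q is mapped ("folded in") to the same k-dimensional concept space
\hat{q} = \Sigma_k^{-1} U_k^{T} q
```

Each document corresponds to a row of V_k, and ranking is then performed as in the vector model, for example by the cosine between the mapped query and the document's row in V_k.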

The claim presented by LSI advocates (Goldberg et al., 2000) is that this
transforms retrieval from a syntactic space (of terms) into a semantic
space (of concepts), and thus generally improves retrieval quality. This
has yet to be confirmed in a general sense, but the approach seems
promising (Baeza-Yates & Ribeiro-Neto, 1999). For example, the
technique offers improved retrieval in multi-language settings, in
situations with short and/or noisy documents, as well as in connection
with text data mining efforts (Telcordia, 2003).

Most common web search engines like Google and Alltheweb apply a
variant of the Vector model. These engines not only operate with
term-based indexes, but also create indexes on whatever meta data they
can extract from a document, for example its title, domain, location,
language, etc. Such meta data may be added to the query directly and is
also used in ranking. For example, the occurrence of a query term in a
document's title or URL can be used to indicate a higher relevance than
if found in the document body. An interesting and well-known example of
an alternative to the basic relevance ranking of the vector model is the
"page-rank" algorithm applied by Google:

         PageRank relies on the uniquely democratic nature of the web by
           using its vast link structure as an indicator of an individual
           page's value. In essence, Google interprets a link from page A to
           page B as a vote, by page A, for page B. But, Google looks at
           more than the sheer volume of votes, or links a page receives; it
           also analyzes the page that casts the vote. Votes cast by pages
           that are themselves "important" weigh more heavily and help to
           make other pages "important."
                                                          (Brin & Page, 1998)
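
The voting idea in the quotation can be sketched as a simple power iteration. This is a simplified textbook formulation and not Google's actual implementation; the damping factor of 0.85, the number of iterations, the handling of pages without outgoing links and the four-page link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict page -> rank; the ranks sum to 1.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A, C and D all "vote" for B; B votes for C.
LINKS = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["B"]}
ranks = pagerank(LINKS)
```

In this graph B receives the most votes and ends up with the highest rank. C receives only one vote, but since that vote is cast by the "important" page B, C still ranks well above A and D.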

2.5.2.        The query process
Figure 2.4 shows an overview of the retrieval process as found in
modern retrieval systems. The figure distinguishes between those steps
that are performed by the user (shown as an ellipse) and those that are
performed by the retrieval system (shown as a rectangle). In brief, the
steps may be defined as follows:
     Query Formulation: The user expresses her information need in
      the query language or notation offered by the retrieval system.
     Query Transformation: Most retrieval systems perform certain
      transformations on the user query before evaluating it against the
      index, in order to improve the matching. Example transformations
      include:
         - Normalisation: If the index of the IR system is normalised
            in any way, say for encoding, language variants, based
            on stem/canonical forms, etc, the query terms will also
            have to undergo the same normalisation before


      - Default expansion: Some search engines use default
         expansion of the query by adding terms to the query.
         For example, if stemming is not used in the index, all
         conjugated forms of a query term may be added. Terms
         to add may also be extracted from a synonym list, or
         generated from statistical analysis of the text corpus
         that is indexed. Some search engines, such as Google,
         also remove stop words from their queries, i.e. they
         apply a default reduction of the query.

      - Phrasing/Anti-phrasing: Phrases in the query can be
         recognised and treated as a unit, for example names
         such as "Mario Lemieux" or compound terms such as
         "Conceptual Modelling Techniques". Web based search
         engines often allow the user to phrase terms themselves
         by using the double quote notation (cf. table 2.1).
         Anti-phrasing refers to the removal of "noise phrases" from
         the query; for example, in the query "give me information
         about Mario Lemieux", the phrase "give me information about"
         does not contain any important semantics for the query and
         could even disturb query evaluation.
After the appropriate transformations are applied, queries are executed
against the indexes and documents are retrieved. The result set is
analysed and presented to the users.
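
The query transformation steps can be sketched as a small pipeline. The word lists below are hypothetical and hard-coded for illustration; a real system would derive them from dictionaries or from the indexed collection. Matching is done on plain substrings, which is a simplification.

```python
import re

# Illustrative lists; all entries are assumptions, not taken from any engine.
NOISE_PHRASES = ["give me information about", "i am looking for"]
KNOWN_PHRASES = ["mario lemieux", "conceptual modelling techniques"]
STOP_WORDS = {"the", "a", "an", "of", "in"}

def normalise(query):
    """Lower-case and collapse whitespace, mirroring index normalisation."""
    return re.sub(r"\s+", " ", query.lower()).strip()

def anti_phrase(query):
    """Remove noise phrases that carry no semantics for the query."""
    for noise in NOISE_PHRASES:
        query = query.replace(noise, " ")
    return normalise(query)

def phrase(query):
    """Split into units: known phrases stay whole, the rest become terms."""
    units = []
    rest = query
    for p in KNOWN_PHRASES:
        if p in rest:
            units.append(p)
            rest = rest.replace(p, " ")
    # Default reduction: drop stop words among the remaining single terms.
    units.extend(t for t in rest.split() if t not in STOP_WORDS)
    return units

def transform(query):
    return phrase(anti_phrase(normalise(query)))
```

With these lists, the query "Give me information about Mario Lemieux" is reduced to the single unit "mario lemieux", which can then be evaluated against a normalised index.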

                                         Figure 2.4

                        The steps in a basic information retrieval process.

    Result set analysis may include:
      - Ranking: documents are ranked, given the particular
         ranking function of the retrieval model in use.
      - Clustering/relevance feedback: Search engines such as
          Alltheweb, Altavista and Google apply clustering (Baeza-
          Yates and Ribeiro-Neto 1999; Manning & Schutze,
          1999), that is, they cluster retrieved documents based
          on statistical similarity measures. Normally, only the N
          highest ranked documents are included in the
          clustering. To enable relevance feedback (Robertson
          and Sparck-Jones, 1976), the user is asked to select the
          most promising (cluster of) documents, which may
          again trigger a re-evaluation and presentation of the
      - Categorisation: if the search engine uses categories (Gulla
         et al., 2002; Sebastiani, 2002; Aas & Eikvil, 1999) in
         the index, the result set can be matched against the
         category definitions, in order to determine which
         category the document belongs to. Categories may be
         used in a similar manner to clusters to assist the users
         in subsequent query refinement, e.g. by limiting further
         search to a particular category, or by including category
         terms in the search expression.

2.5.3.    Query formulation (notation support)
Query formulation is generally performed in "natural language", albeit
most systems now adopt some common notation found in web search
engines. There are two widely used query notations – the so-called
MATH notation and the Boolean notation (Sullivan, 2001b). These
notations are presented in table 2.1.
                                                  Table 2.1
          Feature                      MATH notation                   Boolean notation
          Include term                 + term                          AND term
          Exclude term                 - term                          NOT term
          Match any term               Default                         OR combinations
          Match all terms              Default or use of '+'           AND combinations
          Nesting                      -                               (Use of Parentheses)
          Near                         -                               NEAR N Words
          Phrasing                     "use of quotes"                 "use of quotes"
          Stemming                     Option                          Option
          Approximate search           Option or wildcards             Option or wildcards
          Feature search               feature:                        feature:
                     Search engine query notations - abstracted from (Sullivan, 2001b)

The MATH and Boolean notations are not directly compatible, even though
they employ largely the same mechanisms for expressing a query.
The Boolean notation offers the ability to create nested logical query
expressions, perhaps at the expense of ease-of-use for novice users
compared with the MATH notation.
The "near" operator refers to the proximity of terms in the documents
and normally refers to the word distance between two query terms.
As defined earlier, the technique of phrasing refers to detecting term
combinations and sequences that should be handled as a unit rather
than as separate words, such as proper names or compound noun
phrases.
Approximate matching can be viewed as a form of handling spelling
corrections or syntactic variations in query terms. Google
exemplifies this quite neatly in their "Did you mean (most common
spelling)" technique, in which the terms of the query are matched
against the index in order to detect if the query uses another form than
the most common use in the index (Google, 2003).
Feature-based search allows for searching in the extracted contextual
meta data about a document – such as its title, location (URL), etc.
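
As a rough illustration of the MATH notation in table 2.1, the sketch below splits a query into required (+), excluded (-) and optional units and evaluates them against a document text. The function names are our own, quoted phrases are handled with Python's shlex module, and matching is done on plain substrings; a real engine would evaluate against its index instead. Queries containing an unbalanced quote or apostrophe are not handled.

```python
import shlex

def parse_math_query(query):
    """Split a MATH-notation query into required, excluded, optional units."""
    required, excluded, optional = [], [], []
    # shlex.split keeps a double-quoted phrase together as one unit.
    for token in shlex.split(query.lower()):
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, excluded, optional

def matches(text, required, excluded, optional):
    """Return True if the text satisfies the parsed query."""
    text = text.lower()
    if any(unit in text for unit in excluded):
        return False
    if not all(unit in text for unit in required):
        return False
    # Without required units, fall back to "match any" on optional units.
    return bool(required) or not optional or any(u in text for u in optional)
```

For example, the query `+retrieval -web "conceptual modelling"` requires "retrieval", excludes "web" and treats the quoted phrase as an optional unit.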

2.5.4.    Result presentation and query refinement
In most web search engines today, advanced techniques are applied on
the query results, mainly in order to enhance presentation to the user
and to support subsequent query refinement. In this section, we will
define how the most common techniques apply to our model-based
As long as the "information item" of an IR system is a document,
retrieved results will be presented as a ranked list, even if more
advanced ways of responding are on the research agenda, such as
"Question Answering" (Voorhees & Tice, 1999) or "Topic summarisation"
(Radev et al., 2000).
Ranking schemes are applied in order to present the most relevant
documents first. Most commercial IR systems apply the vector model, in
which ranks are calculated by comparing a tf*idf weighted term-document
vector with a similarly weighted query-term vector. The "degree of
similarity", sim(dj,q), for this vector comparison is obtained by
calculating the cosine of the angle between the two vectors (Baeza-Yates &
Ribeiro-Neto, 1999). The query term weights are calculated in a
slightly different manner than tf*idf, using the "standard" calculation of
query term weights by (Salton & Buckley, 1988). In a given setting, one can
establish a threshold for this value, as a limit to what is considered
Most web search engines use an intricate formula for calculating ranks,
which includes mechanisms to boost certain aspects of a hit (e.g. terms
found in the title are treated as more relevant than terms found in the
body). In general, ranking configurations in practical use are the result
of experiments and tuning with the actual IR system.
In order to support query refinement, most retrieval systems today offer
techniques that will support the users in narrowing down the result set.
In relevance feedback, one distinguishes between three sets of
documents: Dr, the relevant documents retrieved; Dn, the non-relevant
documents retrieved; and Cr, the set of relevant documents in the whole
collection. The idea behind a user relevance feedback strategy is to have
the user indicate the set Dr and then calculate a modified query that will
more accurately retrieve and rank documents from the total set of
relevant documents Cr. Supporting relevance feedback necessitates
some mechanism to group the retrieved documents into sets. Two
techniques are common:

    Categorisation: If the search engine supports the use of
     categories, as mentioned earlier, the retrieved set of documents
     will be matched against the categories. The users may then
     choose the most relevant category to elaborate on. Either further
     search may be limited to within this category only, or the category
     documents and the term-based definition of the category may be
     used to calculate a modified query that will retrieve more relevant
     documents from the total collection.

    Clustering: In clustering, terms from the most relevant documents
     (i.e. the n top-ranked ones, or the ones selected by the users), are
     used to retrieve a larger set of relevant documents from Cr. An
     association cluster is calculated based on term similarity, and the
     original query will then be expanded with the highest ranked
     terms according to the similarity measures. Several strategies
     exist, which will result in anything from several small and
     cohesive clusters to larger and more diverse clusters.
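
The calculation of a modified query from the sets Dr and Dn is not spelled out above. One classic formulation is Rocchio's, which we use here purely as an illustration: the modified query moves towards the centroid of the relevant documents and away from the centroid of the non-relevant ones. The parameter values for alpha, beta and gamma are commonly quoted defaults and should be read as assumptions.

```python
from collections import defaultdict

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector (term -> weight).

    query is a dict of term weights; relevant (Dr) and non_relevant (Dn)
    are lists of such dicts, one per document.
    """
    new_query = defaultdict(float)
    for term, w in query.items():
        new_query[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            new_query[term] += beta * w / len(relevant)
    for doc in non_relevant:
        for term, w in doc.items():
            new_query[term] -= gamma * w / len(non_relevant)
    # Terms that end up with a non-positive weight are dropped.
    return {t: w for t, w in new_query.items() if w > 0}
```

If the original query only contains "modelling", and the user marks a document about conceptual modelling as relevant and one about clay modelling as non-relevant, the modified query gains the term "conceptual" while "clay" is suppressed.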

2.6. Natural Language Processing and IR
With text-based retrieval, a natural area for improvement is the analysis
and processing of text, both in the indexing process, in the initial
examination of the query and in the presentation of query results.
Natural Language Processing (NLP) techniques are currently applied to
some extent in most web search engines (Gulla et al., 2002), and are an
active area of IR research (Arampatzis et al., 2001; Katz, 1997;
Strzalkowski, 1999; Sparck-Jones, 1999), even if it is still uncertain
whether such techniques significantly improve retrieval in the general
case (Baeza-Yates & Ribeiro-Neto, 1999; Allan, 2000).
The term NLP techniques reflects a view of NLP as a set of individual
techniques at different linguistic levels (morphological, syntactic,
semantic) that can be applied at different stages of the indexing and
retrieval cycle. In the sequel, we will describe the relevant techniques
according to where they are applied in the retrieval cycle.

Text analysis for improved indexing
The selection of index terms is an important step in the indexing
process. Figure 2.5 shows a text analysis process for improved index
term selection, adapted from (Baeza-Yates & Ribeiro-Neto, 1999).
Language detection is important in multilingual environments and open
environments such as the Web. Language detection offers the ability to
provide results in a particular language upon request, but is also a
prerequisite for some of the other linguistic techniques, for example
transliteration and stemming, which in some cases are dependent on
language-specific rules and dictionaries. Language detection is
performed by web search engines such as Alltheweb, by keeping lists of
terms of the supported languages. Words from the document are
matched against each list and the "best match" is selected. More
advanced techniques do exist; examples are calculation of language-
word frequencies, the removal of duplicate words (that exist in several
languages) from the term lists and the use of N-Gram analysis rather
than single terms.
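
The basic word-list approach can be sketched in a few lines. The word lists are tiny, hand-picked examples and therefore assumptions; real lists would be far larger and would have words shared between languages removed, as described above. Ties are resolved arbitrarily in favour of the first language listed.

```python
# Tiny illustrative word lists; real lists would be far larger.
LANGUAGE_WORDS = {
    "english": {"the", "and", "of", "is", "to", "in"},
    "norwegian": {"og", "er", "det", "som", "til", "i", "en"},
    "german": {"der", "die", "und", "ist", "das", "nicht", "in"},
}

def detect_language(text):
    """Return the language whose word list matches most tokens, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(1 for t in tokens if t in words)
        for lang, words in LANGUAGE_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Note that "in" occurs in both the English and the German list; such duplicates are exactly the words that the more advanced techniques remove from the term lists.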
Transliteration amounts to the handling of special language characters,
signs, symbols, accents and punctuation. The main objective is not to
use punctuation and signs to add semantics to the indexing process,
but rather to be able to handle them uniformly, i.e. to perform indexing
and retrieval on normalised lexical forms. For this, it is generally advisable
to define a default procedure and specify exceptions if needed (Fox,
1992). Sometimes, accents, signs and symbols are simply dropped
altogether and the index is built from "stripped" word forms.
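
A possible default procedure for producing "stripped" word forms is sketched below, using Unicode decomposition from the Python standard library. The function names and the set of punctuation characters are our own choices for illustration.

```python
import unicodedata

def strip_accents(word):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def normalise_token(token):
    """Lower-case, strip accents and remove surrounding punctuation."""
    return strip_accents(token.lower()).strip(".,;:!?\"'()")
```

Letters such as the Norwegian "ø" have no canonical decomposition and pass through unchanged, so they are an example of the exceptions that must be specified explicitly if they are to be mapped to a plain letter.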

                                           Figure 2.5

Text analysis process for index term selection. Several linguistic techniques may be applied either as
independent techniques or chained in sequence. Adapted from (Baeza-Yates & Ribeiro-Neto, 1999).

Lemmatization refers to handling of inflections and conjugated word
forms. As with transliteration, it is considered an improvement to be
able to index on a standard form, a base form or lemma, rather than to
construct an index entry for each inflected form of a word.
Lemmatization can be performed either by pure statistical techniques,
through language-specific syntactical and grammatical rules such as the
Porter Stemmer for English (Porter, 1980) or using dictionaries. The
benefit of lemmatization is not always evident, particularly not for
English, which is not a highly inflected language, whereas for more
highly inflected languages, such as German, Eastern European or Asian
languages, it has a beneficial effect (Gulla et al., 2002).
Not all words are interesting to use as index words. Words with limited
semantic content, or words that occur frequently in all documents and
thus do not discriminate between documents, are candidates for
filtering out before indexing. Such words are commonly referred to as
stopwords. Stopwords can be detected by keeping a list of the
commonly accepted stopwords or by calculating word frequencies in the
document collection. If a part-of-speech analysis is performed to
determine the word class of each word, words from particular word
classes, such as articles, prepositions and conjunctions, may also be
considered as stopwords.
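
The frequency-based detection of stopwords can be sketched as follows. The threshold of 80% of the documents is an arbitrary assumption, and the function names are our own; in practice the threshold would be tuned for the collection at hand.

```python
def frequent_terms(docs, threshold=0.8):
    """Return terms that occur in at least `threshold` of the documents."""
    doc_freq = {}
    for text in docs:
        # Count each term once per document (document frequency).
        for term in set(text.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t for t, n in doc_freq.items() if n / len(docs) >= threshold}

def index_terms(text, stopwords):
    """Filter the stopwords out of a text before indexing."""
    return [t for t in text.lower().split() if t not in stopwords]
```

Terms that occur in nearly every document do not discriminate between documents and are therefore filtered out, regardless of their word class.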
Phrasing and antiphrasing of queries were described earlier. Phrasing
refers to the detection of word sequences that clearly belong together
and should be treated as a unit rather than individual words, such as
"New York", "Conceptual modelling", "Emergency response unit". Phrasing
is normally performed using dictionaries of the most common phrases
in a language, often extracted from the document collection. A variant of
phrasing is detection of proper names, such as "Tim Berners Lee" and
"United Kingdom". Names can be detected in the same manner as the
other phrases or by keeping a separate dictionary of common names.
Such extraction of common phrases from a document collection or a
"reference set" of documents may be performed purely statistically, by
detecting recurring patterns of words. However, for indexing purposes it
is generally considered that noun phrases are the ones carrying most of
the semantics of a text (Baeza-Yates & Ribeiro-Neto, 1999; Jurafsky &
Martin, 2000) [6]. Compound noun phrases are common, especially in
English, like "natural language processing", "computer science",
"semantic modelling of documents". Detection of noun phrases (NPs)
can be performed grammatically by examining the word classes of each
term in the text and then selecting adjective noun sequences. Such NP
detection requires tagging the text with word classes (Jurafsky & Martin,
2000), a task that either requires tedious grammatical rules (Voutilainen &
Heikkila, 1993; Heikkila, 1995) or thorough statistical training of the
tagger (Brill, 1995). Furthermore, even if adjective noun sequences are
detected, it is not evident how to select the sequences; should all
adjectives and nouns in a sequence be considered equally important?
What is the "head" of the NP? How do we handle permutations such as
"processing of natural language" vs. "natural language processing"?

    [6] This is not universal; in a domain that is related to events in time, sequence of actions or processes, for
    example in business process analysis, much semantics will also be found in verbs. In some language
    analysis techniques, verbs are considered the most complex word class and the main constituent of a
NP detection and indexing on NPs take indexing a small step towards
a higher semantic level than single term indexing, but NPs are still "just
words". A higher level of semantics in indexing would be reached if one
were able to detect and index on concepts. Concept extraction from
running text is another NLP technique that has yet to mature, but
several approaches to this exist. Concept extraction examines term
similarity in order to cluster different terms that seem to refer to the
same concept, i.e. a lexicalized definition of a concept sometimes
denoted a synset. NLP techniques applied in such term clustering are
co-occurrence and co-location analysis and standard statistical
clustering techniques. The aforementioned Latent Semantic Indexing
approach applies singular value decomposition to reduce the dimension
of the document-term matrix and can logically be viewed as a term
co-occurrence analysis, even if a different mathematical apparatus is applied
(Manning and Schutze, 1999). LSI is the most mature approach to
semantic indexing that has reached IR applications and experiments.
Again, the results from experiments with LSI are still ambiguous
(Manning and Schutze, 1999) in the general case, but some tests show
that LSI seems to give a higher level of recall.
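
A simple form of co-occurrence analysis can be sketched as follows: two terms are considered related if they tend to occur in the same documents. The use of the Jaccard coefficient as similarity measure is our own choice for illustration; the approaches cited above apply a variety of statistical measures, and co-occurrence alone indicates association between terms rather than synonymy.

```python
from itertools import combinations
from collections import Counter

def cooccurrence_counts(docs):
    """Count, for each unordered term pair, the documents containing both."""
    pair_counts, term_counts = Counter(), Counter()
    for text in docs:
        terms = sorted(set(text.lower().split()))
        term_counts.update(terms)
        # Pairs are stored as alphabetically sorted tuples.
        pair_counts.update(combinations(terms, 2))
    return pair_counts, term_counts

def jaccard(a, b, pair_counts, term_counts):
    """Jaccard similarity of the document sets of two terms."""
    both = pair_counts[tuple(sorted((a, b)))]
    either = term_counts[a] + term_counts[b] - both
    return both / either if either else 0.0
```

Term pairs with a similarity above some threshold can then be passed on to a standard clustering technique in order to form candidate concepts.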
Work in thesaurus generation furthermore strives to detect lexical
relations such as hyponymy/hypernymy, antonymy and meronymy from
text. WordNet (Miller, 1995; Fellbaum, 1998) and its related projects
such as EuroWordNet are the most commonly known examples of such
databases of lexical relations that have also been applied in IR systems.
(Strzalkowski et al., 1998) show that the application of such thesaurus
based or "concept" based indexing does not give a large effect on
retrieval in the general case, but success stories seem to be very
dependent on the nature of the domain and the document collection.
Linguistic and data mining techniques for concept extraction from text
are treated in more detail in section “Ontology learning tools”.

NLP based query enhancement
Advanced NLP techniques are computationally costly, especially when
applied to full text documents. These costs have been a major obstacle
to applying these techniques in large-scale IR settings, such as
Web search engines. Queries, on the other hand, are much shorter (the
average query on a web search engine is approximately 4 words) and
can be treated in more detail. NLP based query analysis often amounts
to transforming the query, either by adding information or by preparing
it for better matching with the index.
If NLP techniques such as stemming, phrasing or NP extraction are
applied in the indexing process, such measures must also be applied to
the query terms in order to prepare for proper matching of the query.
Query analysis also has the benefit of utilizing the index or lexicons
extracted from the already indexed text collection in order to expand
query terms with term variants or to perform spellchecking of the query.
As mentioned, this is applied for example in Google (Google, 2003). Use
of synsets or thesauri is also more feasible for queries: the terms of
a query may be expanded with their synonyms, or the lexical relations
in the thesaurus may be used to offer users suggestions for refining
their queries. Such strategies for query expansion are computationally
feasible even in large scale settings and do not require any
modifications to, or added complexity in, the regular indexing process.
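The synonym-based query expansion described above can be sketched in a few lines. The thesaurus below is an invented toy stand-in for WordNet synsets or a domain thesaurus, and the function name expand_query is our own; the sketch only illustrates that expansion touches the short query and leaves the index untouched.

```python
from typing import Dict, List

# Hypothetical toy thesaurus: each term maps to a list of synonyms.
# A real system would draw these from WordNet synsets or a domain thesaurus.
THESAURUS: Dict[str, List[str]] = {
    "car": ["automobile", "vehicle"],
    "buy": ["purchase"],
}


def expand_query(query: str, thesaurus: Dict[str, List[str]]) -> List[str]:
    """Return the query terms, each followed by its synonyms, without duplicates."""
    expanded: List[str] = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("buy used car", THESAURUS))
# ['buy', 'purchase', 'used', 'car', 'automobile', 'vehicle']
```

The expanded term list would then be submitted to the unchanged index, which is why the cost stays proportional to the length of the query rather than to the size of the collection.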
Most search engines today also provide document categories similar to
the Yahoo8 directories. A category in this sense may be defined simply
by the documents it contains or by lists of weighted terms significant for
each category. As mentioned, categories may be exploited in query
processing either by directing/restricting the query to one of the
categories or by expanding the query with terms from the category.
Predefined categories can also be used in a variant of clustering
approaches, so that retrieved documents are presented grouped by the
categories to which they belong.

NLP for advanced result presentation
The presentation of retrieval results as an endless list of documents that
the user will have to browse through in order to find the relevant piece of
information is not necessarily the ideal way to answer a query. NLP
analysis of the retrieved documents provides possibilities for more
advanced result presentation.
One quite common approach in modern web search engines (e.g.
Alltheweb, Altavista9) is clustering of retrieved documents. This is
normally done by examining terms or document vectors for the top set
of retrieved documents and then clustering them into ad hoc categories or
topics. Different clustering algorithms exist for dividing documents into
either flat or hierarchic clusters (Manning and Schutze, 1999 ; Jardine
and van Rijsbergen, 1971 ; Willett, 1988).
Clusters may be constructed according to several strategies (Baeza-
Yates & Ribeiro-Neto, 1999), for example by examining the document-
term matrix, by analysis of term co-occurrence or co-location or by
calculating synsets or small thesauri for the retrieved documents.
Clustering not only enables the grouping of retrieved documents, but also
enables relevance feedback (Salton 1971 ; Rocchio, 1971 ; Robertson
and Sparck Jones, 1976), in which users select the set of documents -
or cluster - that seems the most relevant and the query is evaluated or
executed again, either by re-weighting document vectors for the
documents in the cluster or by expanding the query with terms
extracted from the cluster topic or the documents within. A variant of
relevance feedback is the pseudo-feedback approach (Buckley et al, 1998 ;
Kwok and Chan, 1998), where a co-occurrence analysis extracts terms
from the top n retrieved documents and uses these to expand the original
query, which is then executed again, i.e. without actual user feedback.
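A minimal sketch of pseudo-feedback is given below. It is a simplification under stated assumptions: plain term frequency in the top n documents stands in for a proper co-occurrence analysis, the stopword list and the three example documents are invented, and the function name pseudo_feedback is ours.

```python
from collections import Counter
from typing import List

# Assumed minimal stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "in", "for", "is", "to"}


def pseudo_feedback(query: List[str], ranked_docs: List[str],
                    top_n: int = 2, extra_terms: int = 2) -> List[str]:
    """Expand `query` with the most frequent new terms in the top_n documents."""
    counts: Counter = Counter()
    for text in ranked_docs[:top_n]:
        for token in text.lower().split():
            if token not in STOPWORDS and token not in query:
                counts[token] += 1
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    best = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return query + [term for term, _ in best[:extra_terms]]


# Documents as returned (already ranked) by a first execution of the query "rdf".
ranked = [
    "rdf schema vocabulary for metadata",
    "metadata statements in rdf schema",
    "cooking recipes for pasta",
]
print(pseudo_feedback(["rdf"], ranked))
# ['rdf', 'metadata', 'schema']
```

The expanded query is then executed again; no user interaction takes place between the two runs, which is what distinguishes this variant from regular relevance feedback.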
Text summarization (Sparck-Jones & Endres-Niggermeyer, 1995 ;
Goldstein et. al., 1999) is a technique where either each retrieved
document is summarized into a smaller "abstract" or a single
summarized text on the whole query topic is generated. Text
summarization is based either on computing semantic
structures from retrieved text fragments and then generating new text
over these, or on simply ranking and extracting relevant sentences from
retrieved documents. Some search engines such as Ask Jeeves10 or Co-
Brain11 have experimented with text summarization in a question
answering manner, i.e. trying to respond to natural language questions
with factual information extracted from the text of the retrieved
documents.

2.7.       The Semantic Web
The most important initiative on semantic web technology is of course
the Semantic Web:
         "The Semantic Web is an extension of the current web in which
            information is given well-defined meaning, better enabling
            computers and people to work in cooperation. "
                                                             (Berners-Lee et. al., 2001)
The main purpose of the semantic web is to enable intelligent
         The Semantic Web addresses this problem in two ways. First, it will
           enable communities to expose their data (...). Secondly, it will
           allow people to write (or generate) files which explain - to a
           machine - the relationship between different sets of data. For
           example, one will be able to make a 'semantic link' between a
           database with a 'zip-code' column and a form with a 'zip' field
         that they actually mean the same thing. This will allow machines
         to follow links and facilitate the integration of data from many
         different sources. (...) This notion of being able to semantically
         link various resources (documents, images, people, concepts,
         etc) is an important one. With this we can begin to move from
         the current Web of simple hyperlinks to a more expressive
         semantically rich Web, a Web where we can incrementally add
         meaning and express a whole new set of relationships
         (hasLocation, worksFor, isAuthorOf, hasSubjectOf, dependsOn,
         etc) among resources, making explicit the particular contextual
         relationships that are implicit in the current Web.
                                   (Berners-Lee & Miller, 2002, emphasis added)
For this, a layered architecture is proposed (figure 2.6). The layers that
are of interest to us are the data layer (RDF + RDFS) and the semantic
layer (the Ontology vocabulary). We describe each of these layers
below.

2.7.1.     Resource Description Framework (RDF) and Schema (RDF-S)
The Resource Description Framework (RDF) (W3C-RDF, 1999) and its
corresponding schema definition facility RDF-S (W3C-RDFS, 2002) are
the emerging standard for Web-resource descriptions and meta data
statements. The RDF language defines a simple model for expressing
meta data statements. A statement in RDF is a triple, subject - predicate
- object, which assigns properties (the predicate)
and values (the object) to resources (the subject). The object (or value)
can be a literal (i.e. a string value) or another resource. An example RDF
model is shown in figure 2.7.
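To make the triple structure concrete, the following sketch represents statements as plain Python data. The class names and all URIs are hypothetical and chosen for illustration; this is not the API of any RDF toolkit.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Resource:
    uri: str        # a resource is identified by its URI


@dataclass(frozen=True)
class Literal:
    value: str      # a literal is a plain string value


@dataclass(frozen=True)
class Statement:
    subject: Resource
    predicate: Resource             # properties are themselves resources
    obj: Union[Resource, Literal]   # the value: another resource or a literal


def describe(statement: Statement) -> str:
    """Render a statement and say whether its object is a literal or a resource."""
    kind = "literal" if isinstance(statement.obj, Literal) else "resource"
    return f"{statement.subject.uri} --{statement.predicate.uri}--> {kind}"


# Hypothetical example data.
proposal = Resource("http://example.org/proposal")
author = Resource("http://example.org/people/author1")
statements: List[Statement] = [
    Statement(proposal, Resource("http://example.org/terms/title"), Literal("A proposal")),
    Statement(proposal, Resource("http://example.org/terms/author"), author),
]
for s in statements:
    print(describe(s))
```

The first statement has a literal value, while the second points to another resource, which may in turn be the subject of further statements.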

                                            Figure 2.6

 The layered architecture of the Semantic Web. At the bottom of the architecture we find standards to
 handle access to (URI) and encoding of (Unicode) data. Above this, XML is selected as the data
 storage structure. The data, which in the semantic web sense is a blend between meta data and data, is
 encoded in RDF/S and serialized into XML for storing. The semantic layer resides above the data,
 encoded in a formal ontology vocabulary. On top of these, mechanisms are to be developed that
 support reasoning about the data, prove their correctness, or ensure their trustworthiness.

The basic RDF model has been designed with flexibility and simplicity in
mind. However, RDF does not provide any means to define the
vocabulary of the domain, that is the semantics of resources and
properties, their classes and relations, and does not provide classes that
can be used to define other classes and properties (W3C-RDFS, 2002).
The declaration of this is performed using the statements of RDF-
Schema. The RDF schema is considered a vocabulary description
language and is specified through RDF constructs.
RDF-S is defined in order to:

       "address the need to create metadata in which statements
          can draw upon multiple vocabularies that are managed in
          decentralised fashion by independent communities. (…) an RDF
          schema will define properties in terms of the classes of the
          resource to which they apply. (…) One of the benefits of the RDF
          property centric approach, is that it is very easy for anyone to
          say anything they want about existing resources, which is one of
          the architectural principles of the Web."
                                                                      (W3C-RDFS, 2002)

As figure 2.7 shows, the RDF-S constructs can be superimposed over an
existing RDF model in order to add structure and relations between the
resources and properties. The figure also illustrates the use of
namespaces in RDF – where each object is prefixed with the namespace
to which it belongs.

                                                Figure 2.7

Small example of RDF statements and RDF-Schema structures. RDF resources are drawn as ellipses, while
properties are pointers. The resource called http://…./proposal/ in the lower left corner has two properties:
its title and its author. The title is a literal property, i.e. a string. The author is itself a resource. Schema
definitions are superimposed to declare types of resources (the proposal is of type Document, the author is of type
Person, which in turn is a subclass of Agent). (Brickley & Guha, 2002)
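The effect of superimposing schema statements on instance statements can be illustrated with a small sketch. Triples are plain string tuples, the ex: names are hypothetical, and only the two constructs used in the figure (rdf:type and rdfs:subClassOf) are handled; the full RDF-S semantics is considerably richer.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Instance statements and schema statements live in the same set of triples.
triples: Set[Triple] = {
    ("ex:proposal", "ex:author", "ex:person1"),
    ("ex:proposal", TYPE, "ex:Document"),
    ("ex:person1", TYPE, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
}


def superclasses(cls: str, graph: Set[Triple]) -> Set[str]:
    """All classes reachable from `cls` by following rdfs:subClassOf transitively."""
    found: Set[str] = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if s == current and p == SUBCLASS and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def types_of(resource: str, graph: Set[Triple]) -> Set[str]:
    """Declared types of `resource` plus the types implied by the class hierarchy."""
    direct = {o for s, p, o in graph if s == resource and p == TYPE}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls, graph)
    return inferred


print(sorted(types_of("ex:person1", triples)))
# ['ex:Agent', 'ex:Person']
```

The author is only declared to be a Person; its membership in Agent follows from the schema statement, without any change to the instance statements themselves.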

2.7.2.     Meta data statement construction
RDF statements are the fundamental mechanism for expressing meta
data. Statements provide a very generic and flexible way of expressing
metadata. Initially there is no RDF model.12 The model manifests itself in
the RDF-Schema referenced in the statements. However, one is free to
express statements without a schema. Schema-conformance only
becomes important in application-specific settings, for example for
query languages that exploit schema information in order to provide
structural queries. A statement is simply a triple (subject, predicate,
object) that assigns a property to a resource.

    The Subject is always an rdfs:Resource. The resource is identified
     by the URI pointing to "it", or rather to the meta data document
     describing it.

    The predicate denotes the property being assigned to the
     resource. A property in RDF is both considered an attribute of the
     resource and possibly a relation between two resources. The
     rdf:Property class is a subclass of the rdfs:Resource class.

    The object of a statement is the value of the property. Objects are
     either other resources or literal values. In the latter case, the
     statement is often called a lexical statement.

2.7.3.     Examples of RDF applications and tools
Even though RDF/S is still developing, and must still be considered a research
project, a large number of applications and tools are being developed.
Coarsely, RDF tools and applications can be divided into:

    RDF editors and parsers: With the complexity of the RDF/S
     language, as well as its XML serialization syntax, editors must be
     provided for users to produce RDF. Several editors and
     corresponding parsers have been proposed. The most recent
     editor is the IsaViz authoring tool (figure 2.8) (W3C-IsaViz, 2003),
     the only tool so far that allows for direct graphical editing of
     RDF/S statements. IsaViz includes support for a graphical
     stylesheet mechanism that allows for constructing views and
     tailoring of the presentation of RDF/S graphs. RDF/S editors are
     still mostly single-user tools. Tools for validation, consistency
     checking and merging of RDF/S statements also exist. Annotation
     tools are not RDF editors per se, but allow users to mark-up web
     pages with RDF statements. Annotation tools are described in
     section (on ontology based annotation tools)

  The notion of "model" is now being considered in RDF, but not in a conceptual modelling or database
 sense, rather as a logical model, viewing RDF statements as assertions (RDF model theory, W3C Working
 draft, April 2002).

                                                Figure 2.8

The IsaViz RDF/S authoring tool. IsaViz supports graphical editing of RDF/S graphs. The editor supports a
wide range of browsing, navigation, search and selection functions. The editor also supports form based
editing and pure-text editing of RDF. IsaViz supports graphical stylesheets that allow for view generation
and tailored presentation of the RDF graphs.

      RDF generators: Some tools allow for generation of RDF
       statements, based on meta data extraction from files, filesystems
       or databases. Some examples are InferEd13, Catalogue14 and
       DR2MAP15. InferEd is a system for inferring RDF from files, which
       includes an RDF editor for the user to finalise the inferred RDF/S
       statements. The Catalogue file explorer can gather metadata from
       files in your filesystem and construct RDF statements from these.
       DR2MAP can export relational database data into RDF/S via
       JDBC/ODBC access to the database.

      RDF Storage and Query tools: RDF/S is considered the data
       storage layer of the semantic web. As such, meta data statements
       in RDF can be viewed as the database or knowledge base of the
       semantic web. Several systems exist for storing and querying of
       RDF/S data. RDF storage and querying is presented in more detail
       in the next section.


                                               Figure 2.9

                  Extended RDF/S example from a cultural portal (Karvounarakis et. al., 2000)

      RDF based meta data standards: Several meta data standards
       have selected RDF/S as the encoding standard. Among these are
       the Dublin Core digital library meta data standard (Weibel et. al.,
       1995) 16 and the PICS17 internet content rating standard.

      RDF based catalogs and directories: Some open catalogues and
       directories use RDF for describing their content. The Mozilla Open
       Directory project18 exports both its organisational structure and
       the directory content as RDF dumps. The UK software mirroring
       service19 uses RDF in order to describe its content. Another
       example is the MusicBrainz20 CD index that uses the Dublin Core
       RDF format to describe the contents of music CDs.


2.7.4.      RDF querying
Query strategies for RDF may be categorised as follows (Magkanaraki et.
al., 2002):
   Semistructured / XML structured: Query strategies that exploit
    the structure of the RDF statements and thus make it possible to
    query for patterns in the storage structure, for example exploiting
    the XML serialisation syntax. Some of these approaches use the
    RDF Triples as a "schema" for querying. That way, one is able to
    search for all “resources” that occur either as a subject, predicate
    or object of a statement, or even to retrieve all occurrences of the
    concept. These approaches commonly do not apply any schema
    information and do not rely on the vocabulary of an ontology or
    the RDF-Schema (e.g. Ontologger, figure 2.10).
   SQL-like: A number of approaches have been introduced to
    develop an SQL-like structured query language for RDF
    (Magkanaraki et. al., 2000 ; Karvounarakis et. al, 2000 ; Haase et.
    al., 2004). These approaches formalise the structure of the RDF-
    Schema or ontology level information, in order to introduce RDF
    query primitives and operations.
   Formal theory reasoning: A third strategy, in line with the AI
    perspective of ontologies and information exchange, is to extend
    RDF in order to build formal world models based on the
    statements ("assertions") and then to include logical primitives in
    these models. This strategy will enable reasoning and planning as
    query modes.

                                           Figure 2.10

The Ontologger query interface. The example shows the definition of the resource “Female” and the results
of a search for the name Female. The results are all statements that include female either as subject,
predicate or object. (Heggland, 2002)
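The semistructured strategy, which uses the triple itself as the only "schema", can be sketched as follows. The data and the function names match and mentions are hypothetical; the sketch does not reproduce the actual Ontologger implementation, only the kind of query its interface illustrates.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical statements, written as (subject, predicate, object).
graph: List[Triple] = [
    ("ex:Female", "rdfs:subClassOf", "ex:Person"),
    ("ex:person1", "rdf:type", "ex:Female"),
    ("ex:person1", "ex:name", "Anna"),
]


def match(graph: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def mentions(graph: List[Triple], term: str) -> List[Triple]:
    """Return every statement in which `term` occurs as subject, predicate or object."""
    return [t for t in graph if term in t]


print(mentions(graph, "ex:Female"))
print(match(graph, p="rdf:type"))
```

Neither function consults any schema information: a class, a property and an instance are all treated simply as positions in a triple.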

One of the most developed query languages for RDF is the RQL language
(Karvounarakis et. al. 2002) from the ICS FORTH RDF Suite21. RQL is a
hybrid query language that enables queries on several levels:
     Basic queries
         -   Find all instances of class (bag of instances)
         -   Find all Properties (bag of source – targets)
     Schema-based queries:
         -   Which properties can appear as domain and range of
             class X?
         -   Find all properties and their range that apply to class X?
         -   Find all information related to class X, that is:
             sub/superclasses, properties/inherited properties
     Data queries
         -   Find resources with property X ?
         -   Find properties of all instances of class X, with value Y ?

Summing up
Initially RDF was meant to be the meta data language for the web,
portraying the meaning22 of metadata (W3C-Metadata, 1997). Nowadays,
it is being referred to as the data-layer, while the semantics of meta data
statements is defined in the ontology layer above.
The main virtues of RDF are:
     At its heart, it contains a simple mechanism for expressing meta
      data statements: Subject, Predicate, Value. This basic model is
      simple, flexible and fairly easy to understand for the expression of
      basic meta data statements (e.g. chapter (S) – title (P) –>
      “summing up” (V) ).
     Anything in RDF statements can be referenced with URIs. Nodes
      and Properties in RDF statements are themselves resources that
      can be referred to. Exploiting this, meta data statements can freely
      link to each other and incorporate statements or definitions
      from other sources. Furthermore, this enables superimposition of
      statements, i.e. enabling people to build statements on top of
      each other, specialise and extend statements without altering the
      original document. This is a fundamental issue in such a
      heterogeneous environment as the Web.
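The superimposition of statements can be illustrated with a small sketch. Both sets of triples and all ex: names are invented for the example; the only point is that a second party adds statements about an existing URI while the published statements remain untouched.

```python
from typing import FrozenSet, List, Tuple

Triple = Tuple[str, str, str]

# Statements published by the original source (immutable here on purpose).
published: FrozenSet[Triple] = frozenset({
    ("ex:chapter2", "dc:title", "Summing up"),
})

# Statements added later by another party. The second one is about a property,
# which is possible because properties are resources with their own URIs.
local_additions: FrozenSet[Triple] = frozenset({
    ("ex:chapter2", "ex:reviewedBy", "ex:person2"),
    ("ex:reviewedBy", "rdfs:subPropertyOf", "dc:contributor"),
})

# Superimposing is a plain union; the published set is not modified.
combined = published | local_additions


def about(resource: str, graph: FrozenSet[Triple]) -> List[Triple]:
    """All statements that have `resource` as their subject, in sorted order."""
    return sorted(t for t in graph if t[0] == resource)


print(about("ex:chapter2", combined))
```

A consumer who trusts only the original source can still work with the published set alone, while others may choose the combined view.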

     ICS FORTH RDF Suite:
  Actually the word “semantics” was never applied to describe RDF by the W3C. However, as
 the meta data activity statement presents, it was meant to declare the meaning of meta data. The
 Semantic Web activity took over from the meta-data activity in 2001.

So far, RDF still has to prove its value and “enter the big market”.
Several critical arguments have been raised about RDF (Butler, 2003):

    RDF is neither a complete nor a formal data model. If it is only to
     function as a data model, why not use a better data model,
     such as the ER model or XML schema?

    RDF does not define the semantics of meta data statements, it is
     just a way of labeling them.

    RDF allows linking of meta data statements, in such a manner
     that several actors may reuse each other’s vocabularies and infer
     statements on top of each other. However, RDF does not include
     mechanisms to support merging of vocabularies or handling of
     inconsistencies. There are no statements to define equality or

    The RDF syntax is too complex. While the initial S,P,V model is
     simple to understand, later additions to RDF, RDFS and its
     serialisation syntax into XML are hard to understand, even for
     trained users. The multiple layers of the Semantic Web
     architecture, where all information is “superimposed” and
     structured into XML, produce an XML document of such
     complexity that it is difficult to read, even for XML parsers.23
The answer to all but the last of these criticisms from the Semantic Web
community is to point to the layered architecture model, and claim that
the higher layers add the necessary semantics and formality. While
this is true, it is still a question whether the RDF technology will prove the
right data model for meta data statements. There seems to be an
overlap between what is defined in the RDF/S languages and the
ontology languages above, and it is not clear that one could not leave out
the RDF layer and go directly from a vocabulary definition in an ontology
language into a more direct storage mechanism, for
example XML. This is the approach that was initially taken in the DAML
ontology language (described in the next section), before it was merged
with the RDF-based OIL language.
The syntax problem is still valid. The answer to that from Semantic Web
advocates is that users should not bother with the syntax, they will be
provided with tools that will help produce good RDF. Still, however, no
such tools exist. All current editing tools, including the graphical IsaViz
editor, expose users to the complexity of the RDF syntax.

  As an anecdote, Tim Bray, one of the participants behind the initial RDF proposals, and currently owner of
 the web domain name has recently posted an RDF challenge, as a personal note of dissatisfaction
 with the recent developments in RDF: The first proposal that can be viewed as an RDF application to spread
 “virally”, will receive this reserved domain name as a reward for the achievement.

A survey of RDF data on the web (Eberhart, 2002) found that in 2002 (5
years after the first RDF proposal), out of 3 million examined pages,
1479 contained valid RDF and 2940 contained invalid RDF. This is a
fairly small number, compared with the 3 billion web pages contained in
the Google search engine. (Eberhart, 2002) concludes that:
     “The use of RDF as a simple meta data format does not make much
       sense at the moment, since HTML meta-tags can do this job just
       fine. (…) Furthermore, the nature of facts shows that the level of
       interconnection is low. (…) Apart from the Dublin Core and the
       Adobe XMP namespaces, hardly any other non-W3C vocabulary
       is used (…).”

The fact that almost twice as many pages contain invalid RDF as contain
valid RDF also indicates that the syntax problem is a real one.
However, RDF is still the emerging standard for resource description
on the web, and is at the heart of the Semantic Web initiative. The really
positive aspects of RDF are the initially simple and well defined
datamodel (subject-predicate-object) and the possibility to gradually
build complex descriptions by superimposition of statements.

2.8. Ontologies

     “To create effective representations, it is an advantage if one knows
       something about the things and processes one is trying to
       represent.”
                                     the Ontologist’s Credo (Smith, 2003)

Ontology as a branch of philosophy is the science of “what is”, that is
the kinds and structures of objects, properties, events, processes and
relations in every area of reality. Philosophical ontology seeks a
classification that is exhaustive in the sense that all types of entities are
included in its classification (Smith, 2003). In information systems, a
more pragmatic view of ontologies is taken, where the ontology is
considered some kind of shared agreement on a domain representation:
     “Ontologies (…) are often able to provide an objective specification
       of domain information by representing a consensual agreement
       on the concepts and relations characterizing the way knowledge
       in this domain is expressed. This specification can be the first
       step in building semantically-aware information systems to
       support diverse enterprise, government and personal activities.”
                                                              (Denny, 2002)

As such, an engineering viewpoint is often taken in ontologies for
information systems, as reflected in a commonly quoted definition:

       “An ontology is an explicit account or representation of [some part
         of] a conceptualization”
                                                            (Uschold & Gruninger, 1996)
A more recent definition is given by Nicola Guarino, a dominant actor in
ontologies for information systems but still heavily inspired from early
philosophical work:

       [An Ontology is] an engineering artifact, constituted by a specific
         vocabulary to describe a certain reality, plus a set of explicit
         assumptions regarding the intended meaning of the vocabulary
         words (…) In the simplest case, an ontology describes a
         hierarchy of concepts related by subsumption relationships; in
         more sophisticated cases, suitable axioms are added in order to
         express other relationships between concepts and to constrain
         their intended interpretation.

                                                                             (Guarino, 1998)
Thus, for practical purposes in information systems, ontologies are
constructed or designed (Gruber, 1995 ; Uschold & Gruninger, 1996).
The need for them varies depending on the participants’ background, their
knowledge of each other as well as the kind and nature of the domain to be
described and the intended usage of the ontology. The design of an
ontology experiences the same problems as any design, e.g. multiple
stakeholders, varying viewpoints, differing needs etc. As a result of this,
“ontologies” are found in a variety of forms on the web today, from the
generic informal taxonomies of Yahoo (quoted as an ontology in (Noy and
McGuinnes, 2001)) to specific and strictly formal ontologies with
reasoning support, like the gene ontology described in (Yeh et. al.,
2003). Figure 2.11 illustrates one kind of distinction.

                                            Figure 2.11

 What is an ontology? (Noy & McGuinnes, 2001) proposes a borderline somewhere between thesauri and
 formalised IS-A hierarchies. In general, several dimensions apply; level of formality, generic-specific,
 content versus application dependent, level of sharing and agreement, etc.

While the process of constructing an ontology as an “exercise” or as part
of a knowledge building initiative has a value in its own right, the
purposes for developing an ontology in information systems vary. In
general terms, there are two common causes:

      To establish a shared understanding of a domain and thereby
       facilitate sharing of information therein.

      To enable the construction of intelligent or “semantic”
       applications.

The Semantic Web needs both. The Semantic Web is not only about
information discovery and sharing; the main focus is to provide intelligent
web-based applications and services. Thus, the Semantic Web needs a
representation that allows “machines to perform useful reasoning tasks
on [the web’s] documents” (W3C-OWL, 2003), that is, it needs
ontologies with a significant degree of structure (W3C-OWLREQ, 2003).
The ontology layer of the Semantic Web builds on the data
representation from the lower layers (cf. figure 2.6).

A number of ontologies are available for reuse, both on the web (cf. for
example the ontologies found at the websites of Ontolingua, Protégé,
and DAML) and as publicly available commercial ontologies.
A distinction is made between upper-level
ontologies and domain or application specific ontologies. An upper
ontology is concerned with abstract and philosophical concepts at the
meta-level, concerning for example the very representation of a concept
or class itself. Upper ontologies are therefore general enough to address
(at a high level) a broad range of domain areas (IEEE, Standard Upper
Ontology Working group24). Good examples of domain specific ontologies
are found in the medical domain with large and long-running initiatives
such as GALEN25, UMLS26, SPRITERM27 and SNOMED28.

Our interest in ontologies is focused on ontology languages, the tools
and methodologies used to build ontologies as well as applications of
ontologies in description and retrieval of information.


2.8.1.       Ontology languages
Several languages for ontology representation have been proposed, and
there is an ongoing debate on the features and properties of such
languages29 (consider for example the Special interest group on
Ontology Language Standards30). Several recent surveys provide an
evaluation and comparison between ontology languages (Bechhofer,
2002 ; Auxilio & Nieto, 2003 ; Su et. al., 2002 ; Noy & Hafner, 1997 ;
Ribiere & Charlton, 2001 ; Corcho & Gomez-Perez, 2000).
In this section, we describe a small selection of Ontology engineering
languages, in order to exemplify the variations of virtues in these
languages as a basis for our own work. Thus, our presentation is not
aimed at a full evaluation of these. We have selected languages from
different “camps” in order to illustrate the variations in the current state
of the art.
According to (Gruber, 1993) knowledge in ontologies can be specified
according to the following five facets: Concepts, Relations, Functions,
Axioms and Instances. Set theoretically, functions are a specific kind of
relation, and we will treat these together. Furthermore, given
our purposes, we are interested in human readable languages, or
languages with a graphical notation. We will present the languages
according to the following aspects, adapted in part from (Corcho and
Gomez-Perez, 2000a):
    Concepts: also denoted class in ontology languages, often
     represent the main constructs in an ontology – as a
     “conceptualization” of a domain. As the examples will show,
     concept is used in a very broad sense in ontological work and we
     will look for further aspects to the definition of concepts:
         -   Definition/Documentation: How is the concept defined?
         -   Attributes/slots: Is it possible to provide attributes for
             the concepts?
         -   Meta-classes: Can classes be instances of other classes?
             In upper-level ontologies and in the process of merging
             ontologies, it is a desired feature to be able to assert
             statements at the meta-level.

   In fact, there seems to be more current activity towards development of languages - and language
 features - for ontology representation than seems to be the case in the Conceptual Modelling
 communities. Also, conceptual modelling languages are not to a large degree considered adequate
 representation languages in the Ontology community, mainly as a result of the desired degree of formality
 and the strong connection to logic that is the basis of much ontology work. Lack of formality is a major
 objection against modelling languages such as UML. The benefit of applying conceptual modelling
 languages would be that they have a long evolution history, in many cases support several modelling
 viewpoints, and offer advanced modelling support through tools and methodologies.
  Ontoweb SIG2 on Ontology Language Standards, Ian Horrocks and Frank van Harmelen (chairs).

    Relations: An ontology is not simply a list of concepts. Many
     ontological languages differ vastly in the way that they handle
     relations. An objection to taxonomies being considered as
     ontologies is the fact that they constitute mere is-a hierarchies
     and do not allow for arbitrary relations between concepts.

      -   Taxonomies: What kind of hierarchies among concepts
          may be represented? Does the language allow for
          disjoint partitions of concepts, exhaustive sub-class
          decompositions or multiple inheritance?

      -   Binary and N-ary relations: how does the language
          support arbitrary n-ary relations? Is there a distinction
          between relations and functions?

    Axioms: What mechanisms are applied in the language to support
     expression of axioms? Axioms are considered important in
     ontologies, in particular in intelligent applications that require
     machine-readable ontologies. Axioms are used to express
     constraints over the structural conceptualisation (given by
     classes, attributes and relations). In formal ontologies, axioms
     may be used to define concepts intensionally, i.e. by stating the
     properties of instances of this concept. Then, as instances are
     added to the knowledgebase, the concept is defined by the
     instances that conform to the intensional definition. Further,
     axioms are used in order to reason over the instance base of the
     ontology, for example to detect inconsistencies or in attempts to
     merge ontologies.

    Instantiation: refers to explicit representation of elements in the
     domain, and is thus a part of the representation of reality that is the
     nature of ontologies. (Corcho & Gomez-Peres, 2000b) distinguish
     between instances, facts and claims: Instances are the elements
     of a domain that are referred to by a given concept. Facts represent
     a relation between elements. Claims represent assertions, and in
     reasoning, it is important to distinguish between claims and facts.

    Human readable/graphical notation: in our approach to search
     and retrieval we are looking for a user-friendly domain model.
     Thus, we are interested in representations of an ontology that can
     be presented to users, directly – or by view generation or
     translation – so that it can be put to interactive use. In our
     presentation of ontology languages, we therefore look for
     mechanisms in the languages that support human readable
     presentations or visualizations.
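
To make the aspects above concrete, the following Python sketch shows how they could fit together in a toy, in-memory ontology. It is purely illustrative and does not correspond to any of the languages surveyed below; the names `Concept`, `Ontology`, `members` and `LongDocument`, and the example instances, are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Concept:
    name: str
    documentation: str = ""                                  # definition/documentation
    attributes: Dict[str, type] = field(default_factory=dict)  # attributes/slots
    parents: Set[str] = field(default_factory=set)           # taxonomy (is-a)


@dataclass
class Ontology:
    concepts: Dict[str, Concept] = field(default_factory=dict)
    # Binary relations stored as (relation name, domain concept, range concept).
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)
    # Axioms: one predicate per concept, used as an intensional definition.
    axioms: Dict[str, Callable[[dict], bool]] = field(default_factory=dict)
    # Instances: element name -> property values.
    instances: Dict[str, dict] = field(default_factory=dict)

    def members(self, concept: str) -> List[str]:
        """Return the instances that satisfy the intensional definition."""
        rule = self.axioms[concept]
        return sorted(n for n, props in self.instances.items() if rule(props))


onto = Ontology()
onto.concepts["Document"] = Concept("Document", "Something created by an author", {"pages": int})
onto.concepts["LongDocument"] = Concept("LongDocument", parents={"Document"})
onto.relations.add(("hasAuthor", "Document", "Person"))
# Intensional definition: a LongDocument is any instance with more than 100 pages.
onto.axioms["LongDocument"] = lambda props: props.get("pages", 0) > 100
onto.instances["thesis-1"] = {"pages": 271}
onto.instances["memo-7"] = {"pages": 2}

print(onto.members("LongDocument"))
```

The concept `LongDocument` is never populated by hand; its extension follows from the axiom as instances are added, which is the behaviour described for intensional definitions above.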

As mentioned, several recent studies have evaluated ontology languages.
Two things motivate our selection of languages. First, we want to
illustrate the variety of languages that are being applied. Second, we
have selected the languages that we consider most relevant, given the
scenarios we are targeting. Our selection of languages includes:

      - Ontolingua: collaborative construction of ontologies (Farquhar et
        al., 1996)

      - CyC/CyCL: The encyclopedia of common sense knowledge (Lenat,
        1995)

      - OIL, DAML, DAML+OIL: The proposed recommendation for
        semantic mark-up on the web (W3C-DAML+OIL, 2001).

      - OWL: The Web Ontology language for the semantic web
        (W3C-OWL, 2003)

      - TopicMaps: An ISO standard for defining topic-based indexes over
        resources (Pepper & Moore, 2001)

      - WordNet: The ontology of English words and word senses (Miller,
        1995)

      - UML Class diagrams: The unified modelling language, perhaps the
        most widely used conceptual modelling language.

Ontolingua
Ontolingua (Farquhar et al., 1996) is an ontology development
environment with tools for collaborative editing, modification and
browsing of ontologies. Ontolingua is interesting as one of the first large
web-based ontology environments, and because it has focused on
collaborative ontology development from the very beginning. The term
‘ontolingua’ refers to both the toolset and the language. The Ontolingua
language is based on KIF (the Knowledge Interchange Format) and
Frame-Ontology. KIF is a declarative semantics language originally
designed for knowledge exchange between agents. The Frame-Ontology
(FO), which is built on top of KIF, adds support for frames, which make
it possible to define ontologies in object-oriented terms (classes,
relations, sub-classes, etc.). Ontolingua is currently being used within
the Rapid Knowledge Formation project at Stanford (McGuinnes, 2003).


                                        Table 2.2
Concepts          Most statements in Ontolingua are represented through terms. A term is any object that has a
                  definition (classes, slots, instances, relations and functions).
                  Ontolingua distinguishes between Class and Frame. A class is considered a “representation for
                  grouping of similar terms”. Classes are defined through the definition of their slots and
                  axioms. Frames are “named data structures used to represent some concept in a domain”.
                  Frames are used to group related statements about the domain concept, and can contain
                  classes, instances, slots, facets, functions, relations and axioms.

Attributes        Attributes are treated through Slots and Values. Both class and instance slots are
                  Facets represent information about slots, such as cardinalities and value types.

Relations         Supports N-ary relations and functions with definition of cardinalities, domain and range of
                  the relation. Functions are distinguished from relations in the same way as in set theory.
                  Taxonomical relations are supported by defining them as relations. Example ontologies
                  show definitions of generalisations specified by Subclass of and Superclass of statements,
                  and Decomposition of and Disjoint decomposition relations. Instance relations are
                  supported through Instance of.

Instances         Ontolingua distinguishes between instances and individuals. Instances are instances of a class,
                  and can themselves be classes (e.g. the class Tourists can be an instance of the class
                  Persons). Individuals are terms that are not classes (Jim, the tourist, would be an individual).

Axioms            Distinguishes between assertions and axioms. Assertions are any statements that are true in the
                  ontology, including axioms, values on slots, etc. Assertions are the statements that are
                  explicitly asserted by the modeller (by pushing an assert button in the editor).
                  Axioms are statements in prefix first-order logic, based on the KIF language. Assertions in
                  Ontolingua contain statements of logical implications and equivalence, logical connectors
                  (and, or), negation (not) and universal and existential quantifications.

Human-Readable    Ontolingua represents ontologies in a text-only, Lisp-like syntax, which is awkward to read for
presentation      untrained users. The Ontolingua ontology server breaks up the statements and presents
                  these in a form-based web interface, with navigation facilities through the slots and relations,
                  but users are still exposed to the basic syntax of the ontology. Thus, Ontolingua ontologies are
                  in general readable for trained users only.

Tool support      Ontolingua is both the language and the tool. The ontolingua server was one of the first web-
                  based tools for collaborative ontology editing.

                    Description of the Ontolingua ontology language

CyCL
The CyC project is an attempt to build a large knowledge
base of “common sense knowledge” (Lenat, 1995). CyC contains over
100,000 atomic terms and more than 1 million handcrafted assertions
over these. As such, CyC constitutes a large general ontology, somewhat
along the lines of an encyclopedia. CyC is arbitrarily extensible, and
compartmentalises its knowledgebase into what are denoted
microtheories, but otherwise there is little systematic organization to the
categorization or granularity of the concepts (Schmidt, 2003). The
UpperCyC ontology has been introduced to amend this. CyCL is the
ontology language within CyC.

                                                    Figure 2.12
     Class Document
     Defined in Ontology: Documents
     Source code: documents.lisp
     Disjoint-Decomposition :{
     Book ,Miscellaneous-Publication ,Periodical-Publication ,Proceedings ,Technical-Report ,Thesis }
     Value-Type :Class-Partition

     Documentation :
     A document is something created by author(s) that may be viewed, listened to, etc., by some audience. A
     document persists in material form (e.g., a concert or dramatic performance is not a document).
     Value-Type :String

     Domain-Of :
     Has-Author ,Has-Series-Editor ,Has-Translator ,Number-Of-Pages-Of ,Organization-Of ,Publication-Date-Of
     ,Publisher-Of ,Series-Title-Of ,Title-Of

     Subclass-Of :
            Individual-Thing ,Individual ,Thing
     Superclass-Of :
            Book ,Cartographic-Map ,Computer-Program ,Doctoral-Thesis ,Edited-Book ,Journal ,Miscellaneous-Publication
            ,Multimedia-Document ,Periodical-Publication ,Proceedings ,Technical-Manual ,Technical-Report

     Template Slots:
     Has-Author :
            Value-Type :Author
     Has-Series-Editor :
            Value-Type :Person
     Has-Translator :
            Value-Type :Agent
     Number-Of-Pages-Of :
            Maximum-Cardinality :1
            Value-Type :Natural
     Organization-Of :
            Maximum-Cardinality :1
            Value-Type :Organization
     Publication-Date-Of :
            Maximum-Cardinality :1
            Value-Type :Calendar-Year
     Publisher-Of :
            Maximum-Cardinality :1
            Value-Type :Publisher
     Series-Title-Of :
            Maximum-Cardinality :1
            Value-Type :Title
     Title-Of :
            Maximum-Cardinality :1
            Slot-Cardinality : 1

  The Definition of a “Document” in an example Ontolingua ontology. The human readable definition is
  given under the “documentation”. The example shows disjoint decomposition, subclasses and
  superclasses of this concept, as well as some of the attributes. (Example extracted from the
  documents ontology in the online Ontolingua library:
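
The role of slot facets such as Value-Type and Maximum-Cardinality in the figure above can be sketched in a few lines of Python. This is a simplified illustration and not Ontolingua itself: the names `Facet`, `DOCUMENT_SLOTS` and `check_frame` are hypothetical, only three of the slots are included, and plain Python types stand in for the Ontolingua classes Author, Natural and Calendar-Year.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Facet:
    value_type: type                       # stands in for Value-Type
    max_cardinality: Optional[int] = None  # stands in for Maximum-Cardinality


# Hypothetical subset of the template slots of the Document class.
DOCUMENT_SLOTS: Dict[str, Facet] = {
    "Has-Author": Facet(value_type=str),
    "Number-Of-Pages-Of": Facet(value_type=int, max_cardinality=1),
    "Publication-Date-Of": Facet(value_type=int, max_cardinality=1),
}


def check_frame(slots: Dict[str, Facet], frame: Dict[str, list]) -> List[str]:
    """Return a list of facet violations for the slot values in `frame`."""
    errors = []
    for slot, values in frame.items():
        facet = slots.get(slot)
        if facet is None:
            errors.append(f"{slot}: unknown slot")
            continue
        if facet.max_cardinality is not None and len(values) > facet.max_cardinality:
            errors.append(f"{slot}: at most {facet.max_cardinality} value(s) allowed")
        for value in values:
            if not isinstance(value, facet.value_type):
                errors.append(f"{slot}: {value!r} is not of type {facet.value_type.__name__}")
    return errors


good = {"Has-Author": ["A. Author"], "Number-Of-Pages-Of": [120]}
bad = {"Number-Of-Pages-Of": [120, 300], "Publication-Date-Of": ["2004"]}

print(check_frame(DOCUMENT_SLOTS, good))
for error in check_frame(DOCUMENT_SLOTS, bad):
    print(error)
```

The first frame passes; the second violates one cardinality facet and one value-type facet.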

CyCL is a second order predicate language that allows quantification
over predicates and relations and construction of arbitrary collections.
In addition, CyCL incorporates features to represent natural language,
such as modal and general quantifiers. CyC is interesting as an attempt
to codify common sense within an AI methodology.

                                       Table 2.3
Concepts         All statements in CyCL are specified through first order logic in a Lisp-like (Lisp-based)
                 syntax. CyCL statements are written as formulas.
                 Constants are the basic vocabulary in CyC expressions. Constants can represent individuals,
                 concepts or even collections of concepts. All constants in CyCL have a unique name and
                 follow a naming convention that separates constants from variables.
                 Formulas are asserted into the knowledgebase of the CyC ontology. Every assertion is
                 contained in a microtheory. A microtheory is essentially a collection of assertions that share a
                 common set of assumptions. The microtheory mechanism allows CyC to independently
                 maintain assertions that are, at first glance, contradictory. The CyC knowledgebase currently
                 contains several hundred microtheories.

Attributes       Attributes must be defined through predicates over the constant one wishes to specify
                 attributes for, such as (#$colorOfObject ?theObject ?theColor)

Relations        Again, all statements in CyCL are first order formulas, thus all relations are defined as
                 formulas with predicates, such as (#$likes #$Fred #$Mary). N-ary predicates are supported,
                 but each predicate must be defined with a fixed number of arguments. The arity of a predicate
                 is either defined implicitly, by the number of arguments it is given upon first entry, or explicitly
                 through the #$arity binary predicate. Furthermore, the types of each predicate must be
                 defined through #$isa predicates.
                 Taxonomy building relations are supported through for example the predicates #$isa (element
                 of) and #$genls (subset of).

Instances        Expressed as constants and treated no differently from other constants.

Axioms           Again, all statements are first order logic, thus, axioms are native in CyCL. The language
                 supports a range of logical connectives and quantifications.

Human-Readable   CyCL, with its strong formal basis, Lisp-like syntax and particular naming convention, is
presentation     readable only by highly trained users. However, as CyC is an encyclopedia of
                 commonsense knowledge, all concepts are given a natural language definition, intended for
                 Local (at one site) organizations composed of physicians, support personnel, and usually also
                 administrators. The main function of the organization is to provide medical care (short or long
                 term) to a number of patients/clients, for a fee if the patient/client is able to pay. A clinic
                 services out-patients, while a hospital has in-patients. A hospital may have a clinic as a sub-
                 organization, though.
                 isa: #$ExistingObjectType
                 genls: #$MedicalCareOrganization #$LocalCustomerContactPoint #$MedicalCareInstitution
                 some subsets: (4 unpublished subsets)

Tool support     Web directory, overview + programming API.

                                 Description of CyC/CyCL
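
The ideas of fixed predicate arity, microtheories and the #$genls taxonomy relation can be illustrated with a small Python sketch. It is a toy model under our own assumptions and not the CyC API: the class `KnowledgeBase`, its methods, the microtheory names `HealthMt` and `FictionMt`, and the constants `#$Clinic` and `#$Organization` are all hypothetical.

```python
from collections import defaultdict


class KnowledgeBase:
    def __init__(self):
        self.arity = {}                         # predicate -> number of arguments
        self.microtheories = defaultdict(set)   # microtheory name -> set of formulas

    def declare(self, predicate, arity):
        """Explicit arity declaration (the role played by #$arity)."""
        self.arity[predicate] = arity

    def assert_formula(self, mt, formula):
        """Assert a formula, given as a tuple, into one microtheory."""
        predicate, *args = formula
        # Implicit arity: fixed by the number of arguments on first entry.
        expected = self.arity.setdefault(predicate, len(args))
        if len(args) != expected:
            raise ValueError(f"{predicate} expects {expected} argument(s), got {len(args)}")
        self.microtheories[mt].add(tuple(formula))

    def genls_closure(self, mt, collection):
        """All collections that `collection` is a subset of, within `mt`."""
        result, frontier = set(), [collection]
        while frontier:
            current = frontier.pop()
            for predicate, *args in self.microtheories[mt]:
                if predicate == "#$genls" and args[0] == current and args[1] not in result:
                    result.add(args[1])
                    frontier.append(args[1])
        return result


kb = KnowledgeBase()
kb.declare("#$genls", 2)
kb.assert_formula("HealthMt", ("#$genls", "#$Clinic", "#$MedicalCareOrganization"))
kb.assert_formula("HealthMt", ("#$genls", "#$MedicalCareOrganization", "#$Organization"))
kb.assert_formula("FictionMt", ("#$likes", "#$Fred", "#$Mary"))

print(sorted(kb.genls_closure("HealthMt", "#$Clinic")))
```

Because each assertion lives in exactly one microtheory, the query over `HealthMt` is unaffected by whatever is asserted in `FictionMt`.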

DAML+OIL
OIL (Fensel et al., 2000) is defined as an extension to RDF/S. OIL has
now been paired with DAML, an ontology language built on top of
XML, and the pair (DAML+OIL) is now a central ontology language in
Semantic Web efforts. In particular, it is the language in use in the
ongoing Ontoweb project, and it is proposed as the W3C recommendation
for semantic mark-up of web pages. DAML+OIL is considered an
ontology extension to RDF/S that introduces more formality and richer
modelling primitives. DAML+OIL is interesting to us as a semantic
mark-up language for web documents. Our presentation is based on the
paired language (DAML+OIL) rather than the individual languages (OIL
and DAML).
                                                Table 2.4
Concepts                                 D+O concepts are defined through classes. D+O distinguishes between Class
                                         elements, Class expressions, Property elements and Property restrictions.
                                         Class Elements are the definitions of an object class.
                                         Class Expressions are expressions over one class, collections (enumerations)
                                         of classes or property restrictions.
                                         Property Elements are property names and are either properties that relate
                                         classes (objects) to classes or properties to datatype values.
                                         Property restrictions are particular kinds of class expressions that constrain the
                                         definition of a property.

Attributes                               Attributes are defined as properties. Attributes can be property related to a

Relations                                D+O defines a number of relations on class level, such as: subClassOf,
                                         disjointWith, disjointUnionOf, sameClassAs and equivalentTo, that are mainly
                                         used in class-expressions.
                                         Arbitrary relations are defined as properties with a domain and range for the
                                         relation (property), and with property restrictions specifying constraints on the
                                         relation, for example cardinality. Specific property relations are supported,
                                         such as: subPropertyOf, samePropertyAs, equivalentTo and inverseOf.

Instances                                D+O supports instantiation of both classes and properties.

Axioms                                   Axioms are defined on the classes as characteristics over their properties. No
                                         explicit logic axioms are supported.

Human-Readable presentation              D+O ontologies are encoded in XML. Since D+O is built on top of RDF/S, most
                                         expressions follow the syntax of RDF/S statements, but with additions for the
                                         D+O specific features. Also, D+O statements can be mixed with arbitrary
                                         RDF/S statements. Due to the layering of constructs, the name-spacing
                                         and its complexity, a D+O ontology in its final XML form is difficult to
                                         read for most people (and even for some XML parsers).

                                         Description of DAML+OIL


                                                         Figure 2.13
        <daml:Class rdf:ID="Person">
         <rdfs:subClassOf rdf:resource="#Animal"/>
        </daml:Class>

        <daml:Class rdf:about="#Person">
         <rdfs:comment>every person is a man or a woman</rdfs:comment>
         <daml:disjointUnionOf rdf:parseType="daml:collection">
          <daml:Class rdf:about="#Man"/>
          <daml:Class rdf:about="#Woman"/>
         </daml:disjointUnionOf>
        </daml:Class>

        <daml:Class rdf:ID="Man">
         <rdfs:subClassOf rdf:resource="#Person"/>
         <rdfs:subClassOf rdf:resource="#Male"/>
        </daml:Class>

        <daml:Class rdf:ID="Woman">
         <rdfs:subClassOf rdf:resource="#Person"/>
         <rdfs:subClassOf rdf:resource="#Female"/>
        </daml:Class>

        <daml:ObjectProperty rdf:ID="hasParent">
         <rdfs:domain rdf:resource="#Animal"/>
         <rdfs:range rdf:resource="#Animal"/>
        </daml:ObjectProperty>

        <daml:DatatypeProperty rdf:ID="shoesize">
         <rdfs:comment>
          shoesize is a DatatypeProperty whose range is xsd:decimal.
          shoesize is also a UniqueProperty (can only have one shoesize)
         </rdfs:comment>
         <rdf:type rdf:resource=""/>
         <rdfs:range rdf:resource=""/>
        </daml:DatatypeProperty>

        <daml:Restriction>
         <daml:onProperty rdf:resource="#shoesize"/>
        </daml:Restriction>

        <!-- Intensional class definition: TallThing is the class of things whose hasHeight is tall -->
        <daml:Class rdf:ID="TallThing">
         <daml:sameClassAs>
          <daml:Restriction>
           <daml:onProperty rdf:resource="#hasHeight"/>
           <daml:hasValue rdf:resource="#tall"/>
          </daml:Restriction>
         </daml:sameClassAs>
        </daml:Class>

        <!-- Instance -->
        <Person rdf:ID="Adam">
         <rdfs:comment>Adam is a person.</rdfs:comment>
         <age><xsd:integer rdf:value="13"/></age>
         <shoesize><xsd:decimal rdf:value="9.5"/></shoesize>
        </Person>

     DAML + OIL ontology example; The example shows the definition of the concepts Person, Man and
     Woman. Person is a subclass of Animal and Man and Woman are disjoint subclasses of Person. The
     example further shows the definition of the propertytype Shoesize with a corresponding restriction
     that defines the cardinality of this property to be 1. Further, the example shows the intensional
     definition of a class of tall things.
     Example extracted from (
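
Although such mark-up is hard to read for people, it is straightforward to process mechanically. As a minimal sketch, the following Python fragment uses the standard library XML parser to recover the class hierarchy from a cut-down version of the example. The helper name `superclasses` is our own, the enclosing `rdf:RDF` element and namespace declarations are assumed since the figure does not show them, and a real application would more likely rely on a dedicated RDF toolkit.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "daml": "http://www.daml.org/2001/03/daml+oil#",
}

# Cut-down version of the example; the rdf:RDF wrapper is assumed.
DOC = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:daml="http://www.daml.org/2001/03/daml+oil#">
  <daml:Class rdf:ID="Person">
    <rdfs:subClassOf rdf:resource="#Animal"/>
  </daml:Class>
  <daml:Class rdf:ID="Man">
    <rdfs:subClassOf rdf:resource="#Person"/>
    <rdfs:subClassOf rdf:resource="#Male"/>
  </daml:Class>
  <daml:Class rdf:ID="Woman">
    <rdfs:subClassOf rdf:resource="#Person"/>
    <rdfs:subClassOf rdf:resource="#Female"/>
  </daml:Class>
</rdf:RDF>
"""


def superclasses(xml_text):
    """Map each class defined with rdf:ID to its directly named superclasses."""
    rdf = "{%s}" % NS["rdf"]
    root = ET.fromstring(xml_text)
    hierarchy = {}
    for cls in root.findall("daml:Class", NS):
        name = cls.get(rdf + "ID")
        if name is None:
            continue  # skip rdf:about blocks that only add statements
        hierarchy[name] = [
            parent.get(rdf + "resource").lstrip("#")
            for parent in cls.findall("rdfs:subClassOf", NS)
            if parent.get(rdf + "resource")
        ]
    return hierarchy


for name, parents in superclasses(DOC).items():
    print(name, "->", ", ".join(parents))
```

The sketch only follows `rdfs:subClassOf` elements that name their target directly; restrictions and class expressions nested inside `rdfs:subClassOf` are ignored.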

OWL
OWL is the ongoing specification of the ontology language for the
Semantic Web. OWL is layered above RDF/S, and considers the latter
the “data base” of the Semantic Web, while OWL is concerned with
representing the semantics of the classes and properties used in web
resources, with the goal of providing “greater logical capabilities” and
reasoning support for Semantic Web applications. OWL is divided into
three sub-languages, OWL Lite, OWL DL and OWL Full, which offer
varying degrees of expressiveness and reasoning support. OWL builds
on the primitives in RDF/S, such as classes and properties, and
adds constraint and equality primitives. OWL is
interesting to us due to its position as the ontology language of the
Semantic Web.
Very roughly speaking, there is no big difference between the
DAML+OIL and OWL languages. OWL is more “true” to the RDF/S
languages and incorporates recent updates of these. OWL also clarifies
the naming of several of the DAML+OIL features, and some features of
DAML+OIL that have proven to be unused in the business cases have
been removed. A converter exists for changing DAML+OIL files into OWL.
Many of the original DAML tools are being migrated to OWL.
                                         Table 2.5
Concepts                          OWL builds on RDF/S and its basic constructs are built from RDF/S constructs.
                                  Thus, the basic concept in an OWL ontology is the rdfs:Class.
                                  Class expressions can contain equality and inequality constructs, such as
                                  equivalentClass, disjointWith, unionOf, intersectionOf and complementOf.

Attributes                        Based on RDF, attributes are represented as rdf:Properties.

Relations                         Taxonomic relations are defined based on the RDF/S constructs
                                  subClassOf and subPropertyOf.
                                  Arbitrary relations are defined as properties, with rdfs:domain and rdfs:range
                                  constraints, and with cardinality constraints. Property characteristics are
                                  introduced, such as inverseOf, Transitive, Symmetric and Functional.

Instances                         Instances are defined through the “individual” construct.

Axioms                            Axioms are used to associate class and property IDs with either partial or
                                  complete specifications of their characteristics, and to give other logical
                                  information about classes and properties.

Human-Readable presentation       OWL ontologies are serialised into XML. Currently no visualisation
                                  implementations are available.

Tool support                      Ongoing.

                                      Description of OWL
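
To indicate what the property characteristics in the table mean for reasoning, the following Python sketch derives new facts from a handful of triples by applying inverse, symmetric and transitive characteristics until nothing more can be added. It is a naive illustration and not an OWL reasoner; the function `infer`, the properties `hasChild`, `hasAncestor` and `marriedTo`, and all individuals other than Adam are assumptions made for this example.

```python
def infer(facts, transitive=(), symmetric=(), inverse_of=None):
    """Close a set of (subject, property, object) triples under three
    property characteristics: inverse, symmetric and transitive."""
    inverses = dict(inverse_of or {})
    # inverseOf works in both directions.
    inverses.update({v: k for k, v in list(inverses.items())})
    facts = set(facts)
    changed = True
    while changed:
        new = set()
        for s, p, o in facts:
            if p in symmetric:
                new.add((o, p, s))
            if p in inverses:
                new.add((o, inverses[p], s))
            if p in transitive:
                for s2, p2, o2 in facts:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        changed = not new <= facts   # stop when no new triple was derived
        facts |= new
    return facts


facts = {
    ("Adam", "hasParent", "Bob"),
    ("Adam", "hasAncestor", "Bob"),
    ("Bob", "hasAncestor", "Carl"),
    ("Ann", "marriedTo", "Bob"),
}
closed = infer(
    facts,
    transitive={"hasAncestor"},
    symmetric={"marriedTo"},
    inverse_of={"hasParent": "hasChild"},
)
for triple in sorted(closed - facts):
    print(triple)
```

Three triples are derived here: Adam has Carl as an ancestor, Bob has Adam as a child, and Bob is married to Ann.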

                                                          Figure 2.14
        <owl:Class rdf:ID="Animal">
         <rdfs:comment>This class of animals is illustrative of a number of ontological idioms.</rdfs:comment>
        </owl:Class>

        <owl:Class rdf:ID="Male">
         <rdfs:subClassOf rdf:resource="#Animal"/>
        </owl:Class>

        <owl:Class rdf:ID="Female">
         <rdfs:subClassOf rdf:resource="#Animal"/>
         <owl:disjointWith rdf:resource="#Male"/>
        </owl:Class>

        <owl:Class rdf:ID="Man">
         <rdfs:subClassOf rdf:resource="#Person"/>
         <rdfs:subClassOf rdf:resource="#Male"/>
        </owl:Class>

        <owl:Class rdf:ID="Woman">
         <rdfs:subClassOf rdf:resource="#Person"/>
         <rdfs:subClassOf rdf:resource="#Female"/>
        </owl:Class>

        <owl:ObjectProperty rdf:ID="hasParent">
         <rdfs:domain rdf:resource="#Animal"/>
         <rdfs:range rdf:resource="#Animal"/>
        </owl:ObjectProperty>

        <owl:ObjectProperty rdf:ID="hasFather">
         <rdfs:subPropertyOf rdf:resource="#hasParent"/>
         <rdfs:range rdf:resource="#Male"/>
        </owl:ObjectProperty>

     Example definitions in OWL, covering some of the same classes as the previous DAML+OIL ontology.
     Example adapted from

TopicMaps
Topic Maps (Pepper & Moore, 2001) is an ISO standard for defining
topic-based indexes to resources. Topic maps are based on XML and are
targeted towards web resources. A topic represents any “thing”
whatsoever. Topics can be related, typed and classified. Topic maps are
essentially about indexing, and contain a mechanism for defining
occurrences, i.e. pointers to resources where the topic occurs. The
XML-based topic maps have a visualization counterpart that allows for
navigation and direct access to the occurrences of topics. With its lack
of detailed semantic definitions and reasoning support, Topic Maps is
not always considered an ontology language, but it is interesting to us
particularly because of its focus on indexing and information access.

                                         Table 2.6
Concepts                          The main construct in a Topic Map is a Topic. A topic represents any “thing
                                  whatsoever”. In the construction of a TopicMap, topics are defined as Topic

Attributes                        Topic Characteristics are limited to the topic’s names, the associations it
                                  participates in and what its occurrences are. Topics may have several names.
                                  A distinction is made between BaseName (mandatory) and Display Names
                                  and SortNames (optional).

Relations                         Topics may be associated to each other. A generic relation between two topics
                                  is defined in a topic map as an association type (for example written_by),
                                  where association types themselves are defined as topics. Associations are
                                  binary and bidirectional, but directionality is implied through roles. Every topic
                                  participating in an association fills an association role.

Instances                         A Topic Map defines instances of both topic types and association types
                                  through the “instance of” construct. However, since TopicMaps are intended
                                  for indexing, another level of instantiation is provided through the concept of
                                  Occurrence. An occurrence is a pointer to a resource where the topic occurs.

Axioms                            No.

Human-Readable presentation       Topic Maps have a simple basic formalism that may be visualised graphically.
                                  Several implementations visualise topic maps in the fashion of a MindMap with
                                  node and link symbols. The TopicMap itself is serialized into XML (XML Topic
                                  Maps syntax, XTM).

Tool support                      A wide range of tools is available: graphical editors, navigation systems,
                                  dynamic web site builders and knowledge servers.

                                  Description of TopicMaps

WordNet
WordNet (Miller, 1995) is one of the most developed lexical ontologies.
It is a manually constructed lexical reference with a basic distinction
between nouns, verbs, adjectives and adverbs. WordNet is based on
English, but support for several European languages is under
development as a part of the EuroWordNet project. The basic concept of
WordNet is the SynSet, a set of synonyms. SynSets are primarily
organized into a hierarchy, but richer lexical relations between
words are also provided. Access to WordNet is given by a browser client,
providing dictionary lookup of words, synonyms and relations, but WordNet
is also distributed as a Prolog program, allowing users to build
implementations around the lexical base. WordNet is interesting, as it
has been adopted in several attempts to provide lexical or natural language
support for information retrieval.

                                                            Figure 2.15
        <topic id="verdi">
         <instanceOf><topicRef xlink:href="opera-template.xtm#composer"/></instanceOf>
         <!-- born-in: le-roncole (1813 (10 Oct)) -->
         <!-- died-in: milano (1901 (27 Jan)) -->
         <!-- composed-by: oberto un-giorno-de-regno nabucco i-lombardi ernani i-due-foscari giovanna-darco alzira attila
              macbeth1 i-masnadieri jerusalem-o il-corsaro la-battaglia-di-legnano luisa-miller stiffelio rigoletto il-trovatore
              la-traviata i-vespri-siciliani simon-boccanegra un-ballo-in-maschera la-forza-del-destino don-carlo aida otello
              falstaff -->
         <baseName>
          <baseNameString>Verdi, Giuseppe</baseNameString>
         </baseName>
         <occurrence>
          <instanceOf><topicRef xlink:href="opera-template.xtm#homepage"/></instanceOf>
          <scope><topicRef xlink:href="opera-template.xtm#italian"/>
          <topicRef xlink:href="opera-template.xtm#land-of-verdi"/>
          <topicRef xlink:href="opera-template.xtm#online"/></scope>
          <resourceRef xlink:href=""/>
         </occurrence>
        </topic>

        <topic id="rigoletto">
         <occurrence>
          <instanceOf><topicRef xlink:href="opera-template.xtm#libretto"/></instanceOf>
          <scope><topicRef xlink:href="opera-template.xtm#italian"/>
          <topicRef xlink:href="opera-template.xtm#land-of-verdi"/>
          <topicRef xlink:href="opera-template.xtm#online"/></scope>
          <resourceRef xlink:href=""/>
         </occurrence>
         <occurrence>
          <instanceOf><topicRef xlink:href="opera-template.xtm#premiere-date"/></instanceOf>
          <resourceData>1851 (11 Mar)</resourceData>
         </occurrence>
        </topic>

     Example Italian opera ontology in TopicMaps. The example shows the definition of “Verdi” and his
     opera “Rigoletto” as a topic, along with occurrence pointers to information about these topics.
     Example extracted from the TopicMap starter kit:
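
The way a topic map works as an index, with topics, typed associations whose members fill roles, and occurrences, can be sketched in Python as follows. This is an illustrative data structure and not an implementation of the XTM syntax: the classes `Topic`, `Association` and `TopicMap`, the role names `composer` and `opera`, and the base name given to the Rigoletto topic are assumptions, while the premiere date is taken from the figure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Topic:
    id: str
    base_name: str
    # Occurrences as (occurrence type, resource reference or inline data).
    occurrences: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Association:
    type: str
    roles: Dict[str, str]   # role name -> id of the topic filling the role


class TopicMap:
    def __init__(self):
        self.topics: Dict[str, Topic] = {}
        self.associations: List[Association] = []

    def add_topic(self, topic: Topic) -> None:
        self.topics[topic.id] = topic

    def associate(self, assoc_type: str, **roles: str) -> None:
        self.associations.append(Association(assoc_type, roles))

    def players(self, topic_id: str, assoc_type: str, role: str) -> List[Topic]:
        """Topics filling `role` in associations of `assoc_type` involving `topic_id`."""
        return [
            self.topics[a.roles[role]]
            for a in self.associations
            if a.type == assoc_type and topic_id in a.roles.values() and role in a.roles
        ]

    def occurrences(self, topic_id: str, occ_type: str) -> List[str]:
        return [res for t, res in self.topics[topic_id].occurrences if t == occ_type]


tm = TopicMap()
tm.add_topic(Topic("verdi", "Verdi, Giuseppe"))
tm.add_topic(Topic("rigoletto", "Rigoletto", [("premiere-date", "1851 (11 Mar)")]))
tm.associate("composed-by", composer="verdi", opera="rigoletto")

for opera in tm.players("verdi", "composed-by", "opera"):
    print(opera.base_name, tm.occurrences(opera.id, "premiere-date"))
```

Navigation thus goes from a topic, through an association, to the related topic and on to its occurrences, which is the kind of direct access to resources that makes Topic Maps interesting for retrieval.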

                                              Table 2.7
Concepts                               WordNet is an ontology of English words and their relations. The main
                                       concepts in the WN ontology are the main word classes Noun, Verb, Adjective
                                       and Adverb, plus what is denoted function words. Words are organized around
                                       word meanings. A word meaning is defined by the set of all word forms (the
                                       surface form) that can be used to express it. These sets of word forms are
                                       denoted synsets. There is a many to many relation between forms and
                                       meanings; a word form may have many meanings, and naturally a meaning
                                       may have several forms for expressing it. Synsets are composed of words from
                                       the same word class, in order to support substitutability.

Attributes                             Lexical attributes only.

Relations                              WN is organized according to semantic relations among the words. From its
                                       focus on word meanings and synsets, the most notable semantic relation is
                                       Synonymy. In addition, the semantic relations Antonymy (“opposite” meanings),
                                       Hyponymy/Hypernymy (generalisations) and Meronymy (part of) are supported.
                                       Beyond semantic relations, morphological relations between word forms
                                       are supported in order to handle derivational and inflectional morphology.

Instances                              WordNet is a populated lexicon, hence the instances of the WN ontology are
                                       the words already stored in WN. WN is distributed and accessed with its
                                       wordbase intact.

Axioms                                 No.

Human-Readable presentation            WordNet definitions are human readable; they are, after all, just words. To fully
                                       appreciate the organization of WN, users must understand the semantic
                                       relations between word meanings, such as synonymy, antonymy, etc. Most of
                                       these are used in common language, albeit perhaps not as formally defined as
                                       in the WN database.

Tool support                           Users access the WN base as any other lexicon, either through a web-based
                                       look-up interface, or by installing a specific client, the WordNet browser. Full,
                                       programmatic access is supported through a prolog interface.

                         Description of WordNet according to “Ontological facets”

UML Class Diagrams
The Unified Modelling Language (UML) is perhaps the most widely
adopted language for conceptual modelling in software systems
development. Attempts at constructing ontologies in UML have been
made, along with mappings or translations into RDF/S for applications
on the Semantic Web. UML stems from the OO paradigm of modelling,
and with its basis in modelling classes, attributes, relations, etc., UML
fits in with the class-based ontology languages. The benefit of UML is its
support for different modelling paradigms (activity diagrams, state
charts, class and package diagrams) along with its extensive tool
support. Ontologists, however, tend to dismiss UML as a language for
ontology construction, mainly due to its lack of formality and of support for
axioms and reasoning. To us, UML is interesting, since we ourselves will
base our approach on a conceptual modelling language, rather than an
ontology language.
While its lack of formality has prevented UML from being adopted by
ontologists, at least in the AI sense, approaches to using UML as an
ontology language have been made.

                                                                Figure 2.16
         The noun "hospital" has 2 senses in WordNet:
         1. hospital , infirmary -- (a health facility where patients receive treatment)
         2. hospital -- (a medical institution where sick or injured people are given medical or surgical care)

         Sense 2
         hospital -- (a medical institution where sick or injured people are given medical or surgical care)
         medical institution -- (an institution created for the practice of medicine)

         Coordinate terms
            -> medical institution -- (an institution created for the practice of medicine)
              => clinic -- (a medical establishment run by a group of medical specialists)
              => extended care facility --
          (a medical institution that provides prolonged care (as in cases of prolonged illness or rehabilitation from acute illness))
              => hospital -- (a medical institution where sick or injured people are given medical or surgical care)

         hospital -- (a medical institution where sick or injured people are given medical or surgical care)
               HAS PART: coronary care unit --
          (a hospital unit specially staffed and equipped to treat patients with serious cardiac problems)
                HAS PART: intensive care unit, ICU -- (a hospital unit staffed and equipped to provide intensive care)

         Sense 2
         hospital -- (a medical institution where sick or injured people are given medical or surgical care)
              => medical institution -- (an institution created for the practice of medicine)
                => institution, establishment -- (an organization founded and united for a specific purpose)
                   => organization, organisation -- (a group of people who work together)
                      => social group -- (people sharing some social relation)
                        => group, grouping -- (any number of entities (members) considered as a unit)

     Wordnet example that shows the definition of hospital. Hospital has two word senses in WordNet. Sense 2 has two synonyms. The
     example further shows the semantic relations Meronyms, Hypernyms as well as the coordinate terms for this concept.
     The example is composed by repeated queries to the online version of wordnet (
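
The organisation shown in Figure 2.16 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `Synset` and the helpers `senses` and `hypernym_chain` are hypothetical names, not part of the WordNet distribution or its Prolog interface, and the data are copied by hand from the figure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Synset:
    """One word meaning, defined by the word forms that can express it."""
    forms: Tuple[str, ...]
    gloss: str
    hypernym: Optional["Synset"] = None                  # generalisation
    parts: List["Synset"] = field(default_factory=list)  # meronyms (HAS PART)


# Data copied by hand from Figure 2.16 (hypothetical in-memory representation).
group = Synset(("group", "grouping"),
               "any number of entities (members) considered as a unit")
social_group = Synset(("social group",),
                      "people sharing some social relation", group)
organization = Synset(("organization", "organisation"),
                      "a group of people who work together", social_group)
institution = Synset(("institution", "establishment"),
                     "an organization founded and united for a specific purpose",
                     organization)
medical_institution = Synset(("medical institution",),
                             "an institution created for the practice of medicine",
                             institution)
icu = Synset(("intensive care unit", "ICU"),
             "a hospital unit staffed and equipped to provide intensive care")
hospital_1 = Synset(("hospital", "infirmary"),
                    "a health facility where patients receive treatment")
hospital_2 = Synset(("hospital",),
                    "a medical institution where sick or injured people are "
                    "given medical or surgical care",
                    medical_institution, [icu])

SYNSETS = [group, social_group, organization, institution,
           medical_institution, icu, hospital_1, hospital_2]


def senses(form: str, synsets: List[Synset]) -> List[Synset]:
    """A word form may have many meanings: return every synset containing it."""
    return [s for s in synsets if form in s.forms]


def hypernym_chain(synset: Synset) -> List[str]:
    """Follow the hypernym relation upwards, collecting the first form of each synset."""
    chain = []
    while synset.hypernym is not None:
        synset = synset.hypernym
        chain.append(synset.forms[0])
    return chain
```

Calling `senses("hospital", SYNSETS)` returns both senses of the word form, which shows the many to many relation between forms and meanings, while `hypernym_chain(hospital_2)` walks the generalisation hierarchy listed at the bottom of the figure.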

                                                Table 2.8
Concepts                                  Classes and attributes. Meta-classes, stereotypes,

Attributes                                Attributes with datatypes and visibility constraints. By its OO nature, UML
                                          supports inheritance of attributes.

Relations                                 Generalisation and n-ary relations with cardinality and specification of roles.
                                          Relations may be defined as a class.

Instances                                 UML class diagrams naturally support implementation of the defined classes,
                                          and thereby instantiation, but not in the ontology sense.

Axioms                                    Not in UML class diagrams directly, but constraints over classes, attributes
                                          and relations may be expressed using the Object Constraint Language (OCL).

Human-Readable presentation               Intended for interactive modelling. The full OO paradigm will require trained

Tool support                              A vast number of tools exist: commercial, open source and research tools.

                              Description of UML according to “ontological facets”

(Cranefield and Purvis, 1999), supported by (Haustein & Plaumann,
2002), present an evaluation of UML as an ontology language and list
the following benefits of using UML:

    A large and expanding user community and extensive tool support.

    Unlike the formal ontology languages, UML supports graphical
     modelling, which is a benefit both for construction and use of the
     ontology: “A graphical representation is important to allow users of
     distributed information systems to browse an ontology and discover
     concepts that can appear in their queries”.

    For systems where the reasoning support is restricted, for
     example to answering specific questions, “UML is a strong

2.8.2.         Ontology tools
In general, the actual storage format for each of the ontologies is not
that important. As languages and formats are constantly evolving, and
since most languages are serialized into XML, the current state-of-the-art
ontology tools can import most of the common languages.
A thorough evaluation of ontology tools is given by (Fensel & Gomez-Perez,
2002). They categorize their selection of tools into:
      Tools for ontology editing and development
      Tools for merging and integration of ontologies
      Tools for evaluation and verification of ontologies
      Ontology storage and query tools
      Ontology-based annotation tools
      Ontology learning tools

Tools for ontology editing and development
The current state-of-the-art in tools for ontology editing is rapidly
evolving. Whereas only 4 tools were listed as generally acceptable[35] at
the kick-off meeting of the OntoWeb project in June 2000, the OntoWeb
survey from 2002 reports on 11 tools in this category. Most of these
tools have grown from smaller, single-user editing tools into more
complex tools with support not only for editing, but also for the
process and methodology of ontology construction.
Some, but not all, also incorporate collaborative support. Most tools
provide a form-based interface for definitions of concepts/classes and
their slots. Some tools offer a graph-based visualisation technique for
viewing and browsing the ontology, but no tools that we are aware of
incorporate direct visual editing.
In general, the collaborative support in ontology editing tools is often
concerned with the integrity of the ontology, in the sense that the tools
support consistency checking, access control and transaction support
for ontology operations. The Onto-edit system (Sure, 2002)
includes a mindmap tool that supports collaborative editing, in order to
provide an informal starting point for ontology development. The created
mindmap is not a proper ontology itself, but is later translated into the
appropriate ontology language for further development. The system
presented by (Domingue, 1998) included support for synchronous
discussion built on top of an existing ontology editor.

     [35] Most of the participants preferred Emacs for their own needs, however.

Tools for merging and integration of ontologies
Tools in this category allow for importing several ontologies – from
possibly different formats – into a shared workspace. The tools then
scan the ontologies for merging candidates and leave the final decision to
the ontologist. Among the techniques used for discovering merging
candidates are linguistic analysis of class and slot names (name
resolution) and value inspection for instantiated ontologies.

Tools for evaluation and verification of ontologies
These are tools that exploit the formality of the ontology languages in
order to check the quality of the developed ontologies. In most cases,
the user can express rules or predicates over the desired checkpoints.
Common tests are syntactic error detection, cycle detection, redundancy
control and pruning of concept hierarchies (based on similarity analysis
or instance analysis).

Ontology storing and querying tools
Instantiated ontologies are knowledge bases. For the Semantic Web,
RDF/S (serialised into XML) emerges as the main data structure for
ontological knowledge. The storage and query tools for this follow the
principles for RDF query tools, previously described in section 2.7.3.

Ontology-based annotation tools
Semantic annotation of web-documents (e.g. Fensel et. al., 1998) has
the purpose of allowing users to semantically enrich web pages, by
tagging them with concepts from a pre-existing ontology. Initially, this
does not require instantiation of the ontology; the annotated documents
become the important instances. The tools in this category somewhat
resemble the tools we describe in the KM systems section (section 2.9).
Later approaches turn this around somewhat, and use an instantiated
ontology in order to build and manage dynamic web-sites (Jin et. al,
2003 ; Jin et. al, 2001). Semantic mark-up in this setting refers to
anything from extraction of contextual meta data and imposing navigational
structures to the mapping of text to ontology concepts. The OntoWebber[36]
system provides a complete methodology for designing web-sites, using
ontologies for every aspect of the design, including page and site
structures, navigational structures, presentations, user profiling as well
as content.

     [36] The Ontowebber System:

Ontology learning tools
Ontology construction is tedious work that requires a lot of effort from
domain experts. The workload can be significantly reduced if the domain
experts can be given a starting point for their work. Therefore several
approaches exist for the learning of ontological data from existing
sources. Tools in this category can be classified according to the input
used for learning (Gomez-Perez & Macho, 2003):
    Learning from text
    Learning from dictionaries and thesauri
    Learning from an existing knowledgebase
    Learning from semi-structured sources (e.g. XML schemas)
    Learning from relational databases
Of these, the approaches to learning ontological data from text are the
most relevant to us. These methods differ in the type of analysis that is
conducted, even though most proposed approaches (Gomez-Perez &
Macho, 2003) apply a combination of linguistic and statistical
techniques:

    Concept learning: Involves linguistic-based extraction of words
     and phrases from text and the organisation of these into

    Conceptual clustering: Statistical clustering methods are used to
     create a hierarchy of concepts.

    Pattern based extraction: Pre-defined patterns are used to define
     “what to learn”, i.e. lexical structures that constitute possible
     occurrences of concepts or relations.

    Association rules: association rules are used to discover non-
     taxonomical relations from texts, given a concept hierarchy as

    Ontology pruning: Some methods do not start from scratch, but
     take as their input existing and/or generic ontologies. Given a
     corpus of domain specific texts, these methods perform text
     analysis in order to remove concepts from the input ontologies
     that are not domain specific.
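
As a minimal illustration of the pattern based extraction mentioned in the list above, the following Python sketch applies one hand-written lexical pattern of the form “X such as A, B and C” and proposes candidate specialisation pairs. The pattern, the function name `extract_pairs` and the sample sentences are assumptions made for illustration; none of the surveyed tools is claimed to work exactly this way.

```python
import re
from typing import List, Tuple

# Hypothetical pre-defined pattern: "<general term> such as <term>, <term> and <term>".
# Only a single word is captured as the general term, to keep the sketch simple.
SUCH_AS = re.compile(r"\b(\w+),? such as ([\w ,]+)")


def extract_pairs(text: str) -> List[Tuple[str, str]]:
    """Return (specific term, general term) candidates found by the pattern."""
    pairs = []
    for general, enumeration in SUCH_AS.findall(text):
        # Split the enumeration on commas and on the conjunctions "and" / "or".
        for specific in re.split(r",\s*|\s+and\s+|\s+or\s+", enumeration.strip()):
            if specific:
                pairs.append((specific.lower(), general.lower()))
    return pairs


sample = ("The domain covers institutions such as clinics, hospitals and pharmacies. "
          "Staff such as nurses and doctors.")
candidates = extract_pairs(sample)
```

Each pair is only a candidate relation. A sentence that continues after the enumeration would make this naive pattern capture too much, which is one reason why a pattern base needs maintenance and why the final decision is left to the domain expert.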
None of the proposed approaches are fully automated. Most methods
take input from some kind of existing taxonomies, and require user
intervention at some part of the process. WordNet is the most commonly
used input for NLP based approaches. Figure 2.17 illustrates such a
learning process. The figure shows natural language processing of terms
from a domain corpus, contrasted with other, non-domain, documents
in order to detect a domain specific terminology. Discovered terms must
in some way be interpreted in order to get from the word level (terms and
phrases) to a semantic level. The OntoLearn method applies WordNet
for this purpose. The interpretation step produces a set of proposed
concepts, before inference rules are applied in order to detect possible
relations between the concepts. As illustrated, the result of the process
is fragments of candidate concepts and relations for the ontology. The
finalisation of the ontology is then left to the domain experts.

                                                 Figure 2.17

     Example ontology learning process from text (adapted from the OntoLearn method; Navigli et. al.,
     2003 ; Missikof et. al., 2002).

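
The first step of the process described above, contrasting a domain corpus with non-domain documents in order to detect domain specific terminology, can be sketched as a simple relative-frequency comparison. This is a crude sketch under stated assumptions: the scoring formula, the thresholds `min_ratio` and `min_count`, and the toy corpora are hypothetical and are not taken from the OntoLearn method itself.

```python
import re
from collections import Counter
from typing import Iterable, List


def term_frequencies(docs: Iterable[str]) -> Counter:
    """Count lower-cased word tokens over a collection of documents."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts


def domain_terms(domain_docs: List[str], contrast_docs: List[str],
                 min_ratio: float = 1.5, min_count: int = 2) -> List[str]:
    """Return terms that are relatively more frequent in the domain corpus.

    Assumed scoring: relative frequency in the domain corpus divided by a
    smoothed relative frequency in the contrast corpus.
    """
    dom = term_frequencies(domain_docs)
    con = term_frequencies(contrast_docs)
    dom_total = sum(dom.values())
    con_total = sum(con.values())
    scores = {}
    for term, count in dom.items():
        if count < min_count:
            continue  # too rare in the domain corpus to be a reliable candidate
        p_domain = count / dom_total
        p_contrast = (con[term] + 1) / (con_total + 1)  # crude add-one smoothing
        ratio = p_domain / p_contrast
        if ratio >= min_ratio:
            scores[term] = ratio
    return sorted(scores, key=scores.get, reverse=True)


domain = ["The patient was admitted to the hospital ward.",
          "The ward nurse examined the patient in the hospital."]
contrast = ["The train left the station on time.",
            "The weather was fine and the station was busy."]
terms = domain_terms(domain, contrast)
```

On the toy corpora, common words such as “the” score close to 1 and are dropped, while “patient”, “hospital” and “ward” are kept as candidate domain terms. The candidates would still have to be interpreted semantically and confirmed by domain experts, as the text explains.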
The process presented in figure 2.17 is representative of most of the
approaches presented in (Gomez-Perez & Macho, 2003). Candidate
terms can be extracted in different ways, both linguistic and statistical,
and the actual selection of techniques is guided by the nature and
language of the domain documents. Commonly applied linguistic
techniques include, for example, part-of-speech tagging, morphological
analysis as well as rule-based grammatical analysis (cfr. e.g. the
Corporum OntoExtract tool (Engels, 2001)).
Relations are harder to detect. Linguistic techniques often yield “lexical”
relations or tend to over-generate as a result of the immense variation
inherent in language. Typical examples of “lexical” relations are
determining the “semantic roles” of entities in a sentence with respect
to the main verb (e.g. the agent or object of an action), or discovering
modifiers[37] for verbs and nouns. Such relations resemble the semantic
relations of Sowa’s conceptual graphs.[38] Discovering relations by way of
predefined patterns also tends to be tedious work, with a large pattern- or
rule-base that must be maintained. Statistical methods, for example
clustering methods, tend to be “rougher” in their treatment of language,
but perform quite well in suggesting hierarchical relations. Both
clustering techniques and analysis of semantic distance between words
can be used to infer generic unspecified relations (associations) (Faure
& Poibeau, 2000).

     [37] Modifiers, such as adjectives and adverbs, are often indicators of specializations.
     [38] In fact, the approach taken by (Roux, 2000) is based on conceptual graphs.

Even though a lot of progress has been made both in the specific field of
ontology learning, and in related fields such as general NLP analysis and
Text Data Mining, extracting conceptual structures from text still
remains a hard task (Faure and Nedellec, 1998 ; Maedche and Staab,
2000). Some general observations on the state-of-the-art ontology
learning tools are given by (Gomez-Perez & Macho, 2003):

    Most tools apply linguistic based techniques to some extent (no
     pure statistical methods are proposed, even if many of the NLP
     techniques can be performed statistically).

    A unified, detailed methodology for ontology learning from text
     does not exist. A significant part of most approaches is to select
     and adapt techniques in order to suit the actual domain corpus
     and the particular goal of the learning process. “In general, we use
     a multi-strategy learning and a result combination approach”
     (Maedche and Staab, 2001).

    The most commonly applied lexical ontology for semantic
     interpretation is WordNet.

    Tools differ in focus, as to what can be learned. Some look for
     lexical evidence of concepts and attributes, while others focus on
     hierarchical and general relations.

    No tools exist for evaluating the learning results or for comparing
     results from different approaches.

2.8.3.    Ontology based document retrieval
As we have presented earlier, there is an increasing trend in IR systems
to lift retrieval to a higher semantic level. As we have seen, techniques
similar to the ones adopted by the ontology community are applied
in IR systems and in KM based retrieval systems, even if not all IR and
KM systems explicitly mention the use of ontologies and ontology
While ontology systems have many purposes, and retrieval in ontology-based
systems tends to query the instantiated ontology (the
knowledgebase) rather than the document collection, some
systems from the ontology community are also targeted directly towards
document retrieval.

SHOE was initially an annotation system (Heflin & Hendler, 2000) that
enables users to mark up web-documents by way of an ontology. Rather
than collecting marked up pages and building a knowledgebase,
however, SHOE adopts a methodology in parallel with traditional IR
systems. A crawler is used to collect and index the annotated web-pages.
The index is stored in a knowledge base. SHOE then provides a
user interface for query formulation which enables users to select an
ontology as a “context” for posing the query. Based on the selected
context ontology, SHOE will then ask users to fill in information
according to the classes and properties in the ontology. If users have
trouble filling in values, the system can retrieve values from the indexed
annotations. The interface supports iterative refinement of the query. If
no relevant information is found, the composed query can be translated
into a regular web-query and posted to a regular web-search engine.
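
The query flow described above (select a context ontology, fill in property values, get help from the indexed annotations, fall back to a regular web query) can be sketched as follows. Everything in this sketch is hypothetical: the ontology, the index entries, the example.org URLs and the function names are illustrative assumptions and do not reproduce the actual SHOE interface or data format.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical context ontology: class name -> property names.
ONTOLOGY: Dict[str, List[str]] = {"Publication": ["author", "year"]}

# Hypothetical index of annotations, as collected by a crawler.
INDEX: List[Dict[str, str]] = [
    {"url": "http://example.org/a", "class": "Publication",
     "author": "Author A", "year": "1999"},
    {"url": "http://example.org/b", "class": "Publication",
     "author": "Author B", "year": "2000"},
]


def known_values(cls: str, prop: str) -> List[str]:
    """Values already present in the index, offered when users have trouble filling in a field."""
    return sorted({a[prop] for a in INDEX if a["class"] == cls and prop in a})


def search(cls: str, filled: Dict[str, str]) -> List[str]:
    """Match the filled-in properties against the indexed annotations."""
    unknown = set(filled) - set(ONTOLOGY[cls])
    if unknown:
        raise ValueError(f"properties not defined for {cls}: {sorted(unknown)}")
    return [a["url"] for a in INDEX
            if a["class"] == cls and all(a.get(p) == v for p, v in filled.items())]


def answer(cls: str, filled: Dict[str, str]) -> Tuple[str, Union[List[str], str]]:
    """Return index hits, or a plain keyword query for a regular search engine."""
    hits = search(cls, filled)
    if hits:
        return ("index", hits)
    return ("web", " ".join([cls, *filled.values()]))
```

For example, `answer("Publication", {"year": "1999"})` is answered from the index, whereas a query with no matching annotation is translated into the keyword string that would be posted to a regular web-search engine.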
On2Broker (Fensel et. al., 1998 ; Fensel et. al., 1999a ; Fensel et. al.,
1999b) follows an approach similar to that of SHOE, but all queries are
posted within a “closed world assumption”, i.e. within the
(Guarino et. al., 1999) argue that for general purpose document
retrieval, “there doesn’t yet seem to be a much better option than some
sort of lazy full text analysis” and instead reduce the target domain to
“a relevant field of information repositories”. Their system, OntoSeek,
applies a combination of ontologies and lexicons to query online product
catalogs and “yellow pages”.
The Ontoquery project focuses on developing methods for ontology
based linguistic analysis of source text and queries, and further
develops this into a system for ontology-based query processing. So far,
their primary concern is the identification and analysis of noun-phrases
(comprising morphological, syntactic and semantic analysis).

2.8.4.    Summing up
Ontologies in the area of information systems have left the philosophical
view of an ontology as “universal knowledge”. Ontologies in information
systems are explicit, shared and to some extent agreed
conceptualizations over a particular domain. As such, ontologies are put
to use in particular applications, for example to support information
exchange and retrieval.
Current ontology languages offer mechanisms for formal definitions of
vocabularies. In the context of the Semantic Web, this is intended to
define the semantics of the lower layers, i.e. the meta data statements
expressed in RDF/S. For the Semantic Web and other “intelligent
applications”, particular requirements on the ontologies are put forth,
namely that they are formal, support reasoning, and are in a
machine-readable format. As a result of these requirements, some of the
languages we have presented are not recognised as proper ontology
languages, even if they are used as ontologies in particular domains
(e.g. WordNet, TopicMaps and UML).
From a Semantic Web perspective, ontologies over RDF offer formal
semantics, reasoning support and mechanisms to detect and handle
A problem with the formal machine-readable ontologies is the fact that
shared, explicit semantics is a matter between humans, not machines.
The definitions of concepts have to be agreed, negotiated and to some
level understood by the human users of the ontology, if the notion of
shared, agreed semantics is to be fulfilled to the benefit of users.
Formal, machine-readable languages do not prohibit human
understandable ontologies, but the current proposals are difficult to
read for most users, which can become an obstacle. The ontology
languages proposed for the Semantic Web that are layered over RDF/S,
suffer in particular from an awkward and complex syntax due to the RDF
serialisation. Some recent proposals have argued for a pruning of RDF
into a pure meta data storage language, and to resolve features that are
overlapping between RDFS and ontology languages, in order to get a
cleaner syntax (Horrocks et al, 2002).
The Semantic Web has received criticism for not really being semantic
(cfr. Butler, 2003), as “semantics are beyond the scope of the
computer”. Following this line of criticism, the ontology layer does not
offer more than a formal schema for meta data statements.

2.9. Knowledge Management Systems
We have defined the scenarios we are targeting as “Collaborative
document management”. Collaborative aspects of information sharing
are more in focus in approaches from CSCW and Knowledge
Management than in the IR and Semantic Web approaches we have
presented earlier. Knowledge Management systems, in particular,
present a connection between advanced retrieval techniques and
collaborative sharing of information.
Knowledge Management (KM) is defined as

     “… Knowledge management focuses on knowledge as a crucial
       production factor and consists of activities that aim at optimal
       use of knowledge, now and in the future. KM determines which
       knowledge, where, in which form and at which point in time,
       should be available within an organization, company or network
       of institutions. It employs a broad spectrum of techniques and
       instruments to improve the performance of knowledge
       operations and the learning capabilities of a system …”
                                            (Spek and Spijkervet, 1997)

The term Organisational Memory (OM) has come to be a close partner of
KM, denoting the actual content managed by a KM system and may be
defined as:
      “Organisational memory is an evocative metaphor, suggesting the
        promise of infinitely retrievable knowledge and experience. (…)
        Answer garden [a specific OM system] supports OM in two ways:
        by making recorded knowledge retrievable and by making
        individuals with knowledge available”
                                                      (Ackerman, 1994)
Other definitions, for example (van Heijst et al, 1998), (Conklin, 1996)
or (Schwartz, 1998), put more emphasis on the technological aspects of
an OM, i.e. on the knowledge storage and retrieval system. In particular,
(Schwartz, 1998) emphasises the use of meta data necessary to enable
precise retrieval of relevant knowledge.
A pure technological focus of KM is clearly not appropriate and would
leave out a number of important facets of knowledge, in particular the
human and tacit aspects, as pointed out by (Von Krogh et. al., 2000):
      “The real managerial challenge is enabling knowledge creation;
        capturing its by-product, information, is the easy part.”
Even if this is inadequate for getting the whole picture of KM, our interest in
this thesis lies within information management systems, and the
presentation here will focus on technologies and approaches related to
this part of KM.
Because of the ambiguous nature of knowledge, KM tools come in a
variety of forms and focuses. A classification is provided by (Borghoff
and Pareschi, 1998), who define the technological parts of an
architecture specifying what they denote as “corporate memory” (figure 2.18).

                                         Figure 2.18

       A corporate memory architecture for Knowledge Management (Borghoff & Pareschi, 1998).

The parts of the corporate memory architecture are defined as:
       1. Knowledge repositories and libraries: Tools for handling
          repositories of knowledge in the form of libraries (heterogeneous
          document repositories, search, access, integration and
          management, Directory and links, etc.)
       2. Communities of knowledge workers: Tools to support
          communication of practice in work (awareness services, context
          capture, shared workspaces, knowledge work processes support,
          etc.)
       3. Knowledge cartography: Tools for mapping and categorizing
          knowledge (domain specific concept maps, maps of people’s
          competencies and interests, etc.)
       4. Flow of knowledge: The glue that binds the parts together by
          “using knowledge, competencies and interest maps to distribute
          documents to people”.
The focus in this thesis lies within the boxes 1 (repositories and
libraries) and 3 (cartography) above.
In the presentation that follows, we divide the techniques into IR
driven approaches, contextual meta data driven approaches and
collaborative approaches.

IR Driven KM systems
Again, the basis of KM retrieval systems is traditional IR techniques and
machinery. However, most vendors of KM systems (such as CognIt[39],
Verity[40], Intel, Inxight[41], Convera[42]) emphasise a focus on concept based
and “more semantic” retrieval. More specifically, they all apply a
selection of the linguistic techniques for IR that we have presented
earlier (section 2.6) in order to lift their retrieval towards a higher
semantic level. A good example is Inxight, whose selection of “value
adding” retrieval techniques includes text categorization, text
summarization, document clustering, thesauri based text
transformation, language detection, stemming and compound word
analysis with noun phrase detection (Inxight 2002). Retrievalware from
Convera uses a predefined semantic network of 500,000 words and 1.6
million relations to pre-process and expand queries into “concepts”.
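
Expanding a query by way of a semantic network, as described above, can be illustrated with a very small sketch. The network contents and the function name `expand_query` are hypothetical and are used for illustration only; this is not the algorithm of any vendor mentioned in the text.

```python
from typing import Dict, List, Set

# Hypothetical fragment of a semantic network: term -> directly related terms.
SEMANTIC_NET: Dict[str, Set[str]] = {
    "hospital": {"infirmary", "clinic"},
    "care": {"treatment"},
}


def expand_query(terms: List[str], net: Dict[str, Set[str]]) -> List[str]:
    """Add the directly related terms of each query term, without duplicates."""
    expanded: List[str] = []
    for term in terms:
        # Keep the original term first, then its neighbours in a stable order.
        for candidate in [term, *sorted(net.get(term, ()))]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded = expand_query(["hospital", "care"], SEMANTIC_NET)
```

The expanded term list can then be passed to an ordinary IR engine. A real system would also weight the added terms, for instance by their distance in the network, so that they do not dominate the original query.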
Some KM systems also report on the use of ontologies and agents. In
IR based systems, ontologies are used as a form of advanced thesaurus
in order to conceptualize retrieval. Other efforts, for example the
COMMA[43] and DÉCOR (Abecker et. al, 2000) projects, report on the use
of ontologies as the basis for knowledge exchange in a multiagent
architecture. For tasks similar to ours, COMMA utilizes two agents: a
document agent that tags new documents based on meta data
extraction and matching against the ontology, and a user-agent that
holds a model of the user's interests and filters and profiles information
updates and presentations to the user.

     [39] Cognit web site:
     [40] Verity web site:
     [41] Inxight web site:
     [42] Convera web site:

Some systems, such as the Intelligent Miner for text from IBM and the
Knowledge Server from Autonomy, take text analysis techniques one
step further and report on the use of Text Data Mining or text analysis
in connection with data mining systems in order to “discover new
knowledge from text” (Hearst, 1999). A focus for these systems is to
extract knowledge from unstructured sources and to feed this into their
“regular” data mining, business intelligence or decision support systems.
The navigation features in retrieval based KM systems are again similar
to traditional IR systems. However, there is somewhat more focus on
improved presentation of search results. Several systems report on the
use of text categorization and text summarization to provide users with
an overview of the retrieved documents. (Ando et. al, 2000) presents a
scheme for multidocument summarisation, that “blends navigation and
summarisation”, by visualising the topic and sub-topic structure
discovered during summarisation and allows users to navigate between
topics, summaries and documents.
Convera reports on alternative ranking schemes that actively use the
semantic map in order to rank retrieved documents for “contextual
evidence” and “semantic distance”. Most systems that present their
search results in the traditional list manner use more contextual meta
data in the presentation. Few systems take on retrieval based KM from a
pure visualization perspective; however, ThinkMap and TouchGraph are
examples of graph based document navigation systems. In ThinkMap
the interactive graph can be reorganized by the user, and is used to
accommodate users with “a goal somewhere between search and
browsing”. SHriMP (simple hierarchical multi perspective) (Storey et. al,
2002) is a generic visualisation technique for “exploring complex
information spaces”, that uses a nested graph view of hierarchically
structured information and allows users to navigate through multiple
views at different abstraction levels. SHriMP is integrated with the
Protégé 2000 ontology environment.44
43 The COMMA project: Corporate Memory Management through Agents:
44 Protégé web site:

IR based KM systems are technologically advanced in the sense that
they all apply “semantic driven” retrieval techniques and are able to lift
retrieval to a higher semantic level than what is common in today's Web
retrieval machinery. However, from the point of view of collaborative
document management, they do not enable users to engage actively in
the classification and sharing of their documents:

    Concepts are defined mostly by automated extraction from a
     document collection. Users do not participate in concept
     definition and specification of concept semantics.

    In many cases, systems are not using domain specific language
     analysis, but rather general lexical resources, such as the
     semantic map used in Overta. However, the analysis behind the
     concept based retrieval tools is similar to the techniques used
     for learning ontologies, and some tools, like the Corporum tools
     from CognIt and Inxight Smart Discovery, support domain specific
     thesauri generation.

    Most systems allow users to define the high-level text categories
     applied in categorization.

    The thesauri and semantic nets are not primarily focused towards
     user viewing, but are rather part of the internal retrieval
     enhancing machinery. However, some systems actively use the
     concept network as a navigation feature. The Inxight Star Tree
     viewer is a powerful example of such.

    Users can in most cases affect the way information is presented
     to them. Most systems allow for repeated searching, query
     expansion and related searches. Such techniques are also available
     in web-retrieval systems, but are seldom used except by a small
     group of advanced users. Within enterprise search systems, these
     techniques may be more welcomed by users. Systems using
     agents allow the specification of information (request) profiles and
     user defined filtering and presentation of discovered “knowledge”.
In sum, the IR driven KM systems can be viewed more as advanced or
semantic based retrieval than as enablers of the cooperative sharing of
information that is emphasized in the CSCW definition of common
information spaces. Even if some of the necessary technology is in
place, no explicit measure is taken to establish the
“shared agreement of meaning” over the document base, as
emphasized by (Schmidt and Bannon, 1992; Bannon and Bødker,
1999).

Contextual meta data driven systems
Contextual meta data is not the focus of this thesis, but the meta data
approaches in KM systems are more in the direction of shared
information spaces than the IR driven ones. Also, these two lines of
approaches are complementary. In fact, NLP techniques are increasingly
used in order to discover and extract meta data from documents.
Coarsely, we may divide the meta data schemes according to how they
attempt to organize underlying documents:
   1. Documents in process: Knowledge is interesting for workers
      only because it assists them in their current tasks. Thus many
      meta data approaches try to connect documents to processes,
      tasks and to specific (and recurring) problems. Examples are
      Alpha Project and Process guides (Dingsøyr, 2002), CHAR (Angele
      et al., 2000), DÉCOR (Abecker et al., 2000) (Workflow oriented),
      Alpha Well-Of-Experience (Dingsøyr, 2002), Answer Garden
      (Ackerman & McDonald 1996) (common problems, diagnostic
   2. Documents and people: Knowledge is embedded in humans.
      Documents contain information, at best explicit knowledge. Some
      meta data schemes therefore strive to enable users to find the
      appropriate people with knowledge on a given topic. Examples are
      Alpha Skills manager and Expert Seeker (NASA) (Dingsøyr, 2002).
      The connection from documents to people is through the fact that
      people who (currently) work, write, publish and use documents on
      a given topic are the present experts on this topic.
   3. Documents in interaction: Knowledge is exchanged in interaction
      between people. The conversation for action perspective (Winograd
      & Flores, 1986) does not only portray the interaction itself, but
      illuminates and explains the rationale behind decisions and actions, see
      e.g. PromisE2 (Hoffman & Herrmann, 2002).
   4. Documents in Use: Documents are primarily interesting for their
      use in support of work. Active documents, that is, documents
      under production or in active use in ongoing projects and
      activities, are interesting since they are of current relevance.
      Systems that rank and present documents according to use make
      up an interesting comparison with the celebrated Google PageRank
      algorithm. Often used (or cited) documents are assumed to have more
      value, or to be more interesting, than other documents on that topic.
      Some KM systems achieve this by using collaborative reviewing of
      documents or simply by ranking schemes based on usage logs.
      An example is the per topic “Top 10 intellectual capital” list of
      frequently used documents in IBM’s K Portal.
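The usage-based ranking mentioned in the last item can be illustrated with a small sketch. It is not taken from any of the cited systems. The log format (pairs of document name and topic) and the function name `rank_by_usage` are assumptions made purely for illustration.

```python
from collections import Counter


def rank_by_usage(usage_log, topic, top_n=10):
    """Return the top_n most frequently used documents for one topic.

    usage_log is a hypothetical list of (document, topic) access records.
    """
    counts = Counter(doc for doc, doc_topic in usage_log if doc_topic == topic)
    # Sort by descending access count, then by name so that ties are stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top_n]


# Illustrative access records, one per document use.
log = [
    ("a.doc", "drilling"),
    ("b.doc", "drilling"),
    ("a.doc", "drilling"),
    ("c.doc", "safety"),
]
top = rank_by_usage(log, "drilling")
```

A per topic "top 10" list then amounts to calling the function once per topic. Collaborative reviewing would replace the raw access counts with explicit user ratings.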
As pointed out by early papers on meta data (Uschold, 1996; Weibel et
al., 1995), meta data schemes need to be properly constructed and
tailored in order to be useful in a given setting, and the examples given
above merely illustrate this point by showing the variety of meta data in
use in different situations.

Cooperative driven systems
Some of the presented KM systems – in particular the IR driven systems
– offer support for sharing of information, but do not explicitly support
cooperation and collaboration. CSCW and groupware systems are by
nature more directly focused on supporting cooperation. Sharing of
information is considered a prerequisite for cooperation (Grudin, 1994),
and is supported by most groupware systems.
While several metaphors for cooperation support exist, our interest
here is closest to the approaches that strive to enable a persistent
shared (electronic) space for collaboration. People working at the same
location often enable or rely on physical spaces, to support cooperative
work, such as meeting rooms, project rooms and even shared libraries.
The electronic or computerized versions of these often try to resemble
the virtues of a physical space, building on metaphors such as “rooms”,
“places” and “workspace”:
         “Places organise work; for example a group can (…) use this place
            to meet, work on tasks both collectively and individually, to store
            project artefacts, and to leave project information for others”
                                                         (Roseman and Greenberg, 1997)
Information sharing approaches within the electronic places also
resemble those of their physical counterparts, building on folders,
binders, books, or variations of the Mac OS/Windows desktop metaphor.
Systems like BSCW45, ICE (Farshchian, 1998) or FirstClass46 use a
“desktop-like” metaphor with folders as the main way of organizing
documents. These systems often adopt a “free-hand” approach to meta
data, offering a small set of contextual metadata attributes as basis for
document storage, but with support for adding own attributes and
relying on freely selected key-words, free-text descriptions, or
annotations and user comments for the “semantic” classification or
content description. TeamWave Workplace47 (Roseman & Greenberg,
1997) used a "concept map"-tool, where users could collaboratively
define and outline concepts and ideas as a way of structuring the
discussion. There was however no explicit way of utilizing this concept
graph in the classification of information. An approach to this is found in
the ConceptIndex system (Voss et. al, 1997), where concepts are
defined by attaching them to phrases – or text fragments – selected
from their occurrences in the text. This way the concept-definitions also
serve as an index of the text fragments. The concepts are only defined
by their various appearances and the actual “concept model or concept
map” is not visible in the approach. The Navigational Brain allows users
to manually drag and drop their desktop files onto a free-hand drawn
mind map. The Topic Map ISO standard (Pepper and Moore, 2001)
offers a way of linking various kinds of “instances” (files, pictures,
text-fragments, etc.) to a topic, and then navigating in this material by
following associations between the topics. Topic maps may be stored
using SGML and XML.

45 BSCW Web site:
46 FirstClass collaborative groupware communication platform;
47 TeamWave Workplace was an early research product building on the place metaphor that offered a
 flexible and extensible toolset, among others the “concept map” tool, a “stuff in common” tool etc. The
 system grew into a commercial product and is now a part of the Sonexis netCollaborator and Conference
 Manager applications (
To facilitate cooperation in the electronic workspaces, these systems
further offer a range of tools, such as shared calendars, meeting support
(chat, conferencing, on-line voting and discussions), application sharing,
collaborative writing and editing, shared whiteboards etc. In particular,
many of these systems offer awareness mechanisms that enable users
to become aware of their collaborators' activities and keep track of
updates and changes to information and any new information being
added to the workspace (Gutwin et al., 1996).
“Stuff in common” is central to the shared workspace. The term is
actually the name of a tool within the TeamWave portfolio of
applications, but we use it here to refer to approaches to support
collaborative collections of documents and links to information within
these applications. As mentioned, the basic support provided for this is
to provide a desktop like folder hierarchy. More advanced solutions are
found in systems like DynaSites48, which offers a set of tools for managing a
shared information space. Tools in DynaSites include DynaGloss (a
collaborative, editable glossary, i.e. a list of terms, where users may
provide definitions of terms, annotate them, edit definitions and include
pointers to further reading) and DynaVirtualLibrary (a shared bookmark
system where pointers to web information are collected and can be
annotated for the user community). WebGuide (Stahl, 2000), referred to by its
author as a “knowledge building environment”, is a system where users
collaboratively discuss topics and organise them into perspectives. Links
to documents and user provided notes are organised within the
discussions. The Groove Shared Workspace System49 contains a shared
active bookmarks tool with awareness support for keeping its users alert
to changes to information within their workspaces. The Annotea project50
is a framework for “comments, notes, explanations, or other types of
external remarks that can be attached to any Web document or a selected
part of the document”. An early implementation of Annotea is found in the
Amaya51 collaborative web-site editor, which among other things also
includes a shared annotatable bookmarks service. The BSCW52 system,
which has supported awareness information from the beginning, has later been
augmented with AwarenessMaps (Gross et al., 2003) in order to provide
an easier way of keeping track of all changes. AwarenessMaps sports a
DocumentMap tool – a visual schematic overview of all shared
documents, with color codes indicating the documents' status.

48 “DynaSites Year I report”, NSF IRI-9711951 (L3D OMOL). Project web site at:
49 Groove web site:
50 (Accessed June 2003)
51 (Accessed June 2003)
Shared document or bookmark collections with annotations somewhat
resemble the collaborative ranking approaches that we presented briefly
in the meta data driven approaches. Distinctions are made between
Collaborative Filtering and Ranking, Collaborative Indexing and Social Data
Mining:

    Collaborative Filtering and Ranking is sometimes denoted
     recommender systems, since the purpose is to collaboratively rank
     and then recommend information or documents, with the implicit
     suppression (filtering) of others. Such approaches are visible today
     in commercial web sites, like the Amazon bookstore or the
     Internet Movie Database, where the purpose is to recommend
     items based on the user's taste as discovered from previous
     actions and based on selections from other users with similar
     action profiles. Bringing such techniques over to working
     communities within an organisational setting, systems like
     PolyLens (O’Connor et al., 2001) and Knowledge Pump (Glance et
     al., 1998), provide a domain structured or topic specific
     recommendation service based on matching of user profiles of the
     colleagues. In Knowledge Pump’s shared bookmark service a user
     can explicitly devise “advisor lists”, i.e. colleagues they
     particularly look to for advice on a given domain or topic.
     Collaborative filtering and ranking systems rely on automated
     statistical classification algorithms, such as Bayesian decision
     networks (Breese et al., 1998), Latent Semantic Indexing
     (Goldberg et al., 2000; Sarwar et al., 2000) and clustering
     techniques (Ungar and Foster, 1998), thus bridging classification
     techniques found in IR systems with collaborative
     information space systems. Some of the commercial web search
     engines, such as AllTheWeb and Google, also record user actions
     by examining the click-through log, which shows which documents
     users actually select from the ranked list of search results, and are
     working to build such user selections into the ranking algorithms.

    Collaborative indexing is a variant of recommender systems where
     users take a more active part in the recommendations. Where
     collaborative filtering may be performed automatically, simply by
     recording user actions and matching user profiles, collaborative
     indexing is an approach to collect index data from a community of
     users and then share the result, and where users are encouraged
     to annotate and describe documents (Park and Chon, 1995 ;
     Goldberg, 2000).

   Social Data Mining (Amento et al., 2003; Terveen and Hill, 2001)
    is – unlike what the name indicates – an automated approach to
    harvest and aggregate information from shared active documents
    within a community, but with a focus on the community actions,
    not the individual user. Even if automated, systems such as the
    TopicShop system provide a user interface for browsing, where
    users are allowed to categorise and to create “personally
    meaningful organisations” of information, in order to be able to
    record users' actions when browsing and organising documents.
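The profile matching idea behind collaborative filtering can be sketched in a few lines. The cited systems rely on Bayesian networks, Latent Semantic Indexing or clustering. The sketch below deliberately swaps in a much simpler technique, Jaccard similarity between the sets of documents that users have selected, and all names (`jaccard`, `recommend`, the profile data) are hypothetical.

```python
def jaccard(a, b):
    """Similarity of two users, measured on the sets of documents they selected."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, profiles, top_n=3):
    """Recommend documents the target user has not seen, weighted by
    how similar the recommending colleagues are to the target user."""
    seen = profiles[target]
    scores = {}
    for user, docs in profiles.items():
        if user == target:
            continue
        similarity = jaccard(seen, docs)
        for doc in docs - seen:
            scores[doc] = scores.get(doc, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    # Documents that come only from completely dissimilar users are filtered out.
    return [doc for doc, score in ranked if score > 0][:top_n]


# Illustrative action profiles: which documents each user has selected.
profiles = {
    "ann": {"d1", "d2", "d3"},
    "bob": {"d1", "d2", "d4"},
    "cat": {"d5"},
}
suggestions = recommend("ann", profiles)
```

An explicit "advisor list", as in Knowledge Pump, could be approximated in this sketch by restricting the loop to the colleagues a user has named for a given topic.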
Our coarse classification of KM systems may have indicated that
groupware systems are smaller, more lightweight systems than, for
example, the previously presented IR driven approaches. That is not
necessarily the case. A well known example of a large integrated
groupware solution is the Lotus Notes Domino server with its extensions,
which offer holistic support for cooperation through a wide set of
possibilities: e-mail, shared calendars, templates for constructing
shared document bases and for building intranets, as well as
company-wide document management and content and knowledge
management applications. Large companies such as Norsk Hydro, Statoil
and consultancy companies Accenture and Cap Gemini apply Lotus
Notes as their main vehicle for sharing of information and collaborative
working on projects. Statoil uses Lotus Notes both as a part of their
company wide document archive (“Elark”) as well as to support
document management in active projects. In their implementation, all
documents for an active project are connected to an ongoing activity in
the project.

2.10. Summing up
We have presented a set of related work, according to the various
ongoing research aspects in document retrieval. As we have seen, the
vast amounts of documents people and organisations are exposed to
these days make this an area of much focus.
In IR systems, a lot of work is done in improving the indexing and result
presentation parts of retrieval, and in particular, to apply linguistic
techniques in order to lift automated indexing from a purely syntactic
technique towards concept level. In large-scale IR systems, such as
general web-search systems, however, the focus is put on automated
and computationally feasible solutions.
The Semantic Web (SW) and its related projects provide work on a host
of relevant issues regarding ontology definition and instantiation, as well
as corresponding tools for their applications. SW proponents tend to
disregard conceptual modelling languages, primarily due to their lack of
formality, machine interpretability and reasoning support. These are
issues needed in SW’s quest for enabling intelligent applications and
information services.

A user-centered and cooperative focus to document descriptions is
found in the CSCW-approaches to knowledge management and
organisational memories. In general, these approaches tend to put more
focus on contextual meta data and also more informal document
descriptions such as free-text annotations or user-ranking.
With such an amount of related work going on, is there still room for yet
another line of approach? Or is the huge effort put into this area simply
evidence that this is still an area where solutions are not evident and
several open issues exist?
We have found few approaches that provide users with tools for
domain modeling and thereafter offer immediate applicability of the
domain model in document retrieval. Our proposed approach is
motivated by our belief in conceptual modelling languages as a vehicle
for defining domains at an intermediary level between human and
computer understandability. We find it interesting to experiment with an
approach that exploits the virtues of conceptual modelling in
cooperative document management settings. Such an approach has a
close resemblance with the use of retrieval thesauri in digital library
approaches, and our approach will draw several principles from such
work. Compared with thesauri, conceptual modelling languages in
general offer a higher level of formality as well as a graphical notation
that could make them more directly applicable when devising a user-
oriented tool for document description and retrieval.

       3. Semantic modelling of documents – the approach

We propose a document retrieval system that brings together semantic
retrieval and controlled vocabularies53. Particular to this system is the
way conceptual modelling and linguistic techniques are combined to
create conceptual models that serve as controlled vocabularies and
are applied in the tasks of describing and retrieving documents.

3.1. The Cooperative Semantic Document Retrieval Model
An information retrieval model comprises three main parts (Baeza-
Yates & Ribeiro-Neto, 1999):
      1. A set of users, from whom we get a set of queries.
      2. A set of documents, from which we construct a set of document
         descriptions.
      3. A retrieval system that is responsible for storing the constructed
         document descriptions and retrieving the set of relevant
         documents for a given query.
                                            Figure 3.1

At an abstract level, an IR model has 3 parts: A set of documents from which we construct a set of
document descriptions, a set of users from which we get a set of queries and an IR system responsible for
storing document descriptions and retrieving relevant documents for a query.

53 The overall approach and its implementation have previously been published in (Brasethvik, 1998;
 Brasethvik & Gulla, 1999 and Brasethvik & Gulla, 2002)

We characterize our approach as cooperative semantic document retrieval.
The motivation for using this term, also used in (Voss et al., 1999), is
twofold. First, the sharing of documents for an extended audience is
viewed as a cooperative effort in itself. Second, cooperative document
management denotes a situation where the users themselves have to
perform the document management tasks they need in order to support
their own activities. This is in contrast with ordinary retrieval systems,
where users normally only take on the role as searchers. The cooperative
effort puts an extra demand on the users. In our approach, they both
have to participate in the definition of the domain vocabulary as well as
in the classification of documents. A fundamental assumption behind
our approach is that this extra effort can be demanded of users, within a
limited controllable domain. The same added effort is demanded in the
Semantic Web initiative, where users are asked to participate in the
definition of the ontology of their domain, as well as in the semantic
mark-up of their Web resources. This is considered the only feasible,
scalable approach to inferring semantics on the vast amount of
information available on the Web.
With semantic retrieval, the emphasis is put on domain-specific
semantic document descriptions and query expressions. We have
developed a retrieval model that applies a domain model as the
common basis for both document descriptions and queries. Semantic
descriptions are information intended to capture a document's subject.
Subject is defined as:
     (1): something concerning which something is said or done <the
     subject of the essay> (2): something represented or indicated in a
     work of art (3): the term of a logical proposition that denotes the
     entity of which something is affirmed or denied; also :the entity
     denoted (4): a word or word group denoting that of which
     something is predicated (5): the principal melodic phrase on which
     a musical composition or movement is based.
                                                 (Merriam-Webster, 2003)
Most traditional ad hoc retrieval systems use a syntactic approach to
retrieval. Queries are posted as terms or near natural language phrases
and matched either against the full text of the stored documents, or
against an index of terms and phrases. While modern information
retrieval systems adopt advanced linguistic and statistical techniques to
lift their retrieval to a higher semantic level (i.e. a concept level – cfr.
section 2.5), in most of these systems, this conceptualisation takes
place behind the scenes. There are no explicit mechanisms that allow
users to share their interpretation of the concepts.
Operating within a limited subject domain, a fundamental assumption
behind our retrieval model is that the domain can be characterised by a
locally constructed and well-defined vocabulary of concepts – the domain
model. In our approach, this domain model represents the controlled
vocabulary to be used for classification and retrieval. When using
controlled vocabularies, most of the work is put into the definition of the
concepts in the vocabulary. A well-defined vocabulary becomes a
tangible representation of the subject domain of the documents, and
can be applied directly in semantic description and retrieval of
documents. Vocabularies are often created and expressed in their own
“subject languages” and are useful in their own right:
      Using these [subject languages] to retrieve information provides a
        value added quality, can transform information into knowledge.
        When this happens, the subject language becomes an analogy
        for knowledge itself.
                                                            (Svenonius 2000, p 127)
With controlled vocabularies, the solution to the language problem in
information retrieval lies in the definition of the vocabulary, the
representation of the vocabulary and the way this vocabulary is put to
use within the retrieval system (figure 3.2):
   The definition: Analogous to the goal of ontologies, the controlled
    vocabulary should be capable of defining the explicit semantics of
    the domain, shared among its users. The semantics of a concept
    for a user is fundamentally based on his interpretation of the
    concept. The only way a community of users, even within a limited
    domain, can be assumed to arrive at some level of shared
    interpretation of the vocabulary would be if they were allowed to
    participate themselves in its definition.
   The representation: The controlled vocabulary must constitute an
    explicit representation of semantics, accessible for its users. If the
    vocabulary is represented in a form or manner not understandable
    for its users, it cannot convey any semantics.
   Document description: The controlled vocabulary must be
    directly useful for expressing the subject of a document and its
    representation within the IR system.
   Query formulation: Again, the vocabulary must be directly useful
    for expressing queries to the IR system.

The last two points above are by nature implicit when using
controlled vocabularies. However, they are worth pointing out here, since
within our cooperative retrieval model, the idea is that the users
themselves must be able to describe documents by way of
this vocabulary, while in library systems, this is performed by skilled
and trained librarians.

                                       Figure 3.2

                              Aspects of a controlled vocabulary.

3.2. A conceptual modelling based approach
Our model-based IR model is presented in figure 3.3. The two main
principles behind this approach are:

    The domain can be described by its users in a conceptual domain
     model. The domain model will contain the definition of concepts,
     their properties as well as hierarchical and named relations.

    Document descriptions and queries can be formulated by the
     users as possibly instantiated fragments of this domain model.
     Each document is represented as a set of references to the
     concepts and relations in the model. In order to increase the
     precision of the description for the particular document, the
     model fragment can be instantiated. Furthermore, also queries
     must be represented as domain model fragments. Users may
     start out with a natural language query, which is then transformed
     into a model fragment. Query refinement is done graphically by
     selecting and deselecting elements of this extended domain
     model.

                                            Figure 3.3

The model based IR model: The domain is represented via a domain model. The domain model constitutes
the controlled vocabulary. Document descriptions and queries are represented as domain model fragments.
An IR system is used to store document descriptions and retrieve and rank relevant documents for a query.
In order to use standard IR machinery, we must define an interface between domain model fragments and
the IR system.
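To make the notion of domain model fragments concrete, the following sketch shows one possible way of representing document descriptions and queries as fragments and matching them. It is not the interface specified later in this thesis. The class `Fragment`, the functions `overlap` and `retrieve`, and the scoring rule (the share of query elements that are covered by a description) are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A fragment of the domain model: referenced concepts and named relations."""
    concepts: frozenset
    relations: frozenset = frozenset()  # (source, name, target) triples


def overlap(query, description):
    """Share of the query's model elements that the description also refers to."""
    wanted = query.concepts | query.relations
    offered = description.concepts | description.relations
    return len(wanted & offered) / len(wanted) if wanted else 0.0


def retrieve(query, descriptions):
    """Rank documents by overlap with the query fragment, best match first."""
    scored = [(overlap(query, frag), doc) for doc, frag in descriptions.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc for score, doc in ranked if score > 0]


# Illustrative document descriptions expressed as model fragments.
descriptions = {
    "report.doc": Fragment(
        frozenset({"Project", "Budget"}),
        frozenset({("Project", "has", "Budget")}),
    ),
    "memo.doc": Fragment(frozenset({"Project"})),
    "menu.doc": Fragment(frozenset({"Canteen"})),
}
query = Fragment(frozenset({"Project", "Budget"}))
hits = retrieve(query, descriptions)
```

Because both descriptions and queries use the same vocabulary, refining a query graphically corresponds to adding or removing elements of the query fragment and ranking again.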

3.2.1.    The use of conceptual modeling
Models are needed whenever knowledge is to be communicated to
others and examined independently of the knowledge originator (Bunge,
1998). Modelling languages provide the necessary mechanisms for
converting personal knowledge into publicly available representations.
Given appropriate tool support, modelling languages are in their very
nature intended to support the shared construction of domain
semantics. The process of modelling is quite similar to the construction
of domain vocabularies for document management. Conceptual
modelling languages allow us to define concepts, their properties and
interrelations, and to communicate this information to the stakeholders.
We emphasise the following aspects of a modelling language to be used
for document retrieval:

   Concept based modelling. We will apply a conceptual modelling
    language from the Structural modelling perspective (cfr. e.g.
    Krogstie, 1995).

   Clean, graphical presentation. In order to increase readability, the
    ability to navigate in and the ability to get an overview of the
    domain model, the selected modelling language should have a
    clean graphical presentation. It should be possible to choose
    between different presentations of the model, depending on the
    user's level of interest and familiarity with modelling. We will
    specify view-mechanisms (e.g. Seltveit, 1994) to be able to
    provide variations in presentation of the domain model.

   Mechanisms to handle heterogeneity: In a complex domain, with
    many stakeholders, it will not always be feasible to reach a
    complete agreement on semantics for all concepts. The selected
    modelling language will need to be able to incorporate means for
    handling and representing this heterogeneity.

   IR model interpretation: In our approach, model fragments will be
    the “index-units” for documents. Hence, we must be able to
    devise a document-oriented interpretation and representation for
    domain model fragments that corresponds to the underlying IR
    model and system.

3.2.2.    The role of automation
Clearly, such an approach implies added effort by the users. Modeling is
a tedious and mainly manual task that also requires competent users.
Within our approach some of the contribution lies in the cooperative effort
to semantic retrieval. Hence, we would not argue for the full automation
of this work. However, we can devise an approach to facilitate the users’
tasks, by automating parts of the process. In particular by:
    Devising a document analysis process that can provide input to
     the domain modelling.
    Providing suggestions for users when describing documents.

This will leave the important steps of the process to the users, i.e. the
actual semantic definition of the domain (the modelling) and the actual
description of a given document. In our approach we will devise a
flexible document analysis process, based on a combination of linguistic
techniques and statistical analysis in order to provide input to the
modelling process. Furthermore, we will also support the construction of
a domain model lexicon, that provides a textual grounding of the
concepts, and that can be used in providing suggestions to the users
when describing a document.

3.2.3.    Textually grounded model concepts
We need a modelling language that is suitable for descriptions. For this,
we need concepts in the model that are textually grounded. That is,
there must be a short path from the text in the document, to the
concepts in the domain model. In ordinary classification, users read the
text of a document and look for prominent terms and phrases to
describe its content. In a similar manner, a user relies on the text
of a document while looking for concepts in the domain model to
describe its content. If there were a big gap between the concepts that
users would have in mind after reading the documents and the concepts
available in the domain model, the task of describing the document
would be difficult.
Using input from the document analysis process in modelling would
facilitate the selection and representation of concepts that are connected
to the text of the documents. Furthermore, a domain model lexicon that
connects concepts to terms and phrases that appear in the domain
documents would provide a link from model concepts to document text.
This will enable us to analyse a given document and to suggest
concepts for classifying it.

3.3. Overall specification of the approach
The functionality of the proposed document system includes three
fundamental processes:

    The construction of the domain model on the basis of selected
     documents,
    The classification of new documents using linguistic analysis and
     conceptual comparison, and
    The retrieval of documents, including linguistic pre-processing
     and model-based refinement of queries.

In the sequel, we present these processes in some detail. The complete
specifications are given in the following chapters (chapters 4-6).

3.3.1.        Constructing Domain Models from Document Collections
Figure 3.4 shows the proposed modelling process. If no document
analysis or other automation is used, only the middle step,
i.e. the actual modelling, is carried out.
       1. The first step of the process is to run the document collection
          through the linguistic and statistical word analysis. The purpose is
          to propose a list of concept candidates and relations for the
          manual modelling process. The domain specific document
          collection is analysed together with a contrast (or reference)
          collection, in order to be able to discover the terminology that is
          particular to documents from this domain. The first part of the
          analysis process is a linguistic process, comprising
          morphological analysis and part-of-speech-tagging in order to be
          able to detect words and phrases (noun-phrases and verb-
          phrases). The second part of the analysis process is based on a
          statistical co-occurrence analysis with the purpose of detecting
          relations between the detected words and phrases. The document
          analysis process is specified in detail in chapter 6.
       2. After the word analysis, the conceptual model is created by
          carefully selecting terms from the proposed list through a manual
          and cooperative process. Terms are selected and defined –
          possibly with examples from their occurrences in the document
          text, and related to each other. The actual modelling process can
          be performed in different manners, depending on the skills and
          other constraints of the stakeholders. In a setting we have
          examined (Brasethvik & Gulla, 1999), the modelling work was
          performed by a committee consisting of computer scientists,
          medical doctors [54] and other stakeholders from the particular
          domain. With respect to the goal of shared semantics, it is
          important that a sufficient number of end-users are involved in the
          modelling process, either directly in modelling, through training
          or under guidance, or at the very least in the definition of
          concepts [55].
           As we noted in chapter 2 on ontologies, domain-specific
           vocabularies exist today, in different domains, and a particularly
           good example is the medical domain. If adequate vocabularies or
           thesauri exist, these could be used as input in the modelling

 [54] This was an example from a medical domain.
 [55] Each concept that is included in the model is given a textual definition, possibly with examples. The
 document analysis process extracts occurrences of the discovered words and phrases, which can be used
 as examples in the definition.

Figure 3.4: [Process diagram; recoverable labels: Domain document DOC's and Contrast document DOC's; Word Analysis (Automatic); Concept Candidates (term frequency list, correlation analysis); Lexicon, definitions (Thesauri), Guidelines; Definitions; Referent Model (concepts, relations, definition); Linguistic Refinement (Semiautomatic); Lexicon; Domain Model (Model repository).]

Constructing the domain model. The actual modelling and definition of concepts and their relations is a
manual cooperative effort. However, modelling could be facilitated by a linguistic and statistical document
analysis process that can extract prominent words and phrases and indicate relations. Example definitions of a
concept can be extracted from existing lexicons or domain specific thesauri if available. In order to prepare for
automated matching of documents to the domain model, a domain model lexicon must be developed through
linguistic refinement of the model.

             process. Our modelling tool supports a dictionary lookup
             interface, in order to provide extraction of definitions from existing
             resources. Soergel (2003) discusses a variety of sources that can
             be used as input in such a process.
             The result of the modelling process is a conceptual domain model
             with textual definitions for each of the concepts.
        3. The last step of the modelling process is the construction of the
           domain model lexicon. This is considered a “linguistic refinement”
           of the model in order to prepare for automated discovery of
           concept occurrences in documents. For each concept in the
           model, we add a list of terms that we accept as a textual
           designator of the concept. This list of terms may be extracted
           from occurrences in the document text or from a synonym
           dictionary (like WordNet, Northes [56], or the Norwegian synonym
           dictionary (Gundersen, 2000)). Today this is performed manually.
           We then run the model through an electronic dictionary and
           extract all inflected forms of each concept name in the model and
           of the terms in its corresponding list.
 The completed domain model and the corresponding lexicon are stored in
 our model-repository.
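
 The first step of the modelling process can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the analysis specified in chapter 6: a regular-expression tokenizer stands in for morphological analysis and part-of-speech tagging, a smoothed relative-frequency ratio stands in for the comparison with the contrast collection, and relations are indicated by plain sentence-level co-occurrence counts. All function names and the example texts are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase word tokens; a stand-in for real linguistic analysis."""
    return re.findall(r"[a-zæøå]+", text.lower())


def concept_candidates(domain_docs, contrast_docs, top_n=5):
    """Rank terms by how much more frequent they are in the domain collection."""
    domain = Counter(t for d in domain_docs for t in tokenize(d))
    contrast = Counter(t for d in contrast_docs for t in tokenize(d))
    d_total = sum(domain.values())
    c_total = sum(contrast.values())
    vocab = len(set(domain) | set(contrast))
    scores = {}
    for term, freq in domain.items():
        # Add-one smoothing, so terms missing from the contrast
        # collection do not cause a division by zero.
        p_domain = (freq + 1) / (d_total + vocab)
        p_contrast = (contrast[term] + 1) / (c_total + vocab)
        scores[term] = p_domain / p_contrast
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


def cooccurrences(docs, terms):
    """Count how often pairs of candidate terms share a sentence."""
    pairs = Counter()
    wanted = set(terms)
    for doc in docs:
        for sentence in re.split(r"[.!?]+", doc):
            present = sorted(wanted & set(tokenize(sentence)))
            pairs.update(combinations(present, 2))
    return pairs


domain_docs = [
    "The patient received a diagnosis. The diagnosis led to treatment.",
    "Treatment of the patient follows the diagnosis.",
]
contrast_docs = [
    "The weather is nice. The train left the station.",
    "The station is near the river.",
]
candidates = concept_candidates(domain_docs, contrast_docs, top_n=3)
relations = cooccurrences(domain_docs, candidates)
```

 In this sketch the common word "the" scores low because it is just as frequent in the contrast collection, while the domain terms rise to the top of the candidate list. The resulting list and pair counts are only proposals; the selection and definition of concepts remains a manual, cooperative task.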

 3.3.2.          Classifying Documents with Conceptual Model Fragments
 A document is semantically classified by selecting a fragment of the
 conceptual model that reflects the document’s content. Figure 3.5

      [56] Lingsoft, “NORTHES - Norwegian Thesauri”,

Figure 3.5: [Process diagram; recoverable labels: DOC; Model lexicon; Word Analysis (Automatic); Document Concepts (# occurrences, position; all "concept" sentences with sentence number); Domain Model concepts & relations (Manual); Selected Concepts and Relations (sentences per relation: concepts, relation path, sentence number); Model lexicon; Instantiation (Manual); Referent Model Fragment (concepts, relations, suggested relation names); Translation & Storage (Automatic); Index; (Automatic classification).]

Semantic classification through lexical analysis and conceptual model interaction. A document can be
matched against the domain model in order to provide users with a suggestion of which concepts are
relevant for the document. The user manually overrides and refines this suggestion by direct interaction
with the model. The finished domain model fragment then has to be translated into a suitable storage
format for the underlying retrieval system.

shows how linguistic analysis is used to match the document against the
conceptual model and hence help the user classify the document.
   1. The user provides documents to be classified, either by drag and
      drop from a file system, or by providing the URL to the
      documents. Each document is downloaded by our document
      analysis servlet, which matches the document text with the
      concepts of the domain model. The matching is made by
      comparing a sequence of words (a concept name in the model
      may consist of more than one word) from the document with the
      concepts in the model. If available, the domain model lexicon is
      used in this process. The result of the matching is a list of all
      concepts found in the document – sorted according to number of
      occurrences – as well as a list of the document sentences in which
      the concepts were found.
   2. The found concepts are visualised to the user in the domain
      model. If several documents are being classified at once, the user
      may also examine a summary view that displays how all
      documents match the query. The user may then manually change
      the selection of concepts according to her interpretation of the
      document. The user may also select relations between concepts in
      order to classify a document with a complete model fragment,
      rather than just individually selected concepts.
   3. Once the user is satisfied with the selected model fragment, the
      fragment can be instantiated with respect to the particular
      document. By instantiation, we refer primarily to the naming of
      relations and adding of proper names that are considered
      instances of the concept. The statistical analysis will indicate

         proposed relations, but no names for the relations. [57] In order to
         add some semantics to the selected relation, it is important also
         to provide relevant relation names. In our system, users can add
         names to the relation ad hoc, or by selecting any name that has
         been added by previous users classifying a document according
         to this relation. Proper names (such as names of persons,
         locations etc.) are difficult to extract, and not readily available in a
         dictionary, hence difficult to capture automatically in the
         construction of a domain model lexicon. However, for a given
         document, these names may be considered important for
         capturing the connection between this document and the domain
         model. Users may therefore add these “instances” of the model
         during classification. All information added in the instantiation is
         included in the domain model lexicon, so that it can be
         offered to later users.

     4. The final step of the classification process is to transform the
        instantiated model fragment into an adequate storage format that
        interfaces with the actual retrieval system. The nature of this
        transformation will depend on the actual retrieval system in use.
        We have experimented with transformations into the RDF meta-data
        language. In our current prototype, we store the descriptions
        directly in our own classification repository and rely on a
        “homemade” retrieval mechanism.
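
The matching in step 1 can be sketched as follows. This is a minimal illustration, assuming a naive sentence splitter and exact word-sequence comparison against a small lexicon; the lexicon entries, the function names and the example document are hypothetical, and the prototype's document analysis servlet is not reproduced here.

```python
import re
from collections import defaultdict

# Hypothetical lexicon: concept name -> accepted terms (possibly multi-word).
LEXICON = {
    "Patient record": ["patient record", "medical record"],
    "Diagnosis": ["diagnosis", "diagnoses"],
}


def split_sentences(text):
    """Naive sentence splitter; a stand-in for linguistic analysis."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def match_document(text, lexicon):
    """Return (concept, occurrences, sentence numbers), most frequent first."""
    hits = defaultdict(list)
    for number, sentence in enumerate(split_sentences(text), start=1):
        words = re.findall(r"\w+", sentence.lower())
        for concept, terms in lexicon.items():
            for term in terms:
                term_words = term.lower().split()
                n = len(term_words)
                # Slide a window of n words over the sentence, since a
                # concept name may consist of more than one word.
                for i in range(len(words) - n + 1):
                    if words[i:i + n] == term_words:
                        hits[concept].append(number)
    return sorted(
        ((c, len(pos), sorted(set(pos))) for c, pos in hits.items()),
        key=lambda item: (-item[1], item[0]),
    )


document = (
    "The diagnosis is entered in the patient record. "
    "Two diagnoses were revised. "
    "The medical record is then archived."
)
suggestions = match_document(document, LEXICON)
```

Each entry in `suggestions` pairs a concept with its number of occurrences and the sentence numbers in which it was found, which is the information the user needs in order to accept, override or refine the suggestion in the model.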

In addition to the semantic meta-data, the system should also be able to
handle an adequate set of contextual meta-data attributes. We therefore
support the definition of a set of contextual meta-data attributes. These
are supplied by the user in an ordinary form-based interface. Also,
attributes regarding the classification (classified by, edited by etc.) are
kept, in order to keep track of which users have classified which
documents.

3.3.3.      Retrieving Documents with NL Queries and Model Browsing
In our approach, queries must be represented as model fragments. If
the domain lexicon is available, queries can be posed in natural
language and are then mapped into domain model fragments. Query results
are visualised in the domain model. Query refinement is performed
through user interaction with the domain model.

  [57] In a previous version of our system (Brasethvik & Gulla, 1999), we experimented with a name generation
 approach that would take as input all sentences in the current document that referred to both concepts of
 a relation, perform a lexical analysis of these, and try to propose a name. This solution proved to be too
 complex and impractical in implementation. Furthermore, this sentence analysis would require that
 the search phrase is written almost exactly as the original document sentence in order to
 produce a match. Details of this analysis are found in Appendix A.

Figure 3.6 gives an overview of the retrieval process:
    1. The domain model should be used in the query formulation
       process. Users can interact with the model, select concepts and
       relations and also select among the instances that have been
       provided in the classifications.
    2. If the domain model lexicon is available, users may also choose to
       enter text-only queries in natural language. This text is then
       matched against the conceptual model as in the classification
       process. The words of a text query are matched against both the
       concepts and the relations in the domain model.
    3. The model fragment representing the query is then matched
       against the stored document descriptions. For the prototype
       interface shown in this paper, we are using our own document
       servlet, which uses a simple concept matching algorithm to
       match, sort and present documents.
    4. Retrieved documents are visualised in a regular list. The result set
       of documents is also visualised graphically in the domain model,
       and provides a starting point for refining the search. The search
       may be narrowed by a) selecting several concepts from the model,
       b) following the generalization hierarchies of the model and
       selecting more specific concepts or c) selecting specific relation
       names from the list. Likewise, the search may be widened by
       selecting more general concepts, or selecting a smaller model
       fragment, i.e. fewer relations and concepts.
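
The narrowing and widening of a search can be illustrated with a small sketch. It assumes that a query is a set of concepts, that a document matches when its description covers every query concept or one of its specialisations, and that results are simply sorted by document identifier; the hierarchy, the stored descriptions and the function names are hypothetical and do not reproduce the prototype's matching algorithm.

```python
# Hypothetical generalization hierarchy: concept -> more specific concepts.
SPECIALISATIONS = {
    "Person": ["Patient", "Doctor"],
    "Patient": [],
    "Doctor": [],
    "Treatment": [],
}

# Hypothetical stored descriptions: document id -> concepts of its fragment.
DESCRIPTIONS = {
    "doc1": {"Patient", "Treatment"},
    "doc2": {"Doctor"},
    "doc3": {"Person", "Treatment"},
}


def expand(concept, hierarchy):
    """A concept together with all of its specialisations."""
    found = {concept}
    for child in hierarchy.get(concept, []):
        found |= expand(child, hierarchy)
    return found


def search(query_concepts, descriptions, hierarchy):
    """Documents that cover every query concept (or a specialisation of it)."""
    matches = []
    for doc_id, concepts in descriptions.items():
        if all(expand(q, hierarchy) & concepts for q in query_concepts):
            matches.append(doc_id)
    return sorted(matches)


wide = search({"Person"}, DESCRIPTIONS, SPECIALISATIONS)
more_specific = search({"Patient"}, DESCRIPTIONS, SPECIALISATIONS)
several_concepts = search({"Person", "Treatment"}, DESCRIPTIONS, SPECIALISATIONS)
```

Under these assumptions, the general concept Person retrieves all three documents, whereas selecting the more specific concept Patient, or adding the concept Treatment to the query, each yields a smaller result set.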

Figure 3.6: [Process diagram; recoverable labels: Domain Model; Model Lexicon; Linguistic preprocessing (Automatic); Domain model query; Model interaction (Manual); Query execution (Automatic); Index; Presentation (Automatic, Manual); Sentence rules; Lexicon.]

Document retrieval: Queries are formulated through domain model fragments. If desirable, a user can start
by entering a natural language query phrase, and only use direct interaction with the model to refine the
initial query. Queries are executed by the retrieval system in order to produce a list of relevant documents.
The retrieved documents are visualised in the domain model, in order to support subsequent query
refinement.

Figure 3.7 shows the prototype search interface. Users may interact with
the model and retrieve a ranked list of documents. Documents are
presented in a Web-browser interface by using a selection of the stored
Dublin Core attributes. The figure also shows what we have denoted
“enhanced document reader”, that is, when reading a document, each
term in the document that matches a model concept is marked as a
hyper-link. The target of this link is a “sidebar” with the information
regarding this concept that is stored in the domain model, i.e. the
definition of the concept, the list of accepted terms for this concept and
its relations to other concepts. If relevant, this sidebar also contains a
list of documents classified according to this concept. This way, a user
may navigate among the stored documents by following the links
presented in the sidebar. Such a user interface visualizes the connection
between the documents and the domain model, and may aid the user in
getting acquainted with the domain model. This “enhanced document
reader” not only works on classified documents, but is implemented as
a servlet that accepts any document URL as input. However, the reader
requires the presence of the domain model lexicon.
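
The principle of the enhanced document reader can be sketched in a few lines. The sketch is only an illustration: it works on plain text rather than on html documents, the link format and the `base_url` parameter are assumptions, and the lexicon and function name are hypothetical.

```python
import html
import re


def inject_links(text, lexicon, base_url="/concept/"):
    """Wrap every accepted term in a link to the page of its concept."""
    term_to_concept = {
        term.lower(): concept
        for concept, terms in lexicon.items()
        for term in terms
    }
    if not term_to_concept:
        return html.escape(text)
    # Longest terms first, so multi-word terms win over their parts.
    ordered = sorted(term_to_concept, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )

    def link(match):
        concept = term_to_concept[match.group(0).lower()]
        href = base_url + concept.replace(" ", "_")
        return f'<a href="{href}">{match.group(0)}</a>'

    return pattern.sub(link, html.escape(text))


lexicon = {
    "Diagnosis": ["diagnosis", "diagnoses"],
    "Patient record": ["patient record"],
}
marked = inject_links("The Diagnosis is in the patient record.", lexicon)
```

The sketch makes the dependency on the domain model lexicon explicit: without the list of accepted terms per concept, there is nothing to match against the document text.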

3.4. Possible modes of application
A domain model-based approach to semantic retrieval may be applied in
different modes [58], depending on the desired extent of manual effort:

                                              Figure 3.7

The prototype query interface. A query is represented as a domain model fragment (i.e. a selection in the
domain model), and produces a ranked list of documents. An enhanced document reader enables
exploration of a document with references to the domain model marked out in a sidebar.

 [58] Somewhat orthogonal to all the approaches is the ability to navigate and explore semantics of documents
 by way of the “enhanced” document reader. This add-on functionality was initially developed as a
 supplement to the retrieval task, in order to support navigation in the set of retrieved documents.
 However, by underlining all terms in the document text that map to a concept, and by displaying the
 concept details in a sidebar, such a document reader helps to visualise the connection between document
 and model and may increase users' awareness of the model and model concepts. The reader is
 implemented as a stand-alone web-service that will accept any html-document, and which simply “injects”
 links to the model-information into the regular html text. The document is viewed as normal in a web-browser.
 Such “muted” use of the model, where users are not exposed to the full graphical model, but still
 able to somehow use the model to navigate in a document collection, can be applied in settings where
 users cannot be bothered to learn the syntax of a graphical, conceptual modelling language.

   Manual: The modelling process is performed manually. No lexical
    definition of the concepts will be provided. Classification is done
    through interaction with the domain model, by selecting a domain
    model fragment. Querying is done by browsing and selecting
    model concepts. Query execution amounts to comparing the
    model fragment that represents the query with the stored model
    fragments.

   Manual, with input: The modelling is provided with input from the
    document analysis process, and lexical definitions (the domain
    model lexicon) for concepts are constructed or partially extracted
    from the domain analysis process. Classification can then be
    performed semi-automatically, by having the system suggest
    occurrences of concepts to the user when describing a document.
    For query formulation, users can choose between writing a text-only
    query that is mapped into a model fragment and
    interacting with the model directly. The query execution and
    ranking mechanism is the same as in the full-manual approach.

   Query interface only: If users cannot be bothered with the actual
    classification of documents, the model-based approach can be
    used as an advanced retrieval mechanism only. In this case, the
    classification of the documents will be performed automatically.
    Queries however, can be formulated by way of the model
    interface, and transformed into query expressions that match with
    the indexing in the applied IR system. For this expansion to work,
    however, the domain model lexicon must be constructed, in order
    to be able to map from model concepts to index terms and
    phrases. Query execution and ranking would then be performed
    entirely by the IR system.

    A variant of this approach could be to use the domain modelling
    approach in connection with the definition of categories, now
    common in IR based KM systems (Sections 2.5 & 2.9). Categories
    in these systems, can be defined in several ways, but a common
    approach is to define categories by lists of weighted terms. Rather
    than rely on simple term-lists however, a domain model can be
    constructed for a selected level in the category hierarchy. This
    would give the users an opportunity to participate in the definition
    of categories, while relieving them of the burden to model their
    domain completely. Using a hierarchy of high-level categories and
    models to define the lower parts of the hierarchy is also a way of
    enabling the use of several models in classification. While it may
    not be feasible to model everything in a domain within one large
    model, using multiple models could introduce ambiguity between
    overlapping concepts from different models. In general,
    mechanisms to solve heterogeneity would have to be built into the
    applied modelling language, such as the “equivalentTo”,

     “sameClassAs” constructs found in ontology languages (section
     2.8.2). Applying models together with categories and a weighting
     scheme to determine the degree of how much a concept belongs
     to a category, could prove to be another way of coping with such
Our approach is initially designed with the first and second of these
approaches in mind. The fully manual approach conveys the principles
on which we have based our approach, while the automation is
supported simply to ease the amount of manual work imposed on the
users. The last approach, where the model is only used as a query
interface, has emerged as a result of feedback from reviewers and
general response to the approach.
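
For the query-interface-only mode, the mapping from selected model concepts to index terms can be sketched as below. The boolean query syntax (OR within a concept, AND between concepts, quoted phrases) is an assumption made for illustration and is not tied to any particular IR system; the lexicon and the function name are hypothetical.

```python
# Hypothetical domain model lexicon: concept -> accepted index terms and phrases.
LEXICON = {
    "Patient record": ["patient record", "medical record", "journal"],
    "Diagnosis": ["diagnosis", "diagnoses"],
}


def to_ir_query(selected_concepts, lexicon):
    """Build a boolean query: OR within a concept, AND between concepts."""
    groups = []
    for concept in selected_concepts:
        # Fall back to the concept name itself if it has no lexicon entry.
        terms = lexicon.get(concept, [concept.lower()])
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


query = to_ir_query(["Patient record", "Diagnosis"], LEXICON)
```

The resulting expression would then be handed to the IR system, which performs query execution and ranking on its own index.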

3.5. Overview of approach chapters
In the sequel, the approach is detailed through the four following
chapters:

    Chapter 4: presents the Referent Model Language (RML), the
     conceptual modelling language we apply in our approach.

    Chapter 5: defines the interpretation of RML for document
     retrieval and the translation of RML fragments into RDF.

    Chapter 6: presents the linguistic document analysis process we
     have defined in order to support domain modelling.

    Chapter 7: presents the prototype implementation of the

                                   4. The Referent Model Language

4.1.      The referent model language - introduction
The Referent Model Language (RML) is a concept modelling language
targeted towards applications in areas of information management and
heterogeneous organisation of data (Sølvberg, 1999; Sølvberg, 2002). It
has a formal basis from set theory and provides a simple and compact
graphical modelling notation for set theoretic definitions of concepts
and their relations.
The presentation in the sequel is focused on the excerpt of the language
applied in our document retrieval approach.

4.1.1.      RML foundations
In RML, semantics of concepts are defined through set theoretic
constructs such as intension, extension and reference. Our interest in
RML is the constructs for defining concepts and their relations in such a
manner that it can be used in text classification and retrieval, i.e. as a
vocabulary of the domain. The act of modelling is to transform the
domain description from terms, the units of language, into concepts, the
units of thought:

         1 : something conceived in the mind : THOUGHT, NOTION
         2 : an abstract or generic idea generalized from particular
            instances. synonym see IDEA,
                                                (Merriam-Webster, 2002)

RML defines constructs for modelling of concepts. The selection of
constructs is based on the concept types given by Bunge (1998):

   Individual concepts – individual concepts apply to individuals.
    Individuals can be either specific "Terje" or generic (e.g. person

   Class concepts – concepts that apply to collections of individuals

   Relation concepts – concepts that refer to relations among objects
    (individual or class concepts). There is a somewhat blurred
    distinction between class concepts and relation concepts, as a
    relation may be considered a class concept in its own right. The
    concept of "marriage" may be considered a relation between
    persons (that is a relation between two individual concepts – the

     two persons – or a recursive relation in the concept class of
     Persons). However, "marriage" is also considered a distinct legal
     entity, thus viewed as a class concept in its own right.

    Quantitative concepts – quantitative concepts do not represent
     distinct objects, but refer to magnitudes often associated with
     individual or class concepts. Quantitative concepts are not applied
     in our approach to document retrieval and will not be treated in
     detail in the sequel.
The modelling constructs in RML are derived from the different types of
concepts given above. In order to formalise the language, each concept-modelling
construct is given a definition from set theory. Given the way class
and individual concepts are defined above, we adopt a straightforward
transition between concepts and sets; we model in concepts but our
concepts may refer to sets or individuals. We define the connection
between concepts and sets as in figure 4.1, a variant of the triangle of
meaning.
As before, terms are units in language: " a word or expression that has a
precise meaning in some uses or is peculiar to a science, art,
profession, or subject" (Merriam-Webster, 2002). In modelling, we define
the semantics of a term according to the two relations in figure 4.1. A
term in the UoD designates a concept in our models, while at the same
time it may denote individuals or sets of individuals. A concept then,
may be defined in either of the following ways:
     1. Through its term-set, i.e. through all the terms in the UoD that are
        used to designate it.
     2. Through its intension, the concept is defined "by itself", by
        formulating the characteristics of the concepts.
     3. Through its extension, that is, the concept is defined through the
        set (of individuals) that it refers to (or vice versa; through the set
        of individuals that belong to the concept). We may distinguish
        between the extension and the referent set of a concept. The
        extension of a concept may be empty, e.g. there are no real-world
        individual "trolls", even though the referent set of trolls is not
        empty (there are many specific trolls in Norwegian fairy tales).
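
The three ways of defining a concept can be made concrete with a small data structure. This is a sketch under stated assumptions, not part of RML: the class `Concept`, its field names and the troll data are hypothetical, the intension is reduced to a single predicate, and the extension is taken to hold only real-world individuals, in line with the troll example above.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class Concept:
    name: str
    # Terms in the UoD that designate the concept.
    term_set: FrozenSet[str] = frozenset()
    # Characteristic predicate standing in for the intension.
    intension: Optional[Callable[[dict], bool]] = None
    # All members the concept refers to, imaginary or real.
    referents: FrozenSet[str] = frozenset()
    # Members that exist as real-world individuals.
    extension: FrozenSet[str] = frozenset()

    def designated_by(self, term: str) -> bool:
        """True if the term is an accepted designator of the concept."""
        return term.lower() in self.term_set

    def covers(self, individual: dict) -> bool:
        """Apply the intension to an individual, if an intension is given."""
        return self.intension is not None and self.intension(individual)


troll = Concept(
    name="Troll",
    term_set=frozenset({"troll", "trolls"}),
    # Illustrative intension only: "weighs more than 100 kg".
    intension=lambda x: x.get("weight_kg", 0) > 100,
    referents=frozenset({"the troll under the bridge"}),
    extension=frozenset(),
)
```

The example keeps the referent set and the extension apart: the troll concept has a referent from a fairy tale, while its extension is empty.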

                                             Figure 4.1

We define the semantics of a concept through a term-concept-referent triangle inspired by the "triangle of
meaning" (Ogden & Richards, 1923).

Definition of a concept through intensions and extension yields the
connection to set theory and both ways are supported in RML. The
definition of concepts by way of terms in the language of the domain
(UoD), is something we will add in our approach to document retrieval,
and will be presented in chapter 5.

4.1.2.      Class concepts, individual concepts and attributes
Figure 4.2 shows the modelling of individual and class concepts and
attributes (properties). The class concept Students represents the set of
students in the UoD, while individual students "Hallvard", "Monica" and
"Terje" are represented as individual concepts. Class concepts are drawn
with a thick bottom line, while individual concepts are drawn as a
rounded edged rectangle.
As defined in the previous section and illustrated in the figure, a concept
in RML (e.g. student) may be defined in two ways, by listing the set of
members (figure 4.2a) or by defining a variable that can be used to
qualify members of the set (Figure 4.2b). This corresponds to the
distinction in RML between intension, extension and referents (Sølvberg,

   The intension of a concept is the set of all characteristic
    properties of the concept. A characteristic property of a concept
    is a property that is shared by all the referents of a concept. The
    core intension is the set of essential properties, i.e. the properties
    that may be used to determine whether or not a referent belongs
    to this concept. A characteristic property need not be an essential
    property. In addition, quantitative concepts are used to formulate
    intensions, e.g. "all individuals must weigh more than 100kg".

   The referent is what the concept refers to. The referent set of a
    concept contains all members; past, present and future,
    imaginary or real. The set of referents is what we refer to by
    naming the concept.

Figure 4.2: [Diagram with panels (a) and (b); recoverable labels: Students, Name, Hallvard, Terje, ∈.]

Set theoretic modelling of individual and class concepts in RML. A class can be defined by listing its specific
members (a) or through the definition of a generic – i.e. characteristic – individual concept (b).

  •  The extension of a concept is all individuals that belong to the

In figure 4.2a, we have modelled the extension of the Student concept,
while the characteristic element 'X' in figure 4.2b refers to the intension
of the Student concept. In RML today, there is no graphical
mechanism for defining intensions; this must be declared explicitly in a
constraint language, such as SDL (Semantic Definition Language) (Yang,
1993) or Object Constraint Language (OCL) (Warmer & Kleppe, 1998).
In RML, supporting these different approaches to defining a concept is
considered important for real world modelling. In our case, we will have
to devise a definition of a concept that will suit our application of RML
as a "subject language". Our definition of a concept in that sense is given
in section 5.2.
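
The two ways of defining a class concept can be mirrored in a few lines of
Python. This is only an illustrative sketch and not part of RML: the function
is_student, the property "enrolled" and the person "Arne" are assumptions
made for the example.

```python
# Extensional definition (figure 4.2a): list the members explicitly.
students_extension = {"Hallvard", "Monica", "Terje"}

# Intensional definition (figure 4.2b): a generic individual X belongs to the
# concept if it has the essential properties. The property "enrolled" is a
# hypothetical essential property, chosen for this example only.
def is_student(x: dict) -> bool:
    return bool(x.get("enrolled", False))

# Hypothetical population of individuals in the UoD.
people = [
    {"name": "Hallvard", "enrolled": True},
    {"name": "Monica", "enrolled": True},
    {"name": "Terje", "enrolled": True},
    {"name": "Arne", "enrolled": False},
]

# Applying the intension to a population yields an extension.
derived_extension = {p["name"] for p in people if is_student(p)}
print(derived_extension == students_extension)
```

In this toy population both definitions select the same individuals, but only
the intensional one says anything about individuals not yet listed.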

4.1.3.    Attributes
The figure also shows the listing of attributes for a concept. The
attributes of a concept are defined as a list of properties and are drawn
in a rectangle with a small black triangle in its lower right corner. This
listing is a shorthand notation for the set-theoretic definition of
attributes in RML. In set theory, the notion of an attribute is defined as a
function from the concept to a value (a linguistic unit). Thus
theoretically, each of these attributes must be identified and
represented separately (Sølvberg, 2002). However, in our approach to
semantic modelling of documents, we limit the use of attributes to the
listing of properties, as illustrated in figure 4.2.

4.1.4.    Relation concepts
Figure 4.3 illustrates the modelling of binary relations. In set theory, a
relation is a set of ordered pairs and as such it can be viewed as a
concept in its own right. This is reflected in RML by explicitly applying
the relation concept symbol, illustrated at the top of the figure.
In our approach however, we limit ourselves to the so-called mapping
notation for binary relations, where binary relations are drawn as simple
lines between the two concepts participating in the relationship.
Two kinds of constraints may be applied to a relation: cardinality and
coverage. The cardinality of a relationship is defined as the number of
members from each of the corresponding sets that participate in the
relation. The cardinality of a relation is shown with the use of an arrow
or by specifically numbering the maximum number of participating
members from the set. The arrowhead "points to one", which indicates
that this relationship is in fact forming a function. The third example in
figure 4.3 shows an unspecified binary relation, indicating that a person
may drive many cars, while a car may be driven by many persons. The
first example shows that a car must be owned by one and only one
(arrowhead) person, and constitutes a function from the set of cars into
the set of persons. The second example shows a one to one

                                            Figure 4.3

Example modelling of binary relations: Top: Complete notation using the relation concept symbol. Bottom:
The mapping notation offers a simplified "shorthand" notation.

correspondence between cars and their registration plates, and illustrates
that each car must have one and only one registration, which in turn is
valid for one and only one car. The notion of coverage indicates whether
or not all members of a set participate in the relation. If all members of
a set participate, this is indicated with a filled circle. In our example
every car (filled circle) must have an owner (person), and likewise every
car must have a registration while every registration must belong to a car.
Relations may be given names. Names are written on top of the relation,
and may be prefixed with a concept name that indicates the direction of
reading the relation name (Person owns car, rather than car owns person).
We will return to the semantics of relation naming with respect to
document classification later (section 5.2). For now, we illustrate
naming of relations by the examples in figure 4.3.
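
Since a binary relation is a set of ordered pairs, cardinality and coverage
can be checked directly on such a set. The following is a minimal sketch
under that reading; the helper names max_cardinality and has_full_coverage
and the example data are our own assumptions, not RML definitions.

```python
from collections import Counter

def max_cardinality(pairs, side):
    """Largest number of pairs any single member on `side` (0 or 1) takes part in."""
    counts = Counter(pair[side] for pair in pairs)
    return max(counts.values(), default=0)

def has_full_coverage(pairs, members, side):
    """True if every member of the set on `side` participates in the relation."""
    return set(members) <= {pair[side] for pair in pairs}

# Hypothetical example data: pairs are (car, person).
cars = {"car1", "car2"}
persons = {"anna", "ola", "kari"}
owned_by = {("car1", "anna"), ("car2", "anna")}

# "A car must be owned by one and only one person": every car takes part
# (full coverage on the car side) exactly once (cardinality 1), so the
# relation is a function from the set of cars into the set of persons.
is_function = (has_full_coverage(owned_by, cars, 0)
               and max_cardinality(owned_by, 0) == 1)
print(is_function)

# Persons are only partially covered: not every person owns a car.
print(has_full_coverage(owned_by, persons, 1))
```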

4.1.5.     Abstraction mechanisms – concept classification
RML supports several abstraction mechanisms. In particular, the various
ways of generalisation and classification of a concept into higher level
abstractions are useful when modelling in information management
The generic classification operation is illustrated in figure 4.4a. More
specific are the classification operations used to specify set membership
and set inclusion and partition. As illustrated in figure 4.4b, set
membership is specified either by listing the particular individual
concepts that constitute the extension of the concept – denoted by using
the member of (curly bracket) symbol – or by defining a generic set
member (variable) and qualifying set properties, using the element of

Since class concepts correspond to sets, any class concept may be
made up of several subsets. Subsets may be overlapping or disjoint.
Figure 4.4c illustrates how the concept of Employee is considered a
subset of the concept Persons. This is not a disjoint subset: we could
think of several others, for example the set of skiers or the set of
parents, and the same person may be a member of all of these.
However, we may also have a disjoint partition of the set of persons, for
example into men and women, where no person can participate in both,
in our – albeit conservative – interpretation of the UoD.
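
The difference between overlapping and disjoint subsets can be stated with
ordinary set operations. The sketch below is illustrative only; the function
names and the individuals are hypothetical.

```python
from itertools import combinations

def all_subsets_of(result, operands):
    """Every operand set is included in the result set."""
    return all(op <= result for op in operands)

def pairwise_disjoint(operands):
    """No individual is a member of two different operand sets."""
    return all(a.isdisjoint(b) for a, b in combinations(operands, 2))

# Hypothetical individuals.
persons = {"anna", "ola", "kari", "per"}
employees = {"anna", "ola"}
skiers = {"ola", "kari"}
men = {"ola", "per"}
women = {"anna", "kari"}

# Overlapping subsets: "ola" is both an employee and a skier.
print(all_subsets_of(persons, [employees, skiers]))
print(pairwise_disjoint([employees, skiers]))

# Disjoint partition: no person is in both sets, and together they
# make up the whole set of persons.
print(pairwise_disjoint([men, women]) and (men | women) == persons)
```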

4.1.6.     Composition of concepts and relations
In RML, n-ary relations are modelled as composite class concepts. This
treatment of aggregation diverges from basic set theory, and follows the
notation from regular modelling languages, such as UML. Composing a
concept of other concepts is performed by using regular relations as
part-of relations from the part class concepts to the composite class

                                            Figure 4.4




Modelling abstraction by way of concept classification. A) generic isa-relation. B) member of (left) and
element of (right). C) overlapping (left) and disjoint (right) subsets.

                                              Figure 4.5

   Modelling composition of concepts. The example shows composing/decomposing a cat – each
   element in the cat class concept is composed of parts from the set of cat-parts. For the mappings
   from a part set to the composite concept, cardinality and coverage constraints apply.

Figure 4.5 illustrates the composition of a cat. A cat is considered as
consisting of parts from the different part classes, head, leg, tail and
torso. Cardinalities are given on each part-of relation, in order to
indicate how many elements from each set are participating in the
"construction" of the composite class concept. In the figure, a cat
consists of 1 head, up to 4 legs, 1 torso and 1 tail. Coverage is applied
in order to further define the composition, showing that a cat must have
a head and a torso, but may be fully functional without legs and a tail. In
the example, all the part concept classes have to be connected to a cat,
as we do not want any cat parts floating around. In some cases however,
the parts may be class concepts in their own right and have elements

                                              Figure 4.6

Modelling composition through individual concepts. The structure of a composition can be explicitly shown
by defining a "characteristic element", that is by modelling the composition at the individual concept level.
The structure indicates the construction of a cat from components from the various part sets. The figure
also illustrates the ability to model the "same thing" at several abstraction levels even within one model.

not always connected to a composite. This is for example the case with
car wheels, which – in Norway at least – are changed as the driving
conditions change from summer to winter, always leaving a spare set of
wheels in the garage.
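
The cat example can be read as a set of (minimum, maximum) bounds per part
class. The following sketch checks a list of parts against the bounds
described for figure 4.5; the table PART_RULES and the function is_valid_cat
are hypothetical names, and only the cat side of each part-of relation is
checked.

```python
from collections import Counter

# (minimum, maximum) number of parts per cat, as read from the description
# of figure 4.5. A minimum of 1 means a cat must have the part.
PART_RULES = {"head": (1, 1), "torso": (1, 1), "leg": (0, 4), "tail": (0, 1)}

def is_valid_cat(parts):
    """Check a list of part-class names against the bounds in PART_RULES."""
    counts = Counter(parts)
    if set(counts) - set(PART_RULES):
        return False  # a part from an unknown part class
    return all(low <= counts[part] <= high
               for part, (low, high) in PART_RULES.items())

print(is_valid_cat(["head", "torso"]))                 # no legs, no tail
print(is_valid_cat(["torso", "leg", "leg", "tail"]))   # missing head
```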

An alternative way of modelling compositions is to build a composite
structure of individual concepts, as illustrated in figure 4.6. This is the
way of graphically defining the generic individual cat, previously
referred to as a "variable" (section 4.1.2). As shown, modelling
in individual concepts gives more attention to structural details than the
abstract modelling of figure 4.5.

As relations can be considered concepts in their own right, relations
can also be composed. In RML a composition of relations is defined
as a derived relation. From a set theoretical perspective, derived
relations correspond to composition of functions. However, from a pure
modelling perspective, they can also be viewed as a simple naming of a
path in the model, i.e. a kind of short-cut or shorthand notation. For the
purpose of document classification and retrieval, we opt for the latter
interpretation. Graphical modelling of composite relations is illustrated
in figure 4.7.
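
Under the set theoretical reading, a derived relation can be computed by
composing the relations along the path. The sketch below is a minimal
illustration; the relation names drives and causes and the individuals are
assumptions loosely based on the example of figure 4.7.

```python
def compose(r, s):
    """Relational composition: (a, c) whenever (a, b) is in r and (b, c) is in s."""
    return {(a, c) for a, b in r for b2, c in s if b == b2}

# Hypothetical base relations: (person, car) and (car, speeding ticket).
drives = {("ola", "sportscar1"), ("kari", "van1")}
causes = {("sportscar1", "ticket42")}

# The derived relation names the path person -> car -> speeding ticket.
gets_ticket = compose(drives, causes)
print(gets_ticket)
```

Under the interpretation we opt for, the derived relation is just a named
shortcut for this path, so nothing needs to be stored beyond the two base
relations.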

The examples in this chapter show the variation in detail and abstraction
that can be applied by way of RML. In conceptual modelling, one
frequently has to move among different levels of abstraction, even
within the same model or model fragment. Such mechanisms will have
to be exploited in models of heterogeneous and ambiguous domains.

                                              Figure 4.7

  Modelling of derived relations: The speeding ticket is caused by the sporty person (isa person) driving
  the sportscar (isa car).

4.2. The RML - Meta Model
This section defines the part of RML applied in our approach through a
UML meta-model of the language followed by definitions of the RML
constructs we use. Using UML for this serves two purposes. First, it is
reasonable to use a different language than RML to model itself, and
second, this meta-model will be used to map constructs in RML to RDF
(chapter 5). It is therefore also feasible to have the meta-models of the
two languages available in a common neutral language, to facilitate the
comparison. The meta-model is shown in figure 4.8. The OCL constraints
that apply to the meta-model are defined in table 4.1, while the
meta-model constructs themselves are properly defined in table 4.2. Note that
some of the concepts in the meta-model are abstract concepts, not
found in an RML model, but included here as a “modelling tactic”, i.e. to
bind the model together.

                                      Figure 4.8

                                  UML meta-model of RML.

                                                    Table 4.1

OCL Statement                                                           Comment
Concept:                                                                Non-relation concept and relation concept are
                                                                        disjoint classes
self.NonRelationConcept -> intersection(self.RelationConcept).isEmpty

NonRelationConcept:                                                     Individual and class concepts are disjoint
self.ClassConcept -> intersection(self.IndividualConcept).isEmpty

GeneralisationConcept:                                                  The result of a generalisation operation is
                                                                        always a class concept
self.result-> forall(isTypeOf(ClassConcept))

Subset:                                                                 All Operands of a subset construct must be
                                                                        class concepts
self.operand-> forall(isTypeOf(ClassConcept))

Disjoint:                                                               The disjoint operation represents the union of
                                                                        mutually disjoint subsets
self.operand.-> forall(C1,C2 | C1.allInstances -

Element:                                                                Operands of an element generalisation
                                                                        operation must be individual concepts

ElementOf:                                                              All operands of an ElementOf operation must
                                                                        be generic individual concepts

MemberOf: self.operand->forall(isTypeOf(SpecificIndividualConcept))     All operands of a MemberOf operation must
                                                                        be specific individual concepts

              Object Constraint Language (OCL) constraints for the RML meta-model constructs

4.2.1.       Construct definition
Table 4.2 gives a summary of the RML constructs together with their
                                                    Table 4.2

Construct                              Definition
Concept                                Abstract. All concepts may have a name. Names should be
                                       name space constrained, hence we use the XML-schema
                                       QualifiedName type for names (XML, XX).
                                       Definition: a concept is semantically defined through its
                                       extension E(C), intension I(C), and referent set Rset(C). Two
                                       concepts are semantically equal iff:
                                       E(C1) = E(C2) and I(C1) = I(C2) and Rset(C1) = Rset(C2)
Relation Concept                       Abstract. All relation concepts are related to one or more
                                       other concepts.
Binary Relation Concept                A binary relation concept is a relation concept related to only
                                       two other concepts. For each of the two related concepts
                                       cardinality and coverage must be specified. In the meta-
                                       model, this is specified with the relation_end construct.
                                       Definition: for two concepts C1, C2 the binary relation R
                                       between them is defined as R ⊆ C1 X C2, that is E(R) ⊆ E(C1)
                                       X E(C2). The subset of C1 X C2 is selected through the use of
                                       cardinality and coverage.
                                       cardinality(R,C) = n, where n is the number of times one element
                                       from C participates in R
                                       coverage(R,C):{full, partial} defines whether or not all elements
                                       of C must participate in R

N-ary relation concept                 N-ary relations (relation class concepts) are "composed of" n
                                       binary relations.

Non-relation concept           Abstract. A construct introduced in the meta-model in order to
                               state that the end-points of relations cannot be relation
Class concept                  A non-relational class concept. Its definition is equal to that of
                               a concept, i.e.:
                               Definition: a concept is semantically defined through its
                               extension E(C), intension I(C), and referent set Rset(C). Two
                               concepts are semantically equal iff:
                               E(C1) = E(C2) and I(C1) = I(C2) and Rset(C1) = Rset(C2)
Individual Concept             Abstract.
Specific individual concept    A specific individual concept, named by a string literal
Generic individual concept     A generic individual concept, named by a variable. Variable
                               names in RML are written as single capital letters.
Attribute                      An attribute is a relation from a non-relation concept into a
                               linguistic class concept. We do not apply linguistic class
                               concepts in our approach.
                               An Attribute A is defined in core RML as a mapping into a
                               value, such as:
                               A={(c,q,s)| where q:E(c)X {s} → V}
                               In our approach, however, we are only interested in listing
                               the attributes of a concept, thus A = (c, n, t), where
                               c is the concept the attribute belongs to
                               n is the qualified name for this attribute, and
                               t is the attribute type, one of {defining, non-defining}

Operations                     Abstract. Operations on concepts convert one or more
                               operands into a result, where both the operands and the result
                               are concepts.
Mathematical operations        N/A (included for completeness)
Generalisation operations      A generic Is-a operation. Its definition is equal to the
                               overlapping subset operation.
Subset operations              Abstract.
Overlapping Subset Operation   E(Result) = ∪i E(Opi)
                               For all i, E(Opi) ⊆ E(Result)

Disjoint Subset Operation      E(Result) = ∪i E(Opi), and where
                               for all i,j, i≠j, E(Opi) ∩ E(Opj) = ∅

Element operations             Abstract.
Element Of                     The element of operation classifies generic individual
                               concepts as elements of a class concept

Member Of                      The member of operation classifies specific individual concepts
                               as elements of a class concept.

                                  Definition of RML constructs
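
Two of the definitions in table 4.2 can be written down as simple data
structures. This is a sketch under our own assumptions, not a definition of
RML: extension, intension and referent set are represented as frozen sets of
names, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    name: str
    extension: frozenset = frozenset()   # E(C)
    intension: frozenset = frozenset()   # I(C), here a set of property names
    referents: frozenset = frozenset()   # Rset(C)

def semantically_equal(c1: Concept, c2: Concept) -> bool:
    """E(C1) = E(C2) and I(C1) = I(C2) and Rset(C1) = Rset(C2); names are ignored."""
    return (c1.extension == c2.extension
            and c1.intension == c2.intension
            and c1.referents == c2.referents)

@dataclass(frozen=True)
class Attribute:
    concept: str   # c: the concept the attribute belongs to
    name: str      # n: the qualified name of the attribute
    kind: str      # t: the attribute type

    def __post_init__(self):
        if self.kind not in {"defining", "non-defining"}:
            raise ValueError(f"unknown attribute type: {self.kind}")

# Two differently named concepts with identical semantics.
student = Concept("Student", frozenset({"Terje"}), frozenset({"enrolled"}),
                  frozenset({"Terje"}))
pupil = Concept("Pupil", frozenset({"Terje"}), frozenset({"enrolled"}),
                frozenset({"Terje"}))
print(semantically_equal(student, pupil))
```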

4.3. RML Models and model fragments
Having defined the constructs of the RML language, we may now define
what we mean by a Referent Model.

4.3.1.    A model and a model fragment
We define a model as being composed of externalised statements about
the domain, thus:
   An RML model is a set of statements composed of constructs from the RML language.
   An RML model is syntactically correct if all statements in the model are allowed
   according to the RML language specification, that is M \ L = Ø, where M is the set of
   expressed model statements and L is the set of all possible language statements.

In our approach, we limit the use of RML constructs to those defined in
the RML meta-model of figure 4.8. This implies that a model in our
sense, must conform to the constraints and rules expressed in the meta-
model. In addition to the definition above, we thus have:
   A model is an expression of statements composed of non-abstract constructs from the
   RML meta-model. A model statement is correct if it conforms to the language
   specification given by the RML meta-model.

We describe the semantics of documents through Referent model
fragments. A model fragment is simply a part of a model:
   A model fragment (Mf) is a subset of the statements in the model (M): Mf ⊆ M. A model
   fragment is said to be well-formed if all statements in the fragment conform to the
   RML language specification, given by the meta-model. A model fragment is said to be
   connected if for any two statements of the model fragment, there exists a path between
   them that is also part of the fragment.
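
These definitions can be tried out on a toy model. In the sketch below a
statement is simplified to a (concept, relation, concept) triple, the language
L is given as a predicate rather than as a set, and two statements are taken
to be adjacent when they share a concept. All of this, including the function
names, is an assumption made for illustration.

```python
def is_syntactically_correct(model, allowed):
    """M \\ L is empty: every statement in the model is allowed by the language."""
    return all(allowed(statement) for statement in model)

def is_fragment(fragment, model):
    """Mf is a subset of M."""
    return fragment <= model

def is_connected(fragment):
    """Every statement can be reached from any other through shared concepts."""
    statements = list(fragment)
    if not statements:
        return True
    reached = {statements[0]}
    frontier = [statements[0]]
    while frontier:
        subject, _, obj = frontier.pop()
        for s in statements:
            if s not in reached and {subject, obj} & {s[0], s[2]}:
                reached.add(s)
                frontier.append(s)
    return len(reached) == len(statements)

# Hypothetical model statements.
model = {
    ("Car", "owned_by", "Person"),
    ("Car", "has", "Registration"),
    ("Employee", "isa", "Person"),
    ("Cat", "has_part", "Tail"),
}
allowed = lambda s: s[1] in {"owned_by", "has", "isa", "has_part"}

selection = {("Car", "owned_by", "Person"), ("Employee", "isa", "Person")}
scattered = {("Car", "has", "Registration"), ("Cat", "has_part", "Tail")}

print(is_syntactically_correct(model, allowed))
print(is_fragment(selection, model), is_connected(selection))
print(is_fragment(scattered, model), is_connected(scattered))
```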

The model fragments applied in our approach are the result of user
interaction, and we will simply refer to them as selections: a selection is
a model fragment determined by user interaction.

4.3.2.    Defining model views
Neither the model nor the model fragment definitions given above
explicitly refer to the graphical presentation of a model. We will apply
the domain models as an integral part of the user interface, and we thus
have to find suitable means for handling the graphical presentation of a
model. The RML notation is given through the presentations in figures
4.1 – 4.7.
In addition we define view and filter as mechanisms to control the
presentation of a model and model fragments:

   A view is the graphical presentation of a model fragment. A filter is a complexity
   reducing mechanism applicable to a view. A filtered view of a model is thus the view
   that emerges as a result of applying a filter to a view.

Seltveit (1994) has formally defined views and filtering mechanisms
and specified a set of filters applicable to the ERT language (McBrien et
al., 1992), one of the predecessors of the RML language. In our

approach, we will apply Seltveit's framework for specifying filters and
adapt some of the relevant ERT filters to RML. In order to support the
user interaction for document classification and retrieval, we will also
specify some additional filters. Following is a brief description of some
aspects of a filter, as given by (Seltveit, 1994) and the adapted RML

  •  Level: A filter is applied at either the language level or the model
     level. A language-level filter operates on the model constructs, while
     a model-level filter operates on the statements of a model.

  •  Inclusive/Exclusive: A filter may specify constructs to be either
     included in the filtered view from the full view or excluded from the
     filtered view. Since we only apply visual filters operating on the
     current user view of the model, we may denote these aspects as
     show/hide respectively. None of the filters we apply actually
     transforms the underlying model.

  •  Determinism and scope of effects: A filter is deterministic if it
     produces the same filtered view Vf each time it is applied to the
     same view V. The scope of effects of a filter is either local (only
     affecting constructs within the original view) or global (effects
     propagating outside the view it operates on). We only apply quite
     simple filters as a complexity reduction mechanism on a
     presented view, thus all filters we apply are local to the presented
     view. Deterministic filters are preferred, as non-deterministic
     filters may confuse the user: the result of applying the filter may
     change every time it is applied.
Table 4.3 shows the defined RML filters.
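
The filter aspects above can be illustrated with filters written as pure
functions from a view to a filtered view. This is a sketch, not Seltveit's
framework: a view is simplified to a set of (construct kind, name) pairs, and
hide_kinds and show_only are hypothetical names.

```python
def hide_kinds(view, kinds):
    """Language-level, exclusive filter: hide all constructs of the given kinds."""
    return frozenset(c for c in view if c[0] not in kinds)

def show_only(view, selection):
    """Model-level, inclusive filter: show only the selected statements."""
    return frozenset(c for c in view if c in selection)

# Hypothetical view of a small model.
view = frozenset({
    ("class concept", "Person"),
    ("class concept", "Car"),
    ("attribute", "Person.name"),
    ("relation", "owns"),
})

# An attribute filter in the sense of table 4.3.
without_attributes = hide_kinds(view, {"attribute"})

# Deterministic: the same view always gives the same filtered view.
print(without_attributes == hide_kinds(view, {"attribute"}))
# Local: the filtered view contains nothing from outside the original view,
# and the original view itself is left untouched.
print(without_attributes <= view, len(view))
```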

4.4. Summing up
We have presented the Referent Model Language (RML) that we will
apply in our approach to semantic modelling of documents. The
language was originally proposed in (Sølvberg, 1999) and defined in
(Sølvberg, 2002). Our contribution in this respect is to select and define
the RML-constructs we apply in our approach. We have further defined
the concepts of a model, model fragment, view and filters, that we will
apply in order to interpret the semantics of a document according to the
model, in the next chapter.

                                        Table 4.3
Filter                          Level    I/E     Description
Non-relation concept filter     L        I       Shows all non-relation concepts and
                                                 generalisation operations, but hides all
                                                 relation concepts and attributes from the model.
Class concept filter            L        I       Shows all class concepts and subset
                                                 generalisation operations. Hides all individual
                                                 concepts and the corresponding element
                                                 generalisation operations, all relation
                                                 concepts and attributes from the model.
Individual concept filter       L        E       Hides all individual concepts (generic and
                                                 specific) and their corresponding element
                                                 generalisation operations.
Attribute filter                L        E       Hides all attributes.
Generalisation filter           L        I/E     Hides or shows selected is-a hierarchies.
                                                 Operates on all generalisation operations
Relation filter                 L        E       Hides all relations
Cardinality constraint filter   L        E       Hides all cardinality constraints from a selected
                                                 relation
Coverage constraint filter      L        E       Hides all coverage constraints from a
                                                 selected relation
N-ary relation filter           L        E       Hides all N-ary relations
Derived relation filter         L        I/E     Hides or shows all derived relations
Selection filter                M        I       Hides everything but the current selection.
Document Hit filters            M        I       Shows all concepts and relations that are
                                                 marked with a hit according to the current set
                                                 of documents, hides everything else.
                                    Definition of RML filters

   5. RML based document classification and retrieval

The classification and retrieval of documents by way of domain model
fragments and model based query operations is a fundamental part of
our approach. To achieve this, we define the referential semantics of an
RML model fragment in terms of the domain document collection. The
referential semantics of an RML model and model fragment are defined in
5.1 and 5.2.
RML models deal with class concepts and instances. To increase
precision in classifying a document, we allow for instantiation of a
selected model fragment. Instantiation of a model fragment at
classification time is defined in section 5.3.
For documents described by instantiated model fragments, we define a
model-based query notation and a set of model-based query operations that
enable retrieval. The model-based querying must conform to common
query processes and support the same capabilities as traditional IR
systems (section 5.4).
One of the goals behind our approach has been to be able to use the
model based retrieval approach on top of existing IR machinery. Section
5.5 specifies a strategy for the realisation of our approach by way of the
emerging web standard for document descriptions, the Resource
Description Framework (RDF). A mapping between RML and RDF is

5.1. Referential semantics of model fragments
This section describes the semantics of an RML domain model with
respect to a document collection, which is the fundamental idea behind
our RML based retrieval model. At modelling time, the semantics of an
RML concept may be interpreted in terms of its intension, extension and
referent set, as given in the RML definition (Chapter 4). However, in a
document retrieval and classification setting, we must also define a
concept based on its references in the document texts.
Referential semantics (see e.g. Svenonius, 2000) defines the semantics
of a concept by how it refers to occurrences in the document collection
rather than how it refers to real world referents. In our case, this is the
key to applying the model in dealing with documents and text, and not real
world referents or information system instances, in the regular sense.

                                             Figure 5.1

              Triangle of meaning redefined according to the notion of referential semantics.

The original triangle of meaning was described in chapter 4, where a
concept is defined through terms and referents. The reinterpreted
relations in the triangle of meaning for referential semantics are shown in
figure 5.1.
Again, terms are the units of language designating a concept. Terms
occur in documents. More specifically, a term occurs in a context within a
document. For the document analysis process we specify in chapter 6,
we define a document context to be a surrounding unit of words: either a
"window of words" or one of the document structure levels (sentences,
paragraphs and sections). For the discussion in this chapter however, we
retain the document as the context of a term.
Interpreting a model concept by way of a document collection, we
substitute the real world referents with the set of document contexts
referred to by the concept. Thus, the references of a model concept are
all documents that are interpreted by some user to be a reference to
this concept.

                                             Figure 5.2

 Classification and retrieval based on domain model fragments: A document is interpreted as a reference
 to concepts in a domain model fragment. The documents that answer a query, expressed as another
 model fragment, are given by collecting the set of references of the concepts in the query model

Figure 5.2 illustrates the process of classification and retrieval by way of
the domain model. The semantics of a document is given by its
interpretation as a domain model fragment. In our approach, we expect
the user to be able to perform this interpretation manually. The user
interpretation means that there is not a direct mapping between terms,
their document occurrences and the concepts. A term may refer to
several things, have several meanings or be deemed as not relevant in a
given context. Therefore, the “refers to” relation cannot be directly
inferred simply by examining the terms occurring in a document, even if
the terms are specified as designating this concept. In our approach, we
say that the occurrence of a term suggests a reference of a concept, but
the final interpretation is left to the user.
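
The distinction between a suggested reference and the final interpretation
can be sketched as a two-step procedure. The code is illustrative only: the
designating terms, the tokenisation and the function names are assumptions,
and the confirm callback merely stands in for the user's manual decision.

```python
import re

def suggest_concepts(text, designations):
    """Concepts for which at least one designating term occurs in the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return {concept for concept, terms in designations.items() if words & terms}

def classify(text, designations, confirm):
    """Keep only those suggested concepts that the user confirms."""
    return {c for c in suggest_concepts(text, designations) if confirm(c)}

# Hypothetical terms designating two concepts of a domain model.
designations = {
    "Car": {"car", "vehicle"},
    "Person": {"person", "driver"},
}
text = "The driver parked the car."

suggested = suggest_concepts(text, designations)
# Here the user judges that the document is about cars, not persons.
confirmed = classify(text, designations, confirm=lambda c: c == "Car")
print(sorted(suggested), sorted(confirmed))
```
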
A query is represented as a domain model fragment. The set of
references of a domain model fragment is the set of documents – or
document contexts – referred to by the concepts in the model fragment.
Document retrieval is performed by collecting the set of references from
the model fragment that represents the query.
From this, we get the following definitions:
  The Domain Document collection is the set of documents considered specific to the
  domain described by the domain model, and therefore liable to classification by this
  domain model. A document collection may be Open if new documents are added
  continuously, or Closed if the number of documents is fixed.
  The Domain model is the set of statements about the domain explicitly represented by
  the stakeholders ("domain experts"). For document retrieval, the semantics of a model
  is defined by its intension and set of references:
       The Intension of a model is the model itself, i.e. the shared agreement on the
       definition of the domain semantics, as explicitly represented in the model by the
       stakeholders.
       The Set of references of a domain model is the set of all documents (in the
       collection) classified according to this model.

  A model fragment is a subset of statements in the domain model.

A retrieval model consists of a document representation scheme, a
query language and a similarity function for calculating the similarity
between a given document and the query. Thus we have:
  A document representation Dr is given by a domain model fragment. This domain
  model fragment represents the interpretation of the document with respect to the
  domain model. For the interpretation of a specific document, the domain model
  fragment may be instantiated.
  A query representation Qr is the domain model fragment reflecting the query Q.
  A similarity function sim(Dr, Qr) calculates the similarity of two model fragments Dr
  and Qr. The similarity function can be either boolean or weighted. A boolean similarity
  function deems Dr and Qr to be either relevant or not, while a weighted similarity
  function calculates a measure of similarity between Dr and Qr.

We will return to the actual similarity calculations later. For now, we may
apply the similarity function to define the semantics of a model
fragment with respect to the document collection:

   The set of references to a model fragment is the set of all documents from the
   document collection, which are calculated as relevant to the model fragment by
   applying the similarity function.
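The definitions above can be illustrated with a small sketch. It assumes that a model fragment is reduced to a set of concept names, that the boolean similarity is set containment and that the weighted similarity is Jaccard overlap. These are placeholder choices for the example; the actual similarity calculations are treated later in the text.

```python
def sim_boolean(dr, qr):
    """Boolean similarity: relevant only if every query concept occurs in Dr."""
    return qr <= dr


def sim_weighted(dr, qr):
    """Weighted similarity, here Jaccard overlap (an assumed measure)."""
    union = dr | qr
    return len(dr & qr) / len(union) if union else 0.0


def set_of_references(qr, collection, threshold=None):
    """Documents calculated as relevant to the fragment qr.

    Without a threshold the boolean function is used; with a threshold the
    weighted function must reach at least that value.
    """
    if threshold is None:
        return {d for d, dr in collection.items() if sim_boolean(dr, qr)}
    return {d for d, dr in collection.items() if sim_weighted(dr, qr) >= threshold}


# Hypothetical document representations Dr, one fragment per document.
collection = {
    "d1": frozenset({"Doctor", "Patient", "Consultation"}),
    "d2": frozenset({"School", "Pupil"}),
    "d3": frozenset({"Doctor", "Medical centre"}),
}
query = frozenset({"Doctor"})
boolean_refs = set_of_references(query, collection)
weighted_refs = set_of_references(query, collection, threshold=0.5)
```

With the boolean function both d1 and d3 are references of the query fragment; with the weighted function and a threshold of 0.5 only d3 remains, since d1 contains more concepts that the query does not mention.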

A model and model fragments are previously defined as a composition
of Referent model constructs (Chapter 4).
To further specify the referential semantics of an RML model in terms of
the document collection, we have to specify the semantics of the
different RML constructs.

5.2. Referential semantics of RML constructs
As with models and model fragments, for document retrieval, the
semantics of an RML construct is defined through its intension and its
set of references. For all constructs, the intension is similar to what is
previously defined for a model: The intension of an RML concept is the
domain specific shared agreement on the definition of the concept. This
definition applies to both non-relational and relational concepts. From
the relations of the triangle in figure 5.1, we have:
   The designators of a concept are the set of terms that designate the concept.
   The set of references is all documents returned by applying the similarity function to
   the model fragment consisting of that concept only.

In the sequel, we will specify the extension and set of references for all
RML constructs we apply in our model-based classification of
documents.

5.2.1.    Class concept
RML class concepts are the main semantic carriers of the domain model
in a similar way as nouns are considered the main semantic carriers in
text indexing models (Baeza-Yates & Ribeiro-Neto, 1999).
The intensions of class concepts are captured in our domain models by
supplying all class concepts with a descriptive, human-readable
definition. This is a local subjective definition of a concept as given by
the stakeholders of the domain, i.e. the users of the retrieval system.
The set of references for a class concept is the set of documents from the
collection that are classified by a model fragment including that concept or an
instance of that concept. In other words, the reference set of a class
concept includes the reference sets of all its instances. The reference set
of an instance however, does not include that of its class concept. To
exemplify: A user searching for documents about NTNU, one of the
universities of Norway, would not be interested in documents about
Norwegian universities in general, while a user searching for documents
about universities in general, might be interested in documents about a
specific university.
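The asymmetry between class concepts and their instances can be sketched as below. The classification table and the helper names are hypothetical and serve only to mirror the NTNU example.

```python
# Hypothetical classifications: document -> concepts or instances selected.
CLASSIFIED = {
    "d1": {"University"},  # about universities in general
    "d2": {"NTNU"},        # about one specific university
    "d3": {"School"},
}
# Class concept -> instances created at classification time.
INSTANCES = {"University": {"NTNU"}}


def ref_instance(instance):
    """Reference set of an instance: only documents classified by the instance."""
    return {d for d, selected in CLASSIFIED.items() if instance in selected}


def ref_class(concept):
    """Reference set of a class concept, including those of all its instances."""
    refs = {d for d, selected in CLASSIFIED.items() if concept in selected}
    for instance in INSTANCES.get(concept, ()):
        refs |= ref_instance(instance)
    return refs


university_refs = ref_class("University")
ntnu_refs = ref_instance("NTNU")
```

A search for University returns d1 and d2, while a search for NTNU returns d2 only.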

Instances and Attributes
Instantiation is allowed at classification time since instances are
considered to increase the precision of the document representation.
The RML language is designed to model both individual and class
concepts graphically, but listing all individual concepts at modelling
time is a tedious task and if illustrated graphically it will lead to a model
cluttered with detail.
Attributes are properties and values attached to the instances of a class
concept. In our current definition, attributes are only applied in order to
create instances of a class concept. Instantiating a class concept
requires a user to supply values for all defining attributes of that class
concept. Supplying a value for the non-defining attributes is optional.
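A minimal sketch of this instantiation rule is given below. The function `instantiate`, the dictionary layout and the attribute names `name` and `birth_year` are assumptions chosen for illustration; the text does not prescribe a data structure.

```python
def instantiate(concept_name, defining_attributes, values):
    """Create an instance of a class concept.

    All defining attributes must be given a value; any other attribute
    in `values` is treated as an optional, non-defining attribute.
    """
    missing = set(defining_attributes) - set(values)
    if missing:
        raise ValueError(f"missing defining attributes: {sorted(missing)}")
    return {"concept": concept_name, "values": dict(values)}


# 'name' is assumed to be the only defining attribute of Pupil.
pupil = instantiate("Pupil", {"name"}, {"name": "Jon Atle", "birth_year": 1990})
```

Calling `instantiate("Pupil", {"name"}, {"birth_year": 1990})` would raise a `ValueError`, since the defining attribute is not supplied.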

5.2.2.   Generalisation operations
The generalisation/specialisation hierarchies of the domain model are
used to increase precision in classification and retrieval. By nature,
precision is increased by traversing the hierarchies downward, selecting
more specific concepts. By comparison:
  The reference set of a super-class Ref(C) constitutes a union of the set of references of
  all its sub-classes, i.e. Ref(C) = ∪i Ref(Ci) for all Ci ⊆ C.

The term lists for each class concept are still applied as designators for
the concept in text. Matching the document text against the term-lists of
the concepts in the model, may suggest the presence of several of the
concepts in the hierarchy. At classification time, a user can only specify
concepts at one level in the hierarchy. As a general rule of thumb, a
more specific concept should always be preferred (Almo, 2003). If
specific concepts are not applied whenever possible, the generic
concepts in a hierarchy will soon be overloaded with documents and
thus become increasingly useless in retrieval. Also classification along
generic concepts only will naturally limit the possibilities for
specialisation at query refinement.
At retrieval time, query expansion rules will indicate whether a selection at
one level also will include all specialised concepts.
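The union definition Ref(C) = ∪i Ref(Ci) can be sketched recursively. The hierarchy and the direct classifications below are hypothetical. The sketch also counts documents classified directly by the generic concept as part of its reference set, which is an assumption about how C itself is treated.

```python
# Hypothetical generalisation hierarchy: concept -> direct sub-classes.
SUBCLASSES = {"Patient": ["Pupil"], "Pupil": []}
# Documents classified directly by each concept.
DIRECT = {"Patient": {"d1"}, "Pupil": {"d2", "d3"}}


def ref(concept):
    """Reference set of a concept: its own documents plus those of all sub-classes."""
    refs = set(DIRECT.get(concept, set()))
    for sub in SUBCLASSES.get(concept, []):
        refs |= ref(sub)
    return refs


patient_refs = ref("Patient")
pupil_refs = ref("Pupil")
```

Selecting the more specific concept Pupil narrows the result from three documents to two, which is the precision gain obtained by moving downward in the hierarchy.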

5.2.3.   Binary Relations
For document retrieval, explicitly represented relations provide the ability:
   1. To support domain model browsing and semantic "discovery" of
      the domain, thus enabling explorative query formulation and
   2. To restrict queries. Explicit selection of a relation represents a
      restriction with respect to a query only including the two concepts
      participating in the relation. That is: Ref(R) ⊆ Ref(C1) ∩ Ref(C2),
      where R represents the model fragment constituted by the two class
      concepts participating in the relation (C1 and C2) and the binary
      relation between them.

Relations in the domain model describe the real-world relation between
two concepts. The realisation of a relation in the context of document
retrieval can be either:

     A statistical measure of relatedness between two concepts,
      calculated by their occurrences in the document collection at
      hand. (Haddad & Latiri, 2003; Sølvberg et. al., 1991; van Zwoel,

     A relation name adds semantics to the relation, i.e. it represents a
      specific aspect of the relation between two concepts. Naming a
      relation resembles the definition of a predicate in logic, e.g.
      examines(Doctor, Patient).

In both cases, the existence of a relation supports browsing and
explorative searching, since it visualises the relations in the domain and
provides the user with input in how to extend or specialise the search.
Statistically weighted relations are used in some approaches in order to
perform default expansion of queries (Sølvberg et al. 1991). A statistical
measure of relatedness provides an underpinning of the existence of a
relationship (or the validity of such) in the document collection. It also
offers a specialised search compared to a search for only the two end
concepts (without the relation).
In a user-centred approach, however, a pure statistical measure for a
relation does not provide the user with any added semantics for the
relation. The latter option, providing meaningful names for a relation,
yields a human readable definition of the relationship. Also, domain
model relations may be quite different from relations that can be
determined by statistical analysis of the document collection. Linguistic
or statistical analysis (section 6.2) may yield input to a number of
different relations, based on general term association analysis, term
proximity analysis or cluster analysis, or by determining the syntactic
functions of a term or term combinations. While such automated
analysis provides interesting input to the modelling of relations, these
need not be the relations that are interesting to the domain modellers
and stakeholders. In our approach, we therefore opt for the use of
manually defined relations with meaningful names in the domain model.
Instantiation of binary relations – relation naming
For class concepts, we have allowed the possibility to add instances
during classification time. A similar approach may be taken for the
relations. Meaningful and generic relation names may be hard to find
during modelling time, and several names could be applied to the same
relation. Creating a model based on existing ontologies, dictionaries or
thesauri, we might also find several unnamed relations in the source
material. Based on this, we have selected to give the users the ability to
provide relation names at classification time. This is what we refer to as
instantiation of a relation, even if this is not instantiation in the proper
sense.
    Instantiation of relation is defined as the possibility to provide a relation name to an
    unnamed binary relation in the model at classification time. All relation names provided
    at classification time are stored with the model and will be visible for subsequent

When allowing instantiation of relations at classification time, one binary
relation may have several names. We therefore have to restrict the
interpretation of a binary relation.
    For relations with multiple names, the set of references is restricted to the documents
    that are classified according to the particular name, such that: Ref(Rn) ⊆ Ref(R) ⊆
    (Ref(C1) ∩ Ref(C2)), where Rn is the named relation, R is the unnamed relation and C1 and
    C2 are the two end concepts of the relation.
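The chain Ref(Rn) ⊆ Ref(R) ⊆ (Ref(C1) ∩ Ref(C2)) can be checked on a small example. The classification records below are hypothetical; "examines" is the relation name used earlier in the text, while "treats" is an invented second name for the same relation.

```python
# Hypothetical classifications: selected concepts and named relations per document.
DOCS = {
    "d1": {"concepts": {"Doctor", "Patient"},
           "relations": {("Doctor", "Patient"): "examines"}},
    "d2": {"concepts": {"Doctor", "Patient"},
           "relations": {("Doctor", "Patient"): "treats"}},
    "d3": {"concepts": {"Doctor", "Patient"}, "relations": {}},
    "d4": {"concepts": {"Doctor"}, "relations": {}},
}


def ref_concept(concept):
    """Documents classified by the given concept."""
    return {d for d, c in DOCS.items() if concept in c["concepts"]}


def ref_relation(c1, c2, name=None):
    """Documents classified by the relation between c1 and c2.

    With name=None any name counts (the relation R); otherwise only
    documents classified by that particular name (the named relation Rn).
    """
    refs = set()
    for d, c in DOCS.items():
        given = c["relations"].get((c1, c2))
        if given is not None and (name is None or given == name):
            refs.add(d)
    return refs


ends = ref_concept("Doctor") & ref_concept("Patient")
unnamed = ref_relation("Doctor", "Patient")
named = ref_relation("Doctor", "Patient", name="examines")
```

Each step restricts the result further: three documents mention both end concepts, two are classified by the relation, and one by the specific name.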

All relation names provided by users during classification are stored as a
synonym name for this relation and are provided as input to later users
selecting the same relation.

5.2.4.     Relation constraints
The RML model language defines cardinality and coverage constraints
for binary and n-ary relations. Neither of these constructs is defined for
semantic retrieval in our approach.
Coverage constraints are used to specify whether all elements of a class
concept must participate in the relation. The intuitive mapping of this to
document retrieval would be to require inclusion of all mandatory
relations for a class concept, whenever the concept is included in a
document representation. This is however a rather strict interpretation:

   In order to include a relation in a document representation, we
    have required both participating concepts to be selected. For
    relations where not both concepts must participate in the relation
    (i.e. where both concepts do not have full coverage constraints),
    this immediately breaches the semantics of a coverage constraint.

   It may lead to a chaining of required selections, if the related
    concept has mandatory participation in another relation, this
    relation will have to be included too and thus the selection
    propagates throughout the model.

   It is difficult to define textual designators for mandatory
    participation. Using a statistical measurement of relatedness, one
    could define that all relations with a confidence above a certain
    threshold should be mandatory, but it is still likely that this
    relation does not hold for every document in the collection. Using
    relation names, mandatory participation would require that all
    documents classified according to this relation must include the
    relation name or one of its synonyms.
In the end, we deem coverage constraints to be too strict a requirement
for user selection of model-based document representations.
Cardinality constraints are used to define the number of elements from
the concepts that participate in the relation. We define the semantics of
a domain model construct in terms of its set of references to the
document collection, and for a given document representation, a model
concept is either selected or not. We do not include any measure of how
many documents may be referred to by a class concept. Thus,
cardinality constraints are not applied in our approach.
Approaches that create a knowledge base of meta-data statements
based on an ontology, such as the Ontobroker system (Fensel et al.,
1998-1999), apply both cardinality and coverage constraints.
However, these systems extract or generate meta data statements from
documents and consider the meta data statements as the instances of
the ontology. Queries are executed towards the knowledgebase directly,
rather than to the documents. In such an approach, the ontology then
defines the “schema” of the meta-knowledge. In our approach, our
intention is to establish direct references from model to the documents,
without constructing a knowledgebase of meta data statements.

5.2.5.     N-ary relations
By definition, the N-ary relation construct in RML consists of a Relation
Class concept and a set of binary ("part-of") relations to the participating
class concepts. The interpretation of an n-ary relation as a composition
mechanism (shown in figure 4.5) implies the part-of relations to be

     The reference set of an N-ary relation is defined as the
      intersection of the set of references of all the binary relations
      participating in the concept: Ref(NRC) = ∩i Ref(Ri), where Ref(Ri)
      represents the reference set of the binary part relation i
      participating in the N-ary relation.
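The intersection Ref(NRC) = ∩i Ref(Ri) is straightforward to compute. In the sketch below the part relation names are those of the Consultation example used later in the chapter, while the reference sets themselves are invented for illustration.

```python
# Hypothetical reference sets of the binary part relations of Consultation.
PART_REFS = {
    "C.Conducted by": {"d1", "d2", "d4"},
    "C.location": {"d1", "d2"},
    "C.Of": {"d2", "d3"},
}


def ref_nary(part_refs):
    """Reference set of an N-ary relation: intersection over its part relations."""
    sets = list(part_refs.values())
    return set.intersection(*sets) if sets else set()


consultation_refs = ref_nary(PART_REFS)
```

Only d2 is classified by all three part relations, so it is the only reference of the N-ary relation under this definition.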
At classification time, the user will have to determine whether only the
aggregated class is relevant, or if the details given by the “part”
relations are required, in order to express the interpretation of the
document at hand. At retrieval time, default expansion rules will define
the a priori interpretation.

5.2.6.     Composed relations
The way we have defined RML, derived relations represent a shorthand
notation for naming relation paths. Thus, we have two options for
defining the semantics of a composed relation:
1. Treat a derived relation as a regular binary relation, with the
   semantics defined equivalent to a binary relation.
2. The derived relation can be interpreted in terms of the path of the
   underlying relations. In this case, the set of references of the relation
   is defined as the union of all the relations it is composed of. This is a
   more specific interpretation.
In most classification schemes, application of more specific and thus
precise constructs are encouraged whenever possible (Almo, 2003).
Following the more specific interpretation naturally favours precision
rather than recall.
However, users interacting with the model may have difficulties
distinguishing a derived relation from an ordinary binary relation, and
the latter interpretation can lead to quite strong – and, for the unaware
user, unexpected – restriction of the query, especially with long relation
paths.
Both possibilities are meaningful and could be equally applicable. A
priori it is difficult to prefer one before the other. In general, we would
choose the use of the more specific option, while in cases where this
proves too strict or too complex for the users, the latter option would be

5.3. Classification as instantiated model fragments
A document representation for a single document is created by
selecting and possibly instantiating a domain model fragment.
Classification then, must consider two steps:
     1. The selection of the appropriate model fragment.
     2. If relevant, instantiation of the fragment.
As discussed earlier, the selection of a model fragment is the
responsibility of the user. The selection reflects his or her interpretation
of the document at hand. Using the domain model lexicon and the
defined textual designators for the concepts in the model, the system
may suggest a classification, but we leave the final touch to the user.
However, all selections must be valid model fragments and are subject
to selection constraints and well-formedness constraints. To illustrate
selection and instantiation of a model fragment and to motivate the use
of such constraints, we examine a small example:
Figure 5.3 shows an example RML model from the "health care in school
sector" (helseskole) domain [59]. The model shows that each school is
connected to a medical centre, in which their pupils get medical
examinations. A medical examination is modelled as a "Consultation" –
an N-ary relation between the conducting doctor, the location (the
medical centre) and the patient in question.

   [59] This is an example from the Norwegian knowledge centre for IT in health
   services (KITH). The example model is created from the domain terminology document
   “Definisjonskatalog for helsestasjons- og skolehelsetjenesten” (Nordhuus and Ree,
   2002), and is an example we will apply repeatedly throughout the remaining chapters.

Along with the example model, we assume the following meta-data
statement that describes the contents of one document at hand:
 This document is about 'Surnadal VGS' routine medical examination of 'Jon Atle',
 performed by 'Arild'
Our task is now to select a well-formed fragment of the model, in order
to represent this statement. First, we observe the following:

     The statement refers directly to the class concept consultation
      (through its synonym medical examination) and the individual
      concept 'Arild' representing this particular physician.

     It refers indirectly to the concepts school and pupil through the
      mentioning of specific instances of these ("Surnadal VGS" and "Jon
      Atle", respectively).

                                              Figure 5.3

An example domain model. Medical doctors (DR.) are employed by a medical centre. A subset of medical
centres is connected to a school in order to provide mandatory public health services to the school’s pupils.
A pupil is hence modelled as a subset of patients. A medical examination is modelled as a consultation,
which is an aggregated class construct that defines the particular medical doctor, the location of the
consultation and the patient.

   The statement also introduces "inconsistency" compared to the
    model, as it directly states that it is the school that is the location
    of the examination, while this in the model is represented as
    something to be conducted by the school's medical centre, and
    not the school itself.
   The statement refers directly to one relation: "performed by", a
    synonym of "C.Conducted by".
   The statement refers implicitly to the relations C.location and
    C.Of, through their participation in the N-ary relation Consultation.
   Originally the construct is fragmented, that is the referenced
    concepts do not form a connected model fragment.

The user is faced with several questions in order to select a model
fragment that reflects the statement:

   What is the boundary of the selection? How many concepts should
    be selected?
   Should all relations be selected in order to form a connected
    model fragment?
   How many instances should be listed? Only one (the doctor "Arild") is
    explicitly represented in the model.

To "regulate" these and related questions, we define two sets of
constraints: Well-formedness constraints that define the nature of a
well-formed instantiated model fragment in terms of syntactic and semantic
requirements, and Selection constraints that are used interactively to
regulate users' selections and enforce well-formedness constraints.

A well-formed model fragment is defined in terms of its conformance to
the RML meta-model (Chapter 4), a definition that ensures syntactic
correctness of the fragment. For document retrieval, we must also
ensure that the model fragment is represented correctly according to
the referential semantics of the RML constructs.

Thus, an instantiated model fragment is well-formed if:
Class concepts and instances:

   All concepts referenced – directly and indirectly – in the meta-data
    statement are included in the fragment.

   All instances mentioned in the meta-data statement are
    constructed as instances and included in the instantiated model
    fragment. To be correctly represented, all instances must be
    supplied with all defining attributes of the class it is determined to
    be a member of.

Generalisation operations:

     All selections of concepts in generalisation hierarchies are
      performed at one level (single-level specificity). A selection of
      concepts at several levels would violate the specificity of the
      generalisation hierarchy, as discussed above. Query operations
      will ensure proper retrieval behaviour for generalisation
      hierarchies. Within one level of the generalisation hierarchy, a
      selection is well formed if it follows the semantics of the particular
      generalisation operation that generalises this level; a selection at
      a disjoint level cannot include more than one class concept, while
      this is allowed in the general (isa) and overlapping operations.
Relation concepts:

     All relations in a model fragment are given a name. In particular,
      un-named relations in the domain model must be given a name
      (instantiated) at classification time.

     All composed relations are "spelled out" in the fragment, that is
      the complete path of the composed relation is selected. For a
      spelled out composed relation to be well-formed, all statements in
      the path must follow the well-formedness constraints. This
      conforms to the more specific interpretation of composed
      relations.
     A model fragment should be connected. This ensures that a
      document is described in a cohesive manner, and focuses the
      meta-data statements to one part of the model.
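One of these constraints, single-level specificity, lends itself to a simple automatic check. The sketch below is a simplification: it only rejects a selection in which one selected concept is a generalisation of another, and the `PARENT` table is a hypothetical encoding of the hierarchy.

```python
# Hypothetical hierarchy: concept -> its super-class (None at the top).
PARENT = {"Pupil": "Patient", "Patient": None, "Doctor": None}


def ancestors(concept):
    """All generalisations of a concept, following the hierarchy upward."""
    found = set()
    parent = PARENT.get(concept)
    while parent is not None:
        found.add(parent)
        parent = PARENT.get(parent)
    return found


def single_level(selection):
    """True if no selected concept is a generalisation of another selected one."""
    return not any(ancestors(c) & selection for c in selection)


ok = single_level({"Pupil", "Doctor"})
violation = single_level({"Pupil", "Patient"})
```

Selecting Pupil together with Patient is rejected, since Patient generalises Pupil; selecting Pupil together with Doctor is accepted because the two concepts belong to different hierarchies.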
While the well-formedness constraints regulate the quality of a selected
model fragment, selection constraints state how users' interaction
should be limited or guided in order to produce well-formed fragments.
Selection constraints must be implemented with adequate enforcement
mechanisms, such as default expansion of a selection, user warnings,
check lists etc. in order to help users through selection.
Both well-formedness and selection constraints are troublesome to
define in advance for the general case. Enforcing correct
model fragments can be awkward for users not familiar with modelling
or the applied modelling language. What is defined to be correct from
the model semantics point of view is not always correct from a user
point of view. As an example, the well-formedness constraint that
requires a model fragment to be connected can be confusing to a user.
If a user on the one hand is given the freedom of selection, yet on the
other hand is limited to selecting only one "spot" in the model and must
make sure everything is connected to this, it will limit the user's ability
to express her view on the semantics of the document. Another problem with
selection constraints defined from the model point of view is that they
put a stronger requirement on the users' understanding of the modelling
language. The well-formedness constraints given above would, for
example, require the users to be familiar with all the different
generalisation operations and also with the concept of composed
relations.
The only solution is to allow the selection constraints and consequently
the enforcing mechanisms to be adaptive, and to determine the level of
strictness and stringency required according to the application setting.
In the general case, we may devise selection constraints that enforce the
well-formedness constraints given above.
   1. A selection must contain at least N, at most M concepts. This is a
      normative constraint only, and has nothing to do with the
      semantics of the RML language. This is simply a rule of thumb
      and adequate levels of N and M must be defined in accordance
      with the specific setting.
   2. In order for a fragment to be connected, the fragment is expanded
      through a shortest connecting path mechanism that determines
      the shortest connecting path between a concept that is not
      connected to the selection and the closest concept in the
      selection. Users are free to select another connecting path, if
   3. For a relation concept, both relation ends, i.e. the non-relation
      concepts participating in the relation, must be included in the
      selection. This is done through default expansion.
   4. All selected relations must be given a name. Names are suggested
      by a list of earlier supplied names for this particular relation. If no
      previous name has been provided, the user must construct a
      name by himself.
   5. Selecting and naming a derived relation triggers a default
      expansion of the selection with the complete derived relation path,
      or the parts of the path not already included in the selection.
   6. Selecting an n-ary relation class concept triggers a default
      expansion along all participating binary ("part") relations of the n-
      ary relation.
   7. A selection cannot contain a generalisation-operation concept.
      Selections to a generalisation operation concept are diverted to
      the more generic concept. Selections at several levels in the same
      generalisation hierarchy are denied. Within one level multiple
      selections are enabled or disabled depending on the
      generalisation operation construct "above" the level.
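The shortest connecting path mechanism of constraint 2 can be sketched with a breadth-first search over the model graph. The adjacency table below is a hypothetical, undirected rendering of the example model, and the function names are assumptions made for this illustration.

```python
from collections import deque

# Hypothetical undirected adjacency between concepts of the example model.
GRAPH = {
    "School": {"Medical centre", "Pupil"},
    "Medical centre": {"School", "Doctor", "Consultation"},
    "Doctor": {"Medical centre", "Consultation"},
    "Consultation": {"Doctor", "Medical centre", "Patient"},
    "Patient": {"Consultation", "Pupil"},
    "Pupil": {"Patient", "School"},
}


def shortest_connecting_path(start, selection, graph=GRAPH):
    """Breadth-first search from an unconnected concept to the closest selected one."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] in selection:
            return path
        for neighbour in sorted(graph[path[-1]]):  # sorted for a stable result
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # the concept cannot be connected to the selection


def expand(selection, stray):
    """Default expansion: add the stray concept and its connecting path."""
    path = shortest_connecting_path(stray, selection)
    return set(selection) | set(path or [stray])


path = shortest_connecting_path("School", {"Consultation", "Doctor"})
expanded = expand({"Consultation", "Doctor"}, "School")
```

For the selection Consultation and Doctor, the stray concept School is connected through Medical centre. A user interface would present this path as a default and let the user choose another.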

5.4. RML as query language
Figure 5.4 shows the previously described retrieval process found in
modern retrieval systems, compared to the steps necessary for model
based retrieval. The steps of the query process specific to model based
retrieval are:

     Query Formulation: A query is formulated by selecting relevant
      concepts and relations in the domain model. If desirable, the user
      may start out by expressing the query in natural language. The
      natural language query is then matched with the domain model
      through the domain model lexicon and the query is visualised in
      the domain model. The user can then refine the selection

     Query Transformation: In our case, query transformation is
      performed through default expansions, defined based on the RML
      semantics. Traditional IR systems perform lexical transformations
      such as normalisation (e.g. stemming), spell checking and
      phrasing/anti-phrasing. In our case, this is already handled since
      both document representations and queries are represented as
      domain model fragments.

                                             Figure 5.4

         The query process: a) standard query process, b) steps specific to model based querying

After the appropriate transformations are applied, queries are executed
and relevant documents are retrieved. The result set is analysed and
presented to the users. In traditional IR systems, result set analysis may
include ranking, clustering/relevance feedback and categorisation [60]. In
our approach, we perform:
       1. Model based ranking and visualisation of results: Ranks are
          calculated according to concepts. The number of documents and
          hits for each concept is visualised in the model, in order to enable
          model based query refinement.
       2. Model interaction for query refinement: As the results are
          visualised in the model, model interaction will allow exploration of
          the result set and further query refinement.
Each of these steps will be explored in the sequel.

5.4.1.         Query formulation
In our approach, query formulation is performed by interacting with the
model and selecting model concepts. When desired, a user may also
start by formulating a query in natural language and have the query
mapped to a model-based query. In traditional IR systems, two common
notations exist: The MATH notation and the Boolean notation (Sullivan,
2001b). Table 5.1 shows the features supported by these notations and
the corresponding mapping to a model based query formulation.
Use of the "near" operator at search time is not meaningful in our
approach, as we are searching stored classifications and not the full
document text. Retrieval of concepts "near" each other must in our
approach be expressed by way of the relations between them.
As defined earlier, the technique of phrasing refers to detecting term
combinations and sequences that should be handled as a unit rather
than as separate words, such as proper names or compound noun
phrases. In our approach, the matching of query terms against a phrase
list is implicit in the matching of the query against the domain model,
thus we will detect those phrases that match a concept in the model or
are included in the model dictionary. Again, since we do not match the
query against the full text of documents the explicit marking of phrases
will give no effect in our approach.
Approximate matching may be applied in the matching between query
terms and the domain model. Approximate matching, i.e. the handling of
syntactic variations at term level, and strategies for handling these will
also be treated in chapter 6.

     [60] These techniques are described in more detail in chapter 2.

                                           Table 5.1
Feature              MATH notation            Boolean notation         Model mapping
Include term         + term                   AND term                 Select concept if the term matches
Exclude term         - term                   NOT term                 Not-select the concept if the term matches
Match any term       Default                  OR combinations          Optional
Match all terms      Default or use of '+'    AND combinations         Default
Nesting              -                        (Use of parentheses)     Canonical form only
Near                 -                        NEAR N words             N/A
Phrasing             "use of quotes"          "use of quotes"          N/A
Stemming             Option                   Option                   Domain model
Approximate search   Option or wildcards      Option or wildcards      Optional in the matching
Feature search       feature:                 feature:                 Separate part of UI
           Model-based query “notation” compared to standard search engine query notations

Feature search allows for searching in the extracted contextual meta-
data about a document – such as its title, location (URL) etc. In our
approach, this will be handled separately, and a specific interface must
be made available to allow specification of search according to all
recorded meta-data attributes.

5.4.2.     Model based query transformations
The model based query transformations we apply are limited to
expanding the query by following the semantics of the RML constructs.
Each selection of a concept will trigger a default query expansion, such
that:
      a) If the selection is a binary relation, both relation ends will be
         included in the query.
      b) If the concept represents the "whole" concept of an n-ary relation
         concept (aggregation), all part-sets will be included in the query.
      c) If the concept is involved in a generalisation/specialisation
         hierarchy, the default expansion will include all more specific
         concepts in the query, i.e. all subsets will be included in the
         query.
We refer to the result of a selection with its default expansion as a
selection group. A selection group consists of the selected concept along
with all concepts included as a result of the default expansion from that

122                                                RML based document classification and retrieval
concept. Within the selection group, we apply an OR (or "match any")
strategy, that is, we will accept a document classified according to any
of the concepts within this group as a hit for this group. As is shown in
the figure, the same expansion rules also apply to not-selections.
By default, we apply a boolean search strategy. That is, the created
search expression constitutes an AND-combination ("match all") of the
selection groups. Our Referent-models are flat in the sense that we do not
allow concepts to be decomposed into a sub-model. Thus we have no
way of entering nested Boolean expressions, and our "match all" queries
represent conjunctive normal forms: AND combinations of selection
groups (OR expressions).
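
The selection-group semantics above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation described in this work: the toy hierarchy `SUBCONCEPTS`, the function names, and the treatment of a not-selection as "the document is classified by none of the concepts in the expanded group" are all assumptions made for the example. Only expansion rule c) is shown.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

# Hypothetical toy domain model: concept -> direct sub-concepts.
SUBCONCEPTS: Dict[str, List[str]] = {"A": ["A1", "A2"], "A1": [], "A2": [], "B": []}


def selection_group(concept: str, removed: FrozenSet[str] = frozenset()) -> Set[str]:
    """Default expansion (rule c): the concept plus all more specific concepts.

    Concepts listed in `removed` (and everything below them) are left out,
    mimicking a user who explicitly removes a concept from the group.
    """
    group: Set[str] = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in group or current in removed:
            continue
        group.add(current)
        stack.extend(SUBCONCEPTS.get(current, []))
    return group


def matches(doc_concepts: Set[str],
            selected: Iterable[Set[str]],
            not_selected: Iterable[Set[str]] = ()) -> bool:
    """AND across selection groups, OR within each group (conjunctive normal form)."""
    return (all(doc_concepts & group for group in selected)
            and not any(doc_concepts & group for group in not_selected))


def search(docs: Dict[str, Set[str]],
           selected: List[Set[str]],
           not_selected: Iterable[Set[str]] = ()) -> List[str]:
    """Return the ids of all documents whose classification satisfies the query."""
    not_selected = list(not_selected)
    return sorted(d for d, concepts in docs.items()
                  if matches(concepts, selected, not_selected))
```

For example, with documents classified as `{"d1": {"A1", "B"}, "d2": {"A"}, "d3": {"B"}, "d4": {"A2", "B"}}`, selecting A and B retrieves d1 and d4, since both sub-concepts of A count as hits for the first group.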

5.4.3.    Result presentation and query refinement
In most web search engines today, advanced techniques are applied on
the query results, mainly in order to enhance presentation to the user
and to support subsequent query refinement. In this section, we will
define how the most common techniques apply to our model-based approach.
Ranking schemes are applied in order to present the most relevant
documents first. In general three different measures are used for
calculating ranks (Baeza-Yates & Ribeiro-Neto, 1999):

   Term frequency (TF): The term frequency of a term ti for a
    document d is defined as the number of occurrences of ti within d.

                                           Figure 5.5

         Model-based search expressions constitute a boolean combination of selection-groups.

                                            Figure 5.6



Default expansion and selection groups. A) The selection of a binary relation includes both relation ends
(concepts A and B). The selection of a general concept expands by default to include all sub-concepts. These
are included within the selection group using boolean OR. This illustrates rules a) and c) above. B) Default
expansion can be modified by the user. Left: A2 is explicitly removed from the selection group. Right: An
explicit selection of the generalisation operation "inverts" the inclusion of the sub-classes.

     Document frequency (DF) and inverse document frequency (IDF):
      the document frequency of a term ti is defined as the number of
      documents that contain the term ti. The inverse document
      frequency is defined as the total number of documents in the
      collection divided by the number of documents that contain ti. In
      the commonly applied TF*IDF ranking scheme, the IDF factor is
      used to dampen the effect of the TF factor, since it gives higher
      ranks to terms that occur in few documents and marginalises
      terms that occur in all documents.

     Collection frequency (CF): The collection frequency of a given term
      ti is the total number of occurrences of ti within the whole document
      collection, that is $CF(t_i) = \sum_{d} tf(t_i, d)$. In settings with an
      open-ended document collection, the CF factor is normally not used.
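
As a small illustration of the three measures, the following Python sketch computes them for a tokenised toy collection. The function name `term_stats` and the input format are assumptions for the example. The IDF is computed as the plain ratio defined above; practical systems commonly take a logarithm of this ratio.

```python
from collections import Counter
from typing import Dict, List, Tuple


def term_stats(docs: Dict[str, List[str]]) -> Tuple[Dict[str, Counter], Counter, Dict[str, float], Counter]:
    """Compute TF, DF, IDF and CF for a collection of tokenised documents."""
    # TF: occurrences of each term within each document.
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    df: Counter = Counter()  # number of documents containing the term
    cf: Counter = Counter()  # occurrences of the term in the whole collection
    for counts in tf.values():
        df.update(counts.keys())
        cf.update(counts)
    n_docs = len(docs)
    # IDF as defined in the text: collection size divided by document frequency.
    idf = {term: n_docs / df[term] for term in df}
    return tf, df, idf, cf
```

A term that occurs in every document gets an IDF of 1.0 under this definition, which is the lowest possible value and reflects that such a term does not discriminate between documents.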
With our model based retrieval, the use of these measures for ranking is
not trivial. We are not indexing on terms but on model concepts, and for

a given document representation and a given query, a concept is either
selected or not, analogous to the boolean retrieval model, where all
weights are either 0 or 1. With a boolean model, there is no notion of
term frequency. Furthermore, at classification time, a user is free to
select model concepts that were not designated by any term in the
document, which also breaches the principle of TF based weighting.
With boolean weighting of terms, the CF measure equals the DF
measure. DF, or rather its inverse, the IDF factor is used to give higher
weight to more specific terms, since it favours terms that occur in few
documents. For a given query however, the IDF measure does not yield
discriminating weights with the boolean retrieval model, since all
retrieved documents are required to contain the concepts of the query.
Still, IR systems that apply TF*IDF ranking partly support boolean
formulation of queries through the MATH notation, which allows users to
require specific terms in the result documents. This is performed by
giving a significantly high weight to the required query terms. Again, we
strive to apply standard IR machinery, and thus have to conform
to the actual ranking system in use. Most web search engines use an
intricate formula for calculating ranks, which includes mechanisms to
boost certain aspects of a hit (e.g. terms found in the title seem more
relevant than terms found in the body).61 In general, ranking
configurations in practical use are a result of experiments and tuning, and
this tuning must be conducted with the actual IR system to be applied.
In our current implementation, we simply calculate ranks based on the
occurrence counts of the found concepts and relations. That is, we boost
the documents that contain textual designators for the retrieved
concepts. All retrieved documents must match the query expression, but
the ones that contain the highest number of terms designating a query
concept are presented first.
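
A minimal sketch of such occurrence-count ranking follows. It is an illustration under simplifying assumptions, not the actual implementation: the function name `rank_hits`, the `designators` dictionary (concept name to single-word designating terms) and whitespace tokenisation are all hypothetical.

```python
from typing import Dict, List, Tuple


def rank_hits(hits: Dict[str, str],
              designators: Dict[str, List[str]],
              query_concepts: List[str]) -> List[Tuple[str, int]]:
    """Order already-matching documents by how often they mention query concepts.

    hits: document id -> document text (only documents that satisfy the query).
    designators: concept name -> lower-case single-word terms designating it.
    """
    scored = []
    for doc_id, text in hits.items():
        tokens = text.lower().split()
        score = sum(tokens.count(term)
                    for concept in query_concepts
                    for term in designators.get(concept, []))
        scored.append((doc_id, score))
    # Highest count first; ties are broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Documents classified under a query concept without containing any of its designators still appear in the result, with a score of zero, which matches the requirement that ranking only orders the retrieved set.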

Result presentation
The retrieved documents will be presented as a ranked list. However, the
domain model will be our main interface for subsequent query
refinement. We have developed a visualisation that presents an overview
of the query results in the model, and thereby shows the distribution of
hits to the model concepts:

    The total distribution of domain model concepts in the documents
     is visualised, that is, all concepts and relations that were
     referenced in the retrieved set of documents are marked with
     proportional markers, indicating the number of hits for this
     concept. Figure 5.7 shows a stylistic example of the notation.

  Other measures are also used to rank documents; the most widely known is perhaps Google's PageRank
 mechanism (Brin and Page, 1998).

                                        Figure 5.7

                       Stylistic visualisation of hits in the domain model.

     Also concepts not included in the original search expression are
      marked with hits. These are concepts not originally considered by
      the user, but are concepts that will narrow the search if included
      in the search expression.

     The actual number of hits to a concept can be shown on request.
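
The hit counts behind such a visualisation can be derived directly from the classifications of the retrieved documents. The sketch below assumes a hypothetical representation in which each retrieved document id maps to the set of model concepts it is classified by; the function name is likewise an assumption.

```python
from collections import Counter
from typing import Dict, Set


def hit_distribution(retrieved: Dict[str, Set[str]]) -> Counter:
    """Count, per model concept, how many retrieved documents reference it."""
    counts: Counter = Counter()
    for concepts in retrieved.values():
        counts.update(concepts)
    return counts
```

Because every concept referenced by a retrieved document is counted, concepts outside the original search expression receive hit counts as well, and a concept that no retrieved document references reports zero hits.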

Query refinement
Fundamental to our approach is the ability to interact with the domain-
model and exploit the visualised results. Ideally, all query refinement in
our approach should take place by way of model interaction (even if a
user may choose to alter the original text query expression). Model
interaction for query refinement will have the following effects:

     Selecting or deselecting a concept will trigger an immediate re-
      execution of the now modified query. The modified query is
      executed on the complete set of documents, since a) restrictions
      to the query will anyhow not retrieve any documents outside the
      original retrieved set, while b) relaxations to the query will have to
      retrieve documents outside the original set.

     Selecting a concept with no hits produces an empty result set (as
      there are no retrieved documents containing this concept) while
      selecting a concept with many hits naturally returns a longer list
      of documents than concepts with fewer hits.

     The same default expansion rules apply for query refinement – as
      for the initial query transformation.
In order to support query refinement, most retrieval systems today offer
techniques that will support the users in narrowing down the search. In
Relevance feedback, one distinguishes between three sets of
documents: Dr, the relevant documents retrieved; Dn, the non-relevant
documents retrieved; and Cr, the set of relevant documents in the whole
collection. The idea behind a user relevance feedback strategy is to have
the user indicate the set Dr and then calculate a modified query that will
more accurately retrieve and rank documents from the total set of
relevant documents Cr. Supporting relevance feedback necessitates
some mechanism to group the retrieved documents into sets.

Two techniques are common here:
   Categorisation: If the search engine supports the use of
    categories, as mentioned earlier, the retrieved set of documents
    will be matched against the categories. The users may then
    choose the most relevant category to elaborate on. Either further
    retrieval may be limited to within this category only, or the
    category documents and the term-based definition of the category
    may be used to calculate a modified query that will retrieve more
    relevant documents from the total collection.
   Clustering: In clustering, terms from the most relevant
    documents (i.e. the n top-ranked ones, or the ones selected by the
    users), are used to retrieve a larger set of relevant documents
    from the Cr. An association cluster is calculated based on term
    similarity, and the original query will then be expanded with the
    highest ranked terms according to the similarity measures.
    Several strategies for computing clusters exist, which will result in
    anything from several small and cohesive clusters to larger and
    more diverse clusters.
Visualizing the query results in the domain model allows us to perform a
relevance feedback strategy based on user feedback at the concept
level. Rather than calculating new query-terms the user feedback
automatically provides modified queries. Thus at retrieval time, our
relevance feedback approach is similar to those applied in
categorisation, only performed on concept level, rather than the
generally more abstract categories. In our approach, the calculation of
term similarity and term sets (sometimes denoted "searchonyms") found
in probabilistic clustering approaches is rather applicable as an analysis
technique when building the domain model dictionary (Chapter 6).

5.5. Mapping from RML fragments to RDF statements
Recall that one of the principles behind our approach is to be able to
realise our system through available standard IR systems or machinery.
In terms of document description on the web, the emerging standard is
the Resource Description Framework (RDF) language and its
surrounding technologies. In the sequel we describe a realisation of our
system through the use of RDF. Choosing RDF as representation format
not only enables us to rely on the emerging web standard, but also
makes our document descriptions available to other systems and
information-seeking agents, thus opening up our system for
A brief presentation of RDF was given in chapter 2. Here, we start by
specifying a meta-model of RDF and RDF-S. We then present an RDF
based architecture for the realisation of our system, before
we define transformations from a domain model in RML into meta-data
statements in RDF.

5.5.1.      An RDF meta model
To formalise our interpretation of the RDF and RDF-S concepts, we have
constructed a UML meta-model of the main constructs (Figure 5.8).
RDF statements are aggregates of subject (rdfs:Resource), predicate
(rdf:Property), object (any rdf:class, i.e. either a Resource or a Literal)
triples. All classes have a unique ID within the model namespace – id's
need not be human readable, this is the purpose of the Label. Resources
may also have a textual comment that can be used to explain its
Conceptual modelling features in RDF are introduced through the
rdf:Type, rdfs:SubclassOf, rdfs:Range and rdfs:Domain constructs. In
our mapping, we apply these extensively in order to represent class
membership, generalisation/specialisation and relations respectively.
In addition to the basic constructs, RDF also offers a set of collection
utility classes, also shown in figure 5.8.

                                              Figure 5.8

A meta model of RDF/S represented in UML. Prefixes specify the origin of the construct, RDF or RDF-S. This
meta-model represents our interpretation of the RDF-S specification of April 2002 (W3C-RDFS, 2002). In
particular, the construct of class and its relation to Resource and Literal is controversial, not only in our
interpretation, but is generally subject to debate and confusion in the RDF/S community. See e.g. (Pan &
Horrocks, 2001)

5.5.2.   Using RDF in the RML retrieval system
Figure 5.9 shows a principal illustration of the use of RDF in our
approach to classification and retrieval. We apply our RML language for
both classification and retrieval. In order to apply RDF, we must create
document descriptions in RDF. An instantiated RML fragment (describing
one document) must then be translated into a set of RDF statements for
that document. Similarly, our RML based queries will have to be
translated into an RDF query. In order to be able to utilise structural
RDF query languages such as RQL (Karvounarakis et al., 2002), the
domain model must also be transformed into an RDF-Schema. Available
RDF tools, such as the RDF suite (Alexaki et al., 2001), can then be
applied to store the document descriptions and to answer the queries,
respectively. In the sequel, our focus is to define a mapping that enables
us to perform a translation from RML statements into RDF. Figure 5.10
illustrates the applications of such a mapping, by extending the original
model-based classification and retrieval tasks (figure 5.2).

5.5.3.   Constructing a mapping
The goal of mapping from RML to RDF in our case, is to be able to
generate meta-data statements in RDF based on a selected instantiated
domain model fragment. Mappings between modelling languages and
corresponding models and instances can be characterised according to
their level and nature (Bowers & Delcambre, 2000; Seltveit, 1994):

   Language: Representations are expressed in a language.
    Mappings can be designed within one language (intra-language) or
    between different languages (inter-language).

   Level: adapted from regular database approaches, level is one of
    model, schema and instance. Mappings can be inter-level or intra-level.

                                          Figure 5.9

             Principal application of RDF/S for model-based classification and retrieval.

     Preservation of semantics: The interesting aspect from our point
      of view is the preservation of semantics between the two
      representations. The degree of information preservation yields
      different mappings:

       -   Translation: A Translation turns the representation from
           symbols in one language into symbols of another
           language. If the translation is performed without loss of
           data a correspondence between the two representations
           can be established.

       -   Transformation: A transformation represents a
           conversion into a new representation. A conversion
           implies a "fitting" into a new content and structure.

       -   Generation: A generation produces statements in one
           representation from an initial representation in another
           language. The constructs of the initial representation are
           considered as production rules for the generation of the
           final representation; no direct correspondence between
           the two representations is established.

     Determinism: A mapping is defined as deterministic if repeated
      applications of the mapping on the same original representation
      always give the same result.

                                           Figure 5.10

Model-based classification and retrieval, extended to apply RDF. The domain model must be transformed or
translated into an appropriate set of schema level RDF/S statements. A document description is used to
generate RDF meta-data statements, while a query must be translated into a structural RDF-based query.

                                             Table 5.2
XML            Levels      RML                           RML Example      RDF                        RDF Example
XML            Model       RML                           Class Concept    RDF/RDF-S                  RDF:Resource
DTD            Schema      Domain model                  Person           Domain model               DM:Person
XML Document   Instances   Instantiated model fragment   Person ("Jag")   RDF meta-data statements   DI:&P"jag"
Language levels and example representations, RML and RDF. For instantiated RML fragments, the instance
and schema levels are somewhat mixed. For RDF there is no strict separation between levels.

Table 5.2 shows the model, schema and instance levels with examples
for the RML and RDF languages. The division into levels is not clear-cut:
   For RML, the instance level model fragment is a mixture of schema
    (the model fragment) and instance (the instantiated parts of the
    fragment) levels.
   For RDF, there is no clear distinction between the levels. This is
    intentional in RDF and enables statements to be freely constructed,
    related and super-imposed. Namespaces are applied in order to
    separate the statements.
In the sequel we will show two variants of mapping from RML to RDF.

5.5.4.     First trial: Direct mapping
In the first attempt at designing a mapping, we approach the task from an
RDF perspective, i.e. we aim for a transformation of our metadata into RDF.
Again, a transformation represents a "fitting" of the representations into
the language offered by RDF. The constructed RDF statements are less
coloured by particular RML constructs. This yields as "pure" RDF data as
possible, ensures conformance with the RDF model, and enables the use
of standard RDF retrieval implementations.
The direct mapping illustrated in figure 5.11 has the following
characteristics:
   No explicit mapping takes place at model level. Implicitly, there is
    an approximate mapping that forms the basis for transforming
    RML constructs into RDF at the next level. This correspondence is
    used as production rules for the mapping at the schema-level.

   At schema level, we transform the domain model into RDF/RDF-S
    statements, following the production rules specified at model
    level.

      -    All statements are constructed within a domain model
           namespace.

                                 Figure 5.11

                             First trial – direct mapping

       -   The transformation is an approximation, since RML
           statements are mapped into corresponding RDF/RDF-S
           constructs, but RML statements with no corresponding
           constructs in RDF are left out, or approximated to a
           more generic construct, if possible.

     The instantiated domain model fragment is used to generate RDF
      meta-data statements. Again, constructs from the higher level are
      used in the statement production rules; that is, an instantiated
      model fragment will produce RDF statements in conformance with
      the transformed schema. All meta-data instances are produced
      within a Document Instance level namespace.

Figure 5.12 shows the direct mapping applied to the meta-data example
in figure 5.3. As mentioned, name-spacing is applied in RDF to separate
statements with respect to origination. Our statements have the
following origins:

     RDF and RDF-S constructs: The basic modelling primitives.

     Domain model statements (DM): RDF representation of the
      domain model. In the figure, these statements represent the
      generic parts and are included in order to connect the instances
      and model fragment statements together. Normally, the model
      translation would be stored in a separate graph.

     Document (instance) level meta-data statements (DI): The actual
      meta-data statements representing the document at hand.
Each of these is given a separate namespace prefix, as shown in the
figure.
As indicated in the figure, yet another namespace of statements must be
considered – the RML level. There are constructs in the domain model –
such as cardinality and coverage – particular to RML that cannot be
represented with basic RDF/RDF-S constructs. In this direct mapping,
we have discarded such information (to be discussed later).

The RDF model in figure 5.12 is generated using the following
production rules:

   1. Domain model level
           a. Each non-relation concept in the domain model (DM),
              whether an individual or a class concept, is represented
              as an RDF:Resource class:

           For all Ci(Name, Def) , where Ci ∈ DM(Individual concept) or Ci ∈ DM(Class concept):
                 RDFS:ID =,
                RDF:Label =Ci.Name,
                RDF:Comment = Ci.Def

                                          Figure 5.12

               Direct mapping: RDF-translation of the previous health-school example.

      b. Each individual concept in DM is connected to its
         corresponding class concept, by RDF:Type relations, that is:

      For all element generalisation operations Eop(Ic, Cc)
          where Ic ∈ DM(Individual concept) and Cc ∈ DM(Class concept)
      RDF:Type relation (RDF:Resource Cr, RDF:Class C)
          where Cr = RDF:ResourceClass(Ic) and C = RDF:Class (Cc)
          and where RDF:ResourceClass(C) represents the RDF:Resource constructed
                          for the DM concept C, according to rule a)

      c. Each binary relation in DM is represented as an
         RDF:Property class:

      For all Ri(Name) , where Ri ∈ DM(binary relation)
            and with relation ends C1, C2 ∈ DM(non-relation concept)
      RDF:Property(RDFS:ID,RDF:Label,RDFS:Domain,RDFS:Range), where
          RDFS:ID =,
          RDFS:Label = Ri.Name
          RDFS:Domain = RDF:ResourceClass(C1)
          RDFS:Range = RDFS:ResourceClass(C2)
          and where RDF:ResourceClass(C) represents the RDF:Resource constructed
                           for the DM concept C, according to rule a)

      d. Each Subset generalisation operation Sop(Cs, Css) is
         represented through RDFS:SubClassOf relations (not
         shown in the figure):
      For all Sop(Cs, Css), where Cs ⊆ Css
          RDFS:SubClassOf(RDFS:Resource RCs, RDF:Resource RCss), where
          RCs = RDFS:ResourceClass(Cs) and RCss = RDFS:ResourceClass(Css)

      e. Each attribute of a concept is represented as a property
         that points to a literal value:

      For all attributes Ai, where Ai ∈ Attributes (C) and C ∈ DM(Concepts)
          RDF:Literal LAi (RDF:Value = to be determined by instances),
          RDF:Property (RDF:Name, RDFS:Domain, RDFS:Range),where
          RDFS:Name = Ai.Name ,
          RDFS:Domain = RDF:Resource(C)
          RDFS:Range = LAi

      f. Each n-ary relation in DM is represented only by the binary
         relations it is composed of. No aggregation construct is
         introduced in this first, naïve mapping. (Not shown in the
         figure.)
    2. Document Instance level
           a. By the mapping in 1), each DM concept referred to by the
              meta-data statement is now represented by an
              RDF:Resource class. Thus for every individual concept
              represented in the document specific meta data

           For all Ci ∈ DI and C ∈ DM
               RDFS:Resource (RDFS:ID, RDF:Label), where
               RDFS:ID = DI.Name:Ci.Name
                   RDF:Label = Ci.Name
                   RDF:Type relation (RDF:Resource(Cr), RDF:Class(C) ), where
                   RDF:ResourceClass(C) represents the RDF:Resource constructed
                   for the DM concept C, according to rule 1a)

           b. By the mapping in 1) each DM Relation is represented by
              an RDF:Property class. Thus:

           For all DRi ∈ DI and there is an Ri(Name) , where Ri ∈ DM(binary relation)
                 and with relation ends either CI1, CI2 ∈ DI(non-relation concept) or C1, C2 ∈
           DM(non-relation concept)
                RDF:Property PDI (RDFS:ID,RDF:Label,RDFS:Domain,RDFS: Range), where
                RDFS:ID =,
                   RDFS:Label = DRi.Name

                   If CI1, CI2 are instantiated in DI according to rule 2a):
                              RDFS:Domain = RDF:ResourceClass(CI1)
                              RDFS:Range = RDFS:ResourceClass(CI2)
                              where RDFS:ResourceClass(CI) is the RDF:Resource constructed
                                     for the DI concept CI according to rule 2a)
                   Otherwise:
                              RDFS:Domain = RDF:ResourceClass(C1)
                              RDFS:Range = RDFS:ResourceClass(C2)
                              where RDF:ResourceClass(C) is the RDF:Resource constructed
                                     for the DM concept C, according to rule 1a)
                   RDF:Type relation (PDI, PDM) where,
                   PDI is the DI level RDF:Property
                   PDM the corresponding DM level RDF:Property
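
To make the production rules more concrete, the following Python sketch generates subject/predicate/object triples for rules 1a) to 1d) and 2a) to 2b). It is an illustration under stated assumptions, not the implementation used in this work: triples are plain string tuples, the predicate strings follow the capitalised prefixes used in the rules above rather than the literal W3C vocabulary, identifiers are formed as `DM:<name>` and `DI:<name>` (a hypothetical scheme, since the ID expressions are not spelled out here), each concept has at most one instance per document, and attributes (rule 1e) are omitted.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def map_domain_model(concepts: Dict[str, str],
                     element_of: List[Tuple[str, str]],
                     relations: Dict[str, Tuple[str, str]],
                     subsets: List[Tuple[str, str]]) -> List[Triple]:
    """Schema level: turn a (toy) domain model into RDF-style triples."""
    triples: List[Triple] = []
    for name, definition in concepts.items():        # rule 1a
        rid = f"DM:{name}"
        triples += [(rid, "RDF:Type", "RDFS:Resource"),
                    (rid, "RDFS:Label", name),
                    (rid, "RDFS:Comment", definition)]
    for individual, cls in element_of:                # rule 1b
        triples.append((f"DM:{individual}", "RDF:Type", f"DM:{cls}"))
    for rel, (c1, c2) in relations.items():           # rule 1c
        pid = f"DM:{rel}"
        triples += [(pid, "RDF:Type", "RDF:Property"),
                    (pid, "RDFS:Label", rel),
                    (pid, "RDFS:Domain", f"DM:{c1}"),
                    (pid, "RDFS:Range", f"DM:{c2}")]
    for sub, sup in subsets:                          # rule 1d
        triples.append((f"DM:{sub}", "RDFS:SubClassOf", f"DM:{sup}"))
    return triples


def map_document(instances: Dict[str, str],
                 used_relations: List[str],
                 relations: Dict[str, Tuple[str, str]]) -> List[Triple]:
    """Instance level: generate meta-data triples for one document.

    instances maps a DM concept name to the name of its instance in the document.
    """
    triples: List[Triple] = []
    for concept, inst in instances.items():           # rule 2a
        rid = f"DI:{inst}"
        triples += [(rid, "RDFS:Label", inst),
                    (rid, "RDF:Type", f"DM:{concept}")]
    for rel in used_relations:                        # rule 2b
        c1, c2 = relations[rel]
        if c1 in instances and c2 in instances:
            domain, rng = f"DI:{instances[c1]}", f"DI:{instances[c2]}"
        else:  # fall back to the DM level resources
            domain, rng = f"DM:{c1}", f"DM:{c2}"
        pid = f"DI:{rel}"
        triples += [(pid, "RDFS:Label", rel),
                    (pid, "RDFS:Domain", domain),
                    (pid, "RDFS:Range", rng),
                    (pid, "RDF:Type", f"DM:{rel}")]
    return triples
```

Both functions depend only on their inputs and iterate in insertion order, so repeated application to the same model yields the same triples, in line with the determinism property discussed in section 5.5.3.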

In the direct mapping, we observe the following limitations from the
point of view of RML constructs:

   Generalisation hierarchy: RDF does not support the same
    hierarchy of generalisation mechanisms as RML. With respect to
    the intensional definition of the domain, this represents a
    relaxation of the domain model. However, in the retrieval
    semantics, we have not exploited the full power of RML
    generalisation operations, thus in our application, this relaxation
    is not as strong as in the general modelling case.

      Another concern is the semantics of the RDF/RDF-S generalisation
      constructs: RDF:Type and RDF-S:SubClassOf. Are domain model
      level classes types of resources or are they subclasses? Likewise,
      are DI classes subclasses or types of DM classes? With our three-
      level interpretation of model, schema and instance, we have chosen
      to view the representation at one level as an instance of the level
      above. Thus, all generalisation relations between classes at
      different levels are constructed as RDF:Type relations.
     Direction of relationships: The property construct in RDF is not
      bidirectional. When an RML relation is transformed into an RDF
      property, this implies an enforcing of a direction on the relation.
      The obvious workaround is to construct a property for the relation
      in the opposite direction. Anyhow, naming of a relation very often
      implies a direction of reading the relation, as bidirectional names
      are awkward to find.
     Relation naming: RDF:Properties enforce a naming of
      DM:Relations. Relation naming with respect to retrieval semantics
      is discussed earlier, and is in any case a requirement for a well-
      formed model fragment.
     N-ary relations: RDF lacks the construct of N-ary relations. In the
      above described production rules, N-ary relations are represented
      as a set of single relations.
     Relation constraints: Relation constraint constructs (cardinality
      and coverage) have no corresponding construct in RDF, and are
      discarded in this mapping. However, no retrieval semantics are
      defined for these relation constraints.
     Attributes: Attribute properties, (attribute kinds) have no
      corresponding construct in RDF, and are also discarded in this
      mapping. Defining attributes are required in the instantiation of
      RML class concepts. However, the instantiation of the fragment
      takes place on the RML side, i.e. before the mapping takes place,
      thus the constraint will be enforced before the mapping, and does
      not yield semantics required in retrieval once established.
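
The workaround mentioned for the direction of relationships, constructing a second property for the opposite direction, can be sketched as follows. The triple representation, the function name and the inverse identifier are assumptions for illustration only.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def add_inverse(prop_id: str, inverse_id: str, triples: List[Triple]) -> List[Triple]:
    """Return the triples extended with a property that reverses prop_id.

    The inverse property swaps the domain and range of the original one.
    """
    domain = next(o for s, p, o in triples if s == prop_id and p == "RDFS:Domain")
    rng = next(o for s, p, o in triples if s == prop_id and p == "RDFS:Range")
    return triples + [
        (inverse_id, "RDF:Type", "RDF:Property"),
        (inverse_id, "RDFS:Domain", rng),
        (inverse_id, "RDFS:Range", domain),
    ]
```

Note that the inverse needs its own name (for instance a hypothetical `DM:attendedBy` for `DM:attends`), which is exactly the naming burden pointed out above.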

5.5.5.     Second trial: Constructing RML classes in RDF
As discussed above, by performing the direct mapping, we lose some
of the RML specific information in the resulting RDF representation. In
this second trial, we approach this by defining RML constructs as RDF
Classes within a separate RML namespace and apply these in order to
conserve this information. This approach is taken elsewhere, for
example in order to construct an RDF representation for the OIL and
DAML + OIL (W3C-DAML+OIL, 2001) ontology languages. As an
example, figure 5.14 shows the OIL classes constructed to represent
cardinality constraints on RDF properties.

                                          Figure 5.13

                                          Second mapping:

Figure 5.13 illustrates the 3 required mappings:
    1. All RML constructs are defined as RDF/RDF-S statements. This
       yields an RML namespace section to be represented alongside the
       domain model. This is a transformation of the constructs, and not
       a direct translation, as they must be defined indirectly through
       application of utility classes. However, no retrieval sensitive
       information is lost in this transformation.
    2. The domain model is then translated into an RDF representation,
       by using the RDF defined RML constructs.
    3. As above, RDF meta-data statements are generated from
       instantiated fragments.
To follow such an approach, we have to define RDF Classes for each of
the constructs in the RML meta-model that we are interested in using
within RDF statements. In the mapping presented in the sequel, there
are three "sets" of classes:

   Classes representing the basic RML concept hierarchies

                                          Figure 5.14

                   OIL utility classes in RDF - inferring restrictions on properties.

Semantic modelling of documents                                                         137
                                              Figure 5.15

                                RDF/S representation of basic RML constructs.

      Classes representing RML relations; binary-relations, attributes [62]
       and n-ary-relations

      Utility classes. Similar to the OIL example above, utility classes
       are introduced in RDF in order to represent constructs not
       normally present in RDF/RDF-S and that are not directly
       convertible to some existing RDF/RDF-S construct. As we have
       seen from the discussion following the direct mapping strategy,
       this concerns constraints on attributes and relations as well as the
       representation of n-ary relations in general.
Figure 5.15 shows a representation of the basic concept hierarchy
(relation and non-relation concepts) from the RML meta-model (figure
4.8) expressed as RDF Resource classes and related using RDFS
constructs. The concept hierarchy is mapped directly from RML to RDF,
using RDFS:SubClassOf constructs to enforce the hierarchy of
concepts [63]. The RML generalisation operations are represented as RDF
Property classes. RDFS:Domain and RDFS:Range constructs are applied
in order to enforce the constraints (previously specified in OCL) that the
specific operations only apply to particular operand classes but always
yield RML class concepts as results.
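
The hierarchy described above can be sketched with plain (subject, predicate, object) tuples, without relying on any RDF library. This is only an illustration under stated assumptions: the names under the rml: prefix (rml:Concept, rml:RelationConcept, rml:NonRelationConcept, rml:ClassConcept, rml:Generalisation) are hypothetical and are not taken from figure 5.15.

```python
# Triples are (subject, predicate, object) tuples; all rml: names are hypothetical.
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
TYPE = "rdf:type"

triples = {
    ("rml:Concept", TYPE, "rdfs:Class"),
    ("rml:RelationConcept", SUBCLASS, "rml:Concept"),
    ("rml:NonRelationConcept", SUBCLASS, "rml:Concept"),
    ("rml:ClassConcept", SUBCLASS, "rml:NonRelationConcept"),
    # A generalisation operation modelled as a property whose operand
    # (domain) is restricted and whose result (range) is a class concept.
    ("rml:Generalisation", TYPE, "rdf:Property"),
    ("rml:Generalisation", DOMAIN, "rml:ClassConcept"),
    ("rml:Generalisation", RANGE, "rml:ClassConcept"),
}


def superclasses(cls, graph):
    """Return every class reachable from cls via rdfs:subClassOf."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if s == current and p == SUBCLASS and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def range_of(prop, graph):
    """Return the classes declared as rdfs:range of prop."""
    return {o for s, p, o in graph if s == prop and p == RANGE}
```

Following the subclass links from rml:ClassConcept reaches both rml:NonRelationConcept and rml:Concept, which is the kind of hierarchy enforcement the RDFS:SubClassOf constructs are meant to provide.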

 [62] Both in the generic definition of attributes in RML and in the previously shown direct mapping
 from RML attributes to RDF, attributes behave like relations. Thus, the attachment of attribute constraints
 is here treated uniformly with relation constraints.
 [63] RDFS:SubClassOf: disjoint by default? (It should be, analogous to UML and object-oriented programming languages.)

Before presenting the mapping of RML relation classes, we must define
the utility classes that are needed to express relation and attribute
constraints. For this, we define an RML Constraint class in RDF,
analogous to the structure chosen in the OIL mapping.
Figure 5.16 shows the RML constraint structure of cardinality, coverage
and attribute kind constraints, represented in RDF. We have defined an
RML Constraint class as the top-level constraint.
A constraint is connected to an RDF class through a property ApplyTo.
The ApplyTo property is defined as a mapping with RDFS:Domain from a
constraint class and any RDFS:Class as the RDFS:Range, i.e. a mapping
that applies the constraint to another class. Even if our constraints apply
to only specific classes (e.g. the RML:AttributeKind constraint only
applies to RML:Attribute classes), we have chosen to define one generic
ApplyTo property, rather than individual RelationConstraintApplyTo
properties. These utility classes are something of a "necessary
workaround" in order to be able to represent complete RML structures in
RDF, and we have opted for defining as few as possible of these.
Figure 5.17 shows an RDF model defining the representation of RML n-ary
and binary relationships. Analogous to the RML meta-model, an
EndPoint is introduced in order to connect relation constraints to the
proper end of the relation. The EndPoint BelongTo a binary relation and
is ConnectedTo the NonRelation concept that is the actual end of the
relation. Both BelongTo and ConnectedTo are represented as RDF
Property classes. Binary relations must have two endpoints, a rule not
represented in the RDF model. N-ary relations are modelled using the

                                         Figure 5.16

                Utility classes in RDF/S to represent RML binary and N-ary relations

RDF Bag container class, in order to state that a N-ary relation is a
collection of binary relations. Binary relations may thus be RDFS
Members of an N-ary relation bag.
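
Since the rule that a binary relation must have two endpoints is not expressed in the RDF model itself, it would have to be checked outside RDF. The following is a minimal sketch of such a check, assuming statements are held as plain tuples. The property names EndPoint, BelongTo, ConnectedTo and ApplyTo follow the text; the rml:/dm: prefixes, the relation dm:Attends and the concepts dm:Pupil and dm:School are hypothetical.

```python
from collections import Counter

# Hypothetical statements for one binary relation with two endpoints and a
# cardinality constraint attached to one end through ApplyTo.
triples = [
    ("dm:Attends", "rdf:type", "rml:BinaryRelation"),
    ("dm:ep1", "rdf:type", "rml:EndPoint"),
    ("dm:ep1", "rml:BelongTo", "dm:Attends"),
    ("dm:ep1", "rml:ConnectedTo", "dm:Pupil"),
    ("dm:ep2", "rdf:type", "rml:EndPoint"),
    ("dm:ep2", "rml:BelongTo", "dm:Attends"),
    ("dm:ep2", "rml:ConnectedTo", "dm:School"),
    ("dm:card1", "rdf:type", "rml:CardinalityConstraint"),
    ("dm:card1", "rml:ApplyTo", "dm:ep1"),
]


def relations_with_wrong_endpoint_count(graph):
    """Return binary relations that do not have exactly two endpoints."""
    binary = {s for s, p, o in graph
              if p == "rdf:type" and o == "rml:BinaryRelation"}
    counts = Counter(o for s, p, o in graph if p == "rml:BelongTo")
    return sorted(r for r in binary if counts[r] != 2)
```

With the complete statement list the check returns no violations; if the second endpoint is left out, dm:Attends is reported.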

5.5.6.   The example revisited
With all RML constructs defined as classes in RDF, it is possible to
revisit our RDF example, with the intention of constructing statements
that include all details from the original RML model. The RDF
statements constructed this way naturally become more verbose than
the original example, so we will only illustrate the extended mapping
through two smaller parts of the model.
Figure 5.18 shows the representation of individual and class concepts
(DI:jag and DI:Pupil respectively) and attributes (DM:P.ssn and
DM:P.Name). As we can see, the DI level representation has not
changed particularly. The only change is that we now apply the
RML:MemberOf property class, rather than the RDF:Type relation, in order to
specify that 'jag' is a Pupil. At DM level the only change is the inclusion
of the ApplyTo properties that specify both attributes of Pupil as
defining. The major difference is the inclusion of classes from the RML
level. Naturally, these classes don't have to be included for every
document, but they have to be referenced from all the classes in the
domain model that they apply to, such as the ApplyTo properties in the
example that connect the RML Defining attribute constraint to the two
DM level attributes.

                                          Figure 5.17

         Utility classes in RDF/S to represent attribute constraints and relation constraints.

Figure 5.19 shows the extended mapping applied to the n-ary relation
DM:Consultation. Again, we see that there are no extensive changes to
the DI and DM levels.
The only new element in the DM level is the particular DM:EndPoint, that
represents the connection between the DM:ConductedBy binary relation
and the DM:Doc class concept in the original RML model.
Again, the major extension to the example is the verboseness that is a
result of the inclusion of RML level classes. However, the one difference
worth noting from the original example in figure 5.20, is that the two
concepts that are related (the DM:Doc and DM:Consultation concepts)
are now not directly connected, as they were initially. This is the
drawback of using the RML endpoint construct, which by design
dislocates the two related concepts.

5.5.7.      Applying the mappings
We have specified two alternative mappings from RML to RDF. The
purpose of these was to verify whether our RML-based approach to
document description and retrieval can be realised by way of RDF and
RDF tools, which represent the emerging standard for Semantic Web
For both mappings, we have specified how to produce meta-data
statements in RDF from an RML domain model and a corresponding

                                           Figure 5.18

From first to second mapping: Individual and class concepts. Greyed constructs are inferred in the second

                                  Figure 5.19

                  Second mapping (cont’d): Binary and N-ary relations

instantiated model fragment. The first mapping was specified by taking
a direct approach with the goal of getting as pure and simple RDF
statements as possible. Any RML constructs that could not be
represented by way of existing RDF constructs were approximated if
possible or simply discarded. The second and more complete mapping
introduced utility classes in RDF, based on the RML meta-model, in
order to be able to produce a complete RDF representation of the
original RML model.
The first, direct mapping loses some details from the original RML
representation, but these are all details that we have not specified for
use in our description and retrieval approach. As long as our purpose is
simply to realise our approach by storing the meta data descriptions in
RDF and applying some RDF storage and query system, this direct
mapping would be applicable. As a side-note, it is interesting that the
constructs of RML that we have not made use of in our approach do
not exist in RDF either, a language specifically designed for similar
If a later development were to include some of the RML constructs
currently not applied, we would turn to the second, complete mapping.
Also, if we were to use RDF more extensively, for example include the
ability to import meta-data from RDF into RML, or to exchange meta-
data with an RDF-based system, the more detailed approach would
come into use. However, the detailed mapping is also specified for

production of RDF statements only and without considerations regarding
We have specified and exemplified the mappings to a level where we
claim that it would be possible to apply RDF and RDF tools for the
realisation of our approach. In our prototype implementation however,
we have not implemented our retrieval system by way of any RDF tools.
Realisation and experiments are needed in order to validate whether the
specified mappings are sufficient. In particular, we have not examined whether
our RML based query strategy and notation could be transferred into
corresponding queries in an RDF query language. The latter would be
central for the applicability of RDF in a realisation of our approach.

5.6. Summing up
In this chapter, we have defined the semantics of RML as a classification
and retrieval language. Based on the notion of referential semantics
from digital libraries, we have defined the semantics of model fragments
and RML model constructs. Following this, we have defined well-
formedness constraints and selection constraints, in order to be able to
regulate the description of a document into a well-formed RML model
fragment. In the end we have defined a boolean query strategy and a
model-based query notation for the RML language.
As an attempt to relate our approach to open and emerging standards
and as an exercise to verify the possible realisation of our approach
within these, we have specified a mapping from RML based document
descriptions into meta-data statements in the Resource Description
Framework (RDF).

 6. Linguistic analysis for semantic document modelling

This chapter of the thesis defines the use of linguistic techniques related
to our model based approach to semantic retrieval. We apply linguistic
text analysis techniques to automate and facilitate parts of the modeling
process as well as to detect the terms in the document text that serve as
designators for the model concepts.
NLP techniques are still not at a level where advanced text/language
understanding may be performed automatically on running text. Due to
performance requirements (such as speed and space), it is common to
apply a selection of techniques to perform smaller tasks in the
indexing-retrieval cycle (Gulla et al., 2002; Allan, 2000; Voorhees and Tice, 1999).
Our approach is designed in a similar manner, where a selection of
techniques is applied for specific tasks, and where the users evaluate
the results of the techniques, thus providing the final touch.
As defined in chapter 3, we apply linguistic techniques in all of the three
main processes of the approach:
   Modeling: A sequence of techniques constitutes a document
    analysis process that provides candidate words and phrases for
    representation in the conceptual domain model. The process also
    helps us define possible relations between concepts. This process
    and the applied techniques are described in section 6.2.
    To enable automated matching of text against the model,
    concepts in the domain model are accompanied by what we refer
    to as a domain model lexicon. This lexicon is the basis for the later
    matching of text against the domain model both at classification
    and retrieval time. The domain model lexicon is described in
    section 6.1.
   Classification: At classification time, the document is matched
    against the domain model lexicon and matching concepts will be
    provided as input to the user. The approach enables different
    matching strategies to be applied, from "exact" to "approximate"
   Retrieval: At retrieval time, a textual query expression is
    transformed into a model-based search. Again, this matching is
    performed with the help of the domain model lexicon.
    The matching strategies applied in our approach are discussed in
    section 6.3.

6.1. Domain model lexicon
In concept-based retrieval, a concept represents an abstraction from its
various linguistic realizations in text. In order to support the matching of
text against the conceptual domain model we define a domain model
lexicon that for each concept and relation contains the words and
phrases that are considered textual designators for the concept. A
lexicon item for a concept will contain all the “searchonyms”
(Baeza-Yates & Ribeiro-Neto, 1999) and their word forms. To avoid confusion
between word, phrase, term and word-forms, we define:
       Word – any word in a document. A word may take on several word forms.
       Word forms – most words can take on several different word forms,
         for example through inflections or other morphological
       Base form – denotes the main word form of a word. In cases where
         lemmatisation is performed according to a dictionary, the base
         form is the dictionary entry of a set of word forms.
       Phrase – a combination (i.e. a sequence) of words.
       Term – In our approach, we would like to treat words and phrases
         uniformly. We therefore use term to refer to both words and
         phrases, i.e. a term can be “a particular word or a combination of
         words, especially one used to mean something very specific or one
         used in a specialized area of knowledge or work.” (Encarta, 1999)
The domain model lexicon is maintained manually, though the contents
of the lexicon should be extracted from several sources:

     Actual text in the domain document collection. The document
      analysis process described in the next section is considered the
      main input for defining the model lexicon.

     Available relevant existing lexicons, thesauri or dictionaries.
      Implementation-wise, we support the DICT standard dictionary
      look-up interface (Section 7.5.3)

     We do not consider this lexicon an integrated part of the model; it
      is stored in a separate file (see section 7.3 on architecture) and
      connected to the model. In principle, the overall approach may be
      used solely based on the conceptual model, in which case this
      domain model lexicon is not needed.
The structure of the domain model lexicon is defined by the UML class
diagram in figure 6.1.
The lexicon item for a model concept consists of two major parts:
      1. The definition of the concept.
      2. The set of terms that designate the concept.

146                                   Linguistic analysis for semantic document modelling
Relation concepts are not given a textual definition, but simply have a
set of terms that represents the relation name. An example extract from
a domain model lexicon file is shown in figure 6.2.
In the sequel, we present the different parts of a lexicon item in more detail.
Concepts must be given a domain specific human readable definition.
The purpose of this definition is primarily to provide explanation of the
concepts in the model to users unfamiliar with the domain or the
domain model.
In our domain model lexicon the definition is stored in a free text
format, and the users are free to enter any textual definition they may
find appropriate. In general, we suggest that the definition of a concept
is given by either:

   A "handwritten" definition provided by the stakeholders.
   Adopting and adapting an existing definition from some source,
    such as general dictionaries or domain specific thesauri or
    terminology definitions.
   "Example", i.e. by extracting and selecting a set of sentences from
    the document collection in which the concept occurs (denoted
    occurrence in Figure 6.1). This is an approach similar to (Voss et
    al., 1999), where a concept is solely defined in terms of the text
    fragments where it occurs.

                                               Figure 6.1

Structure of the domain model lexicon. A model concept is connected to 1 lexicon item. Each lexicon item
consists of 1 or more Terms and 1 or more definitions. A term is recorded with a base-form and several
possible lexical forms. A definition is either a free-text definition that explains the concept, or occurrence
examples extracted from documents. Definitions are recorded with their source, author and date. Relations
do not have definitions, but may have several possible names (terms).

Also, a definition may be given by a combination of the above. The need
for explanation of the model is by nature situation specific, and the
users will have to design a way of defining concepts that suits their
need. In line with the general principles behind our approach, it is more
desirable to give concepts a definition specific to the actual domain.
The set of Terms
The term list of a concept contains all actual words and word forms that
designate this concept in the text, and that will produce a hit for this
concept during matching. The idea is to store all word forms of a term in
the domain lexicon, to avoid having to perform any linguistic analysis
and lexical lookup at matching time. The size of the domain lexicon will
naturally depend on the size of the domain and the document collection,
but it will nevertheless be significantly smaller than general dictionaries
of a given language.
We have chosen to treat words and phrases uniformly, and have defined
term to include both single and multiword terms. Phrases are
"lemmatised" into base forms, simply by lemmatising each word in the
phrase. All word forms – i.e. all conjugated forms for the term – are then
stored in an array alongside the term itself.
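
The idea of storing all word forms so that matching reduces to a lookup can be sketched as follows. This is a minimal illustration rather than the structure of figure 6.1: the class names Term and LexiconItem, the helper functions and the whitespace-only tokenisation are assumptions. The word forms are the ones listed in figure 6.2.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    base_form: str
    word_forms: list = field(default_factory=list)


@dataclass
class LexiconItem:
    concept_id: str
    definition: str = ""
    terms: list = field(default_factory=list)


def build_index(items):
    """Map every stored word form (lower-cased) to the concepts it designates."""
    index = {}
    for item in items:
        for term in item.terms:
            for form in [term.base_form, *term.word_forms]:
                index.setdefault(form.lower(), set()).add(item.concept_id)
    return index


def match_text(text, index, max_words=3):
    """Return concept ids whose terms occur in text, by lookup only.

    Words are split on whitespace; punctuation handling is left out.
    Phrases of up to max_words words are looked up as single terms.
    """
    words = text.lower().split()
    hits = set()
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            hits |= index.get(candidate, set())
    return hits


item = LexiconItem(
    concept_id="x9",
    terms=[Term("kommunehelsetjeneste",
                ["kommunehelsetjenestene", "kommunehelsetjenester"])],
)
index = build_index([item])
```

No lemmatisation or dictionary access happens inside match_text; all linguistic work is done when the lexicon is built, which is the trade-off described above.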
As presented in chapter 5, a domain model may be instantiated by the
users. These instances are stored in the domain model lexicon. In
general, instances of the concept will be proper nouns, names,
abbreviations etc. For example, we have four major universities in
Norway: the Norwegian University of Science and Technology (NTNU),

                                          Figure 6.2
 <!-- Pointer to the domain model concept -->
 <referentinfo referent="x9">
   <!-- The definition of the concept in free text. The example definition
        is from a medical terminology document, provided by KITH.
        English gloss: the part of the health service that the
        municipalities are responsible for under the act on municipal
        health services. The municipal health service comprises
        environmental health protection, health promotion and preventive
        work, diagnosis and treatment, habilitation and rehabilitation,
        and nursing and care. -->
   <definition author="Iver Nordhuus" source="" date="2002-01-07">
     Den delen av helsetjenesten som kommunene har ansvaret for etter lov
     om helsetjenester i kommunene. Kommunehelsetjenesten omfatter
     miljørettet helsevern, helsefremmende og forebyggende arbeid,
     diagnose og behandling, habilitering og rehabilitering og pleie- og
     omsorg.
   </definition>
   <!-- Base form, from dictionary -->
   <synonym word="kommunehelsetjeneste">
     <!-- List of possible full forms -->
     <wordform word="kommunehelsetjeneste"/>
     <wordform word="kommunehelsetjenestene"/>
     <wordform word="kommunehelsetjenestens"/>
     <wordform word="kommunehelsetjenester"/>
   </synonym>
 </referentinfo>
Example domain model lexicon entry. The lexicon is stored as a separate XML file that refers to the
concepts defined in the Referent model XML file.
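
A lexicon entry of this shape could be read with Python's standard xml.etree.ElementTree module, as in the sketch below. The element and attribute names (referentinfo, definition, synonym, wordform, referent, word) follow figure 6.2; the closing tags and the shortened definition text in the sample are assumptions made to obtain a well-formed input, and read_entry is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# A well-formed variant of the entry in figure 6.2 (definition shortened).
ENTRY = """
<referentinfo referent="x9">
  <definition author="Iver Nordhuus" source="" date="2002-01-07">
    Den delen av helsetjenesten som kommunene har ansvaret for.
  </definition>
  <synonym word="kommunehelsetjeneste">
    <wordform word="kommunehelsetjeneste"/>
    <wordform word="kommunehelsetjenestene"/>
    <wordform word="kommunehelsetjenestens"/>
    <wordform word="kommunehelsetjenester"/>
  </synonym>
</referentinfo>
"""


def read_entry(xml_text):
    """Return (referent id, {base form: [word forms]}) for one lexicon entry."""
    root = ET.fromstring(xml_text.strip())
    terms = {
        syn.get("word"): [wf.get("word") for wf in syn.findall("wordform")]
        for syn in root.findall("synonym")
    }
    return root.get("referent"), terms


referent, terms = read_entry(ENTRY)
```

The returned dictionary can be fed directly into a word-form index, since each base form is paired with its list of full forms.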

University of Oslo (UiO), University of Bergen (UiB) and the University of
Tromsø (UiTø). Both the full names of these universities as well as the
various acronyms will be considered textual designators for, say, the
concept of "university" in a document from the Norwegian governmental
or educational domain. Instances are not visible in figure 6.1, as they
are treated similarly to the terms and stored within the same structure,
even if linguistically instances may not behave in a similar way as
regular noun based terms. As an example: In Norwegian, proper nouns
are not inflected like common nouns, but they do take on genitive form.
For the Norwegian university example, we would select the full name as
the base form and then add both the full form and the acronyms as
word forms.
In some cases it is not meaningful to expand the dictionary exhaustively,
but rather to store matching patterns that can be used to test if a given
word form can be matched against the base-form. Matching strategies
and matching patterns are discussed in section 6.3.
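
As an illustration of a matching pattern that replaces an exhaustive list of word forms, the sketch below accepts a proper noun with an optional genitive -s. It is only an assumed example: the pattern format actually used is the subject of section 6.3, and genitive_pattern is a hypothetical helper.

```python
import re


def genitive_pattern(base_form):
    """Compile a pattern matching base_form with an optional genitive -s."""
    return re.compile(r"^" + re.escape(base_form) + r"s?$", re.IGNORECASE)


pattern = genitive_pattern("NTNU")
```

Here "NTNU" and "NTNUs" both match the base form, while other suffixes are rejected, so only one pattern needs to be stored for the instance.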
The set of Relation names
As figure 6.1 shows, relations are treated similarly to concepts, except
that they do not have any textual definition attached to them. In fact, the
terms attached to a relation represent a set of acceptable names for
this particular relation. Relation names can be suggested by the
linguistic analysis process, but can also be added by a user at a later
stage. In our current implementation, relation names are entered by the

6.2. The document analysis process
In order to extract the relevant information from the document
collection, we apply a whole battery of linguistic tools that have been
organised into a linguistic workbench (Gulla et al., 2004). The
workbench architecture allows for chaining and configuration of these
tools into different document analysis paths or processes (Kaada,
All the linguistic tools that we apply in this process are limited (small
scale) and well known techniques based on morphological and statistical
analysis (Jurafsky & Martin, 2000; Manning & Schütze, 1999).
Figure 6.3 shows an overview of the document analysis process. The
process consists of two parts:
   1. An automatic sequence of analysis steps that extracts information
      from the original documents into a result document. This part
      constitutes the pre-processing part of the overall analysis.

      2. In part two, the modeller interactively works with the result of the
         previous part in order to produce the desired input for the
         conceptual modelling phase, and to extract relevant information
         into the domain model lexicon.
The illustrated sequence of process steps for the pre-processing part
represents what we consider an “ideal” sequence of steps for our
purposes. However, not all of these steps are applicable or relevant in
every setting. There is no need to perform language detection if the
documents only exist in one known language, and if a part-of-speech
tagger is not available, word class based phrase detection must be
substituted with a purely statistical phrase detection approach. Within the
linguistic workbench architecture, analysis components can easily be
substituted, added to or removed from the analysis process.
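
The chaining of components can be sketched as a configurable list of step functions that each extend a shared result. This is an illustration of the principle only, not the workbench architecture itself; the step names, the dictionary-based result and the fixed language value are assumptions.

```python
def detect_language(result):
    # Placeholder step: a real component would inspect the text.
    result["language"] = "no"
    return result


def tokenize(result):
    result["tokens"] = result["text"].split()
    return result


def count_words(result):
    counts = {}
    for token in result["tokens"]:
        key = token.lower()
        counts[key] = counts.get(key, 0) + 1
    result["counts"] = counts
    return result


def run_process(text, steps):
    """Run the configured steps in order; each step extends the shared result."""
    result = {"text": text}
    for step in steps:
        result = step(result)
    return result


full = run_process("Pupil meets doctor pupil",
                   [detect_language, tokenize, count_words])
# Language detection left out, e.g. for a single-language collection.
reduced = run_process("Pupil meets doctor pupil", [tokenize, count_words])
```

Dropping or replacing a step only changes the list passed to run_process, as long as the remaining steps find the intermediate results they depend on.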

6.2.1.     Process results
As can be seen from figure 6.3, the outcomes of this process are
Conceptual model suggestions and Domain model lexicon entries:

                                            Figure 6.3

Overview of document analysis process. The process consists of an automated sequential pre-processing
phase (top) and an interactive part where the user examines the results and gradually builds the domain
model lexicon.

   Conceptual model suggestions: The major outcome of the
    process is a list of "concept candidates" i.e. a list of words and
    phrases that are considered relevant with respect to the analysed
    documents. This list of phrases is ranked according to a
    specificity measure for the particular document collection (section

    In addition to listing the concept candidates, the analysis also
    tries to indicate relations between the extracted words and
    phrases. We use a statistical measure of relatedness to produce a
    list of possibly related terms for each candidate (see section

   Domain model lexicon entries: If a concept candidate is selected
    for inclusion in the domain model in the interactive part of this
    process, we are able to extract results that can be included in the
    domain model lexicon for this concept.

Result Document
Results from each analysis step are stored in an XML document – in a
format we have named DOXML. This XML document is gradually built
through each document analysis step, by incrementally adding result
sections for each step. This way, every analysis step can operate on the
appropriate results from previous steps and will either add its own
section or operate inline on an already existing section.
In the end, this document will contain all results and all desired
intermediary results from the process. The exact format of the XML
document is presented in the realisation chapter (section 7.5).
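
A result document that grows by one section per analysis step might be built as in the following sketch, using Python's standard xml.etree.ElementTree. The section element, its step attribute and the f element are hypothetical and do not reproduce the DOXML format of section 7.5; only the w element with pos and v attributes follows the example shown later in this chapter.

```python
import xml.etree.ElementTree as ET


def add_section(doc, name):
    """Append an empty result section for one analysis step and return it."""
    return ET.SubElement(doc, "section", {"step": name})


def tokenise_step(doc, text):
    section = add_section(doc, "tokens")
    for pos, word in enumerate(text.split()):
        ET.SubElement(section, "w", {"pos": str(pos), "v": word})


def frequency_step(doc):
    # Operates on the section written by the previous step.
    tokens = doc.find("section[@step='tokens']")
    counts = {}
    for w in tokens.findall("w"):
        counts[w.get("v")] = counts.get(w.get("v"), 0) + 1
    section = add_section(doc, "frequencies")
    for word, n in sorted(counts.items()):
        ET.SubElement(section, "f", {"v": word, "n": str(n)})


doc = ET.Element("document")
tokenise_step(doc, "klinisk vurdering og klinisk kontroll")
frequency_step(doc)
```

The second step reads only what the first step stored, so intermediate results remain available in the document after the process has finished.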
Process steps
In the sequel, each process step is described through its input (definition
of the input data, parameters and required resources), function
(definition of the algorithm), results and dependencies (to previous
process steps) along with a small example.

6.2.2.   Extraction
The components of our text analysis process all operate on pure-text.
The first thing we have to do is then to extract the text from the original
documents and store the pure text in the format of the XML result
document. This is the task of the extractor. The extractor reads the
collected documents, and for each document in the collection, it
constructs a DOXML-document.

Input:      D c:      The document collection to be analysed. The document
                      collection is a set of document files, accessed either as
                      a list of files, a list of URL’s or as a document directory.

            Sr:       A set of structure delimiter rules. A structure delimiter
                      rule consists of a pattern and a structure tag. The
                      pattern is one or more character sequences that
                      designate a structure delimiter in a document. The
                      structure tag is inserted into the result document if the
                      structure delimiter pattern is discovered.

            Mr:       Meta-data detection rules. A meta-data detection rule is
                      a pattern that matches known meta-data statements in
                      a document. The meta-data detection rules are used to
                      transform existing meta-data statements into the
                      format of the result file.

            Rr:       Removal rules. Removal rules are patterns that are used
                      to detect unwanted text that is not carried over to the
                      result document, i.e. removed.

            match( wordsequence, Rule)
                      A matching function that matches a sequence of words
                      from the document with the patterns specified in the
Function:   For each document di in Dc, insert document into result file, s.t.:

             1. Create document head tag
                // insert a new document section
             2. ws = next n words in di
                // Read document by a sequence of n words : ws.
             3. For each ws
                     a. If match(ws, Mr) – create meta-data element
                     b. If match(ws, Sr) – insert structure delimiter tag
                     c. If match(ws, Rr) – skip to next ws
             4. Insert ws into result document
             5. Upon end of document di – write all meta-data elements.
             6. Insert document end tag.
            The Result is simply an xml section for each document, where
            existing document tags and structure delimiters are removed,
            and document text is unaltered but embedded within our own
            document structure tags, and where known meta-data items are
            converted into our own meta-data tags.
Example:    <document>
                   <!-- Start of new sentence -->
                   <s>
                     <!-- Entry for each word; word positions within the
                          sentence are recorded -->
                     <w pos="0" v="3"></w>
                     <w pos="1" v="begrunnelser"></w>
                     <w pos="2" v="for"></w>
                     <w pos="3" v="klinisk"></w>
                     <w pos="4" v="unders&#248;kelse"></w>
                   <!-- The set of meta-data extracted from the current document -->
                   <meta name="filename" value="c:\doc\helseskole\A98-
                   <meta name="title" value="Description of health school
                   <meta name="lastmodified" value="2002-07-17"/>
                   <meta name="author" value="NN"/>
                   <!-- End of document -->

Dependencies:        None

                            Text extraction and DOXML generation
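
The extraction function specified above can be sketched in Python as follows. This is a simplified reading of the steps, not the extractor itself: rules are assumed to be compiled regular expressions, the text and ws element names are hypothetical, and a pattern that straddles two word sequences is not detected.

```python
import re
import xml.etree.ElementTree as ET


def extract(text, structure_rules, metadata_rules, removal_rules, n=5):
    """Build a <document> element from raw text, n words at a time.

    structure_rules: list of (compiled pattern, tag name)   -- Sr
    metadata_rules:  list of (compiled pattern, meta name)  -- Mr
    removal_rules:   list of compiled patterns              -- Rr
    """
    doc = ET.Element("document")                      # step 1
    body = ET.SubElement(doc, "text")
    metadata = []
    words = text.split()
    for start in range(0, len(words), n):             # steps 2 and 3
        ws = " ".join(words[start:start + n])
        for pattern, name in metadata_rules:          # 3a
            found = pattern.search(ws)
            if found:
                metadata.append((name, found.group(1)))
        for pattern, tag in structure_rules:          # 3b
            if pattern.search(ws):
                ET.SubElement(body, tag)
        if any(p.search(ws) for p in removal_rules):  # 3c
            continue
        chunk = ET.SubElement(body, "ws")             # step 4
        chunk.text = ws
    for name, value in metadata:                      # step 5
        ET.SubElement(doc, "meta", {"name": name, "value": value})
    return doc                                        # step 6 on serialisation


doc = extract(
    "Author: NN <hr> <br> klinisk vurdering",
    structure_rules=[(re.compile(r"<hr>"), "paragraph")],
    metadata_rules=[(re.compile(r"Author:\s*(\w+)"), "author")],
    removal_rules=[re.compile(r"<[^>]+>")],
    n=2,
)
```

In the usage example the HTML fragments are dropped, a paragraph delimiter is inserted in their place, and the author statement is copied into a meta element at the end of the document.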

The extractor also removes unwanted text from the documents and
transforms known meta-data into the meta-data sections in the result
document. Examples of unwanted text are text specific to the original
storage format of the documents, such as html tags and style sheet
definitions. By known meta data, we refer to meta data that we know
exists in the documents and that we apply a pattern matching technique
to detect, i.e. we do not try to discover new meta-data at this point.
The pure text of the documents is extracted while keeping the existing
document structure. Currently we detect chapter, paragraph and sentence
boundaries and text fragments are placed within corresponding XML
tags in the result document. In HTML, boundaries are determined based
on the existing tag structure and on the punctuation and symbols in
the text.
Within an intranet setting, html documents are normally formatted
according to a standard or common template, and it is therefore
possible to extract with accuracy some meta-data attributes from the
document. We currently gather all information found in the html meta-
tags, but also allow for the specification of regular expressions that will
match text containing particular meta data items, such as "Last
modified by: <author> at <date>".
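
A regular expression for the "Last modified by" example could look like the sketch below. The ISO-style date format and the function name are assumptions; the meta-data names author and lastmodified follow the DOXML example above.

```python
import re

# Hypothetical rule for text such as "Last modified by: NN at 2002-07-17".
LAST_MODIFIED = re.compile(
    r"Last modified by:\s*(?P<author>.+?)\s+at\s+(?P<date>\d{4}-\d{2}-\d{2})"
)


def extract_last_modified(text):
    """Return author and date if the pattern occurs, otherwise an empty dict."""
    found = LAST_MODIFIED.search(text)
    if not found:
        return {}
    return {"author": found.group("author"),
            "lastmodified": found.group("date")}
```

Because such statements depend on the template of the originating site, a rule of this kind would have to be written for each document source.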
Naturally, these extractors have to be developed specifically for each
case, depending on the originating source and format of the document
collection. In building the analysis process, we have examined
documents from three different sources:

   In an example from a large Norwegian oil company, “Company N”,
    documents are stored in Lotus Notes generated html format on
    their intranet server. In this case, we have to perform an html
    "clean up", as the structure and nature of the Notes generated html
    is quite intricate. For example, every document is surrounded by a
    table (for layout purposes) containing the page menu and
    standard navigation links. The text in this surrounding layout
    template is considered noise in our case and is removed.
    Only the text found in the "body-part" of the page is kept as
    document text for further analysis.

     • In the case of the document collection that forms the basis for the
      medical terminology documents developed by KITH, the text is
      already available in ASCII form. In ASCII form, we cannot rely on
      HTML tags and patterns to detect document structure boundaries
      (sections and paragraphs). In this case, we can count the number
      of carriage returns between sentences to determine paragraphs.

     • In the case of the medical record documents we have examined,
      these are all extracted from the medical record system into ASCII
      text files, but the text in a medical record is structured according
      to specific section headings, so for this particular case, we use
      "cue words" and regular expressions to determine specific sections
      of the document.
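For the two ASCII cases, the boundary detection can be sketched as follows in Python. This is a minimal illustration only: the function names, the threshold of two consecutive line breaks and the way cue words are matched (a heading alone on a line, optionally followed by a colon) are assumptions of the sketch, and the cue words in the test data are hypothetical.

```python
import re


def split_paragraphs(text, min_breaks=2):
    """Split plain text into paragraphs wherever at least `min_breaks`
    consecutive line breaks occur (CR, LF or CRLF)."""
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    boundary = re.compile(r"\n{%d,}" % min_breaks)
    parts = boundary.split(normalised)
    # Join the lines inside a paragraph and drop empty fragments.
    return [" ".join(p.split()) for p in parts if p.strip()]


def split_sections(text, cue_words):
    """Split a record into (heading, body) pairs using cue-word headings.
    `cue_words` is a list of section headings supplied by the caller."""
    pattern = re.compile(
        r"^(%s):?[ \t]*$" % "|".join(re.escape(c) for c in cue_words),
        re.MULTILINE | re.IGNORECASE,
    )
    matches = list(pattern.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1), text[m.end():end].strip()))
    return sections
```

Text that precedes the first cue word is ignored by `split_sections`; whether such text should be kept as an unnamed section is a design choice left open here.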

6.2.3.     Language detection
The detection of the document language is important for the outcome of
the analysis process. Several of the later process steps are language
dependent, and will need language specific dictionaries and resources
such as grammatical and syntactic rules. In general, our approach may
support retrieval in multiple languages, but we then want to construct
separate domain model dictionaries for each of the languages in
question. Some documents may be multi-lingual: In the example from
Company N, subsequent document paragraphs would be Norwegian and
English. In our approach, we support language detection on each of the
structural levels determined by the extractor, that is section, paragraph
and sentence.
Language detection is currently a major issue in web search engines
(Gulla et al., 2002). Their task is much harder than within our
controlled intranet setting. They have to cope with a much larger set of
languages, they have no way of knowing in advance what languages they
will discover and in some cases a document only consists of a few words
(even worse in the case of a query). In our case, the task is much
simpler and amounts to distinguishing between a few languages known in
advance.
For each of the languages in question, we extract simple term lists from
documents given in that language. Currently, we retrieve documents
classified according to this language with the Alltheweb search engine
and extract a list of the most frequent terms, sorted by their raw
frequencies. We then extract a sequence of consecutive words from the
document – or document fragment. Each term in the sequence is looked
up in the term list for each language, and a score is calculated for the
whole sequence. The language term list that gives the highest score (i.e.
the highest number of hits) is selected as the language.

 Input:                 L            a set of target languages
                        Tli          a list of terms for a target language li
                        Ds           a document structure section
                        Sl           the structure level on which we want to perform
                                     language detection.

 Function:              lookup(word, Tli)              a lookup function that returns a score if
                                                       the word exists in Tli.

                        For each document structure section dsi
                              1.   ws = first n words of dsi
                                     // perform detection on first n words of section
                              2.   for each language li in L and each word wi in ws,
                                   score_li += lookup(wi, Tli)
                                     // add the score for the current word
                              3.   language = the li with the highest score_li
                                     // select language with highest score
                              4.   insert language tag in result document structure section

 Result:              The result of language detection is the insertion of a language tag (i.e. label) within each
                      document structure section that is examined.
 Example:               <s language="no">
                                <w pos="0" v="helsestasjon"></w>
                        <s language="eng">
                                 <w pos="0" v="health unit"></w>

 Dependencies:        None, but if language detection is to be performed on one of the structure levels, rather than on
                      the whole document, these must be determined in advance.
                                           Language detection

This is a relatively simple algorithm for language detection, albeit one
that is useful in the controlled settings we deal with. In our example
from Company N, all documents are either Norwegian or English, or a
mix of the two.
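The term list lookup can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the prototype code: the function name `detect_language`, the score of 1 per hit, the default of 50 words and the tie-breaking (the language listed first wins) are all assumptions, and the tiny term lists in the usage example are made up for illustration.

```python
def detect_language(words, term_lists, n=50):
    """Return (language, scores) for the first n words of a section.

    term_lists maps a language code to a set of frequent terms.
    Each word found in a term list adds 1 to that language's score;
    ties go to the language listed first (an assumption of this sketch).
    """
    sample = [w.lower() for w in words[:n]]
    scores = {
        lang: sum(1 for w in sample if w in terms)
        for lang, terms in term_lists.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical miniature term lists; real lists hold the most frequent
# terms extracted from documents in each language.
TERM_LISTS = {
    "no": {"og", "er", "det", "en", "på"},
    "eng": {"and", "is", "the", "a", "on"},
}
```

Calling `detect_language("Dette er en helsestasjon".split(), TERM_LISTS)` selects "no", since two of the four words occur in the Norwegian list and none in the English one.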
More advanced techniques do exist, for example the use of
frequencies and term probabilities in term lists as well as confidence
thresholds for language assertion, the removal of duplicate words from
the term lists (terms that exist in several languages), and the use of
N-gram analysis in the term extraction for each language. Our approach
may be augmented by following either of the first two strategies, while
N-gram analysis requires a higher degree of pre-processing and
computation (Zhdanova, 2002).

6.2.4.       Text cleaning
Text cleaning comprises text encoding and transliteration of special
language characters. Transliteration is important to support languages
with particular use of special characters, and where these characters
discriminate between the meanings of words, such as the use of umlaut
in German (ü, ö, ä).

Text cleaning is substantially more difficult on the web in general than
within our controlled settings. We know in advance the encoding of the
original documents and since we are dealing with a small set of
languages, we can perform transliteration with quite crude rules specific
to those languages.
For encoding, we have chosen the Unicode [64] standard UTF-8
encoding. UTF-8 is the default encoding for XML
documents. In our current implementation, this encoding is performed
already in the extractor step, when the result XML document is created.
In the general case, however, encoding and transliteration
should be handled in a separate component and performed after
language detection, for two reasons. First, in some cases the
original encoding of the source file will give hints to the language
detection step. As an example, if the encoding is ISO-8859 part 5
(Latin/Cyrillic), and there is extensive use of "non-ASCII" characters, the
language is limited to Russian, Ukrainian, Bulgarian and their variants
(Shimizu et al., 1997). Second, transliteration rules will differ according
to the language, or language class.
An example of the effects of variations in the use of accents is given by
Gulla et al. (2002) for the French word “évènements” (current correct
spelling), which returns 76.000 documents from the Alltheweb search
engine, while its variations return a total of 581.200 documents:
“événements” [65] (420.000 documents), “evenements” (35.000),
“evénements” (95.000), “evènements” (22.000) and “évenements”

 Input:                         Ds Document section to be transliterated

                                Tr a set of transliteration rules. Transliteration rules are
                                   character transformation rules that, given a character or
                                   character sequence, simply return another character or
                                   character sequence that represents the transliterated version
                                   of the initial sequence.
                                      1. ws = read sequence of words from Ds
                                      2. wst = transform (ws, Tr)
                                      3. write wst in the proper encoding
 Result                       Normalised encoded DOXML document text in UTF-8 format with transliterated text parts.

 Dependencies:                Language detection (in order to enable language specific transliteration rules).


  [65] “événements” represents the former correct spelling, which may explain the high number of hits for this

Transliteration amounts to the handling of special language characters,
signs, symbols and punctuation. The main objective is not to use
punctuation and signs to add semantics to the indexing process, but
rather to employ a standard approach, so that matching and index
computation are performed on normalised lexical forms.
For this, it is generally advisable to define a default procedure and
specify exceptions if needed (Baeza-Yates & Ribeiro-Neto, 1999 ; Fox,
1992). We have developed a general set of rules that are applied to
punctuation and symbols for both Norwegian and English.
For punctuation and symbols, we adopt the following general rules:
     1. All accents ( ´ , ` and ^) are removed.
     2. All cases of umlaut and diaeresis are removed (ï, ö, ä etc.).
     3. All hyphens inside words are removed (state-of-the-art >
        state of the art).
     4. Punctuation inside word boundaries is kept, e.g. in
        abbreviations ("f.eks." and "o.l.").
     5. All remaining dashes, hyphens and quotes are also discarded.
        Dashes and hyphens may be used to determine separate sentence
        parts, but we do not consider such an analysis today.
     6. All variants of parentheses are removed ("()", "{}", "[]", "<>").
        As in point 5, these may be used to determine part sentences, but
        again such analysis is not considered today.
For the Norwegian special characters ("Æ", "Ø", "Å"), we take no special
measure, as these are well established in Norwegian and as they are
supported by the UTF-8 encoding. [66]
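The rules above can be sketched as a single Python function. The sketch rests on two assumptions that are not stated in the rules themselves: that "removing" an accent or umlaut means dropping the mark while keeping the base letter (é > e, ö > o), and that the ring in "å" is not an accent in the sense of rule 1. The function name `transliterate` and the exact set of discarded quote characters are likewise choices made for the illustration.

```python
import re
import unicodedata

# Combining marks dropped by rules 1 and 2: grave, acute, circumflex, diaeresis.
# The ring above (U+030A) is deliberately kept so that "å" survives.
DROPPED_MARKS = {"\u0300", "\u0301", "\u0302", "\u0308"}

# Rules 5 and 6: remaining hyphens, dashes, quotes and brackets are discarded.
DISCARDED = "-\u2013\u2014\"'\u201c\u201d\u2018\u2019(){}[]<>"


def transliterate(text):
    """Apply the general punctuation and symbol rules to a text fragment."""
    # Rules 1 and 2: decompose, drop the selected marks, recompose.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch not in DROPPED_MARKS)
    text = unicodedata.normalize("NFC", stripped)
    # Rule 3: a hyphen between two word characters becomes a space.
    text = re.sub(r"(?<=\w)-(?=\w)", " ", text)
    # Rules 5 and 6: delete what is left of dashes, quotes and brackets.
    text = text.translate({ord(ch): None for ch in DISCARDED})
    # Rule 4 needs no action: full stops inside abbreviations are untouched.
    return " ".join(text.split())
```

With these rules, all the accent variants of “évènements” discussed earlier collapse to the single form "evenements", while "æ", "ø" and "å" pass through unchanged.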

6.2.5.     Part of speech tagging
Part-of-speech tagging is a common linguistic analysis technique,
described in several textbooks (Jurafsky & Martin, 2000 ; Manning &
Schütze, 1999) and with implementations available online. In general,
three strategies exist:
     1. Rule based: Tagging is performed based on a large set of
        handwritten rules (for example Voutilainen & Heikkila, 1993 ;
        Heikkila, 1995).
     2. Stochastic: Tagging is based on probability calculations and requires
        training on a manually tagged training corpus (Stolz et al., 1965 ;
        Bahl & Mercer, 1976).
     3. Transformation based: A combination of the above.
        Transformation based taggers are commonly called Brill taggers,
        after their inventor (Brill, 1995).

  [66] A common transliteration for these characters is ae for æ, oe for ø and aa for å. For Norwegian texts,
 only the "aa" is in some use today, and then almost exclusively in proper nouns and names (person names
 and geographic locations).

The part-of-speech tagger we use is implemented by Nordgård (2002).
It is a stochastic tagger based on a Hidden Markov Model and
implements the Viterbi algorithm (Vintsyuk, 1968 ; Bahl & Mercer,
1976). The tagging process computes the most likely part-of-
speech tag sequence for the whole sentence. It uses a quite coarse
grained tagset for the 11 most common Norwegian word classes. The
tagset is shown in table 6.2, where it is compared to the relevant extract
from the Penn Treebank tagset (45 tags in total) (Marcus et al., 1993).
                                                              Table 6.2
           Penn tag   Description                                   Example                   TN tag
           CC         coordinating conjunction                      and                       KONJ
           CD         cardinal number                               1, third                  TALL
           DT         determiner                                    the                       DET
           EX         existential there                             there is                  -
           FW         foreign word                                  d'oeuvre                  -
           IN         preposition/subordinating conjunction         in, of, like              PREP
           JJ         adjective                                     green                     ADJ
           JJR        adjective, comparative                        greener                   ADJ
           JJS        adjective, superlative                        greenest                  ADJ
           LS         list marker                                   1)                        -
           MD         modal                                         could, will               HJVERB
           NN         noun, singular or mass                        table                     SUBST
           NNS        noun, plural                                  tables                    SUBST
           NNP        proper noun, singular                         John                      SUBST
           NNPS       proper noun, plural                           Vikings                   SUBST
           PDT        predeterminer                                 both the boys             PREP
           POS        possessive ending                             friend's                  -
           PRP        personal pronoun                              I, he, it                 PRON
           PRP$       possessive pronoun                            my, his                   PRON
           RB         adverb                                        however, usually, here    ADV
           RBR        adverb, comparative                           better                    ADV
           RBS        adverb, superlative                           best                      ADV
           RP         particle                                      give up
           TO         to                                            to go, to him             Å – INF_MERKE
           UH         interjection                                  uhhuhhuhh
           VB         verb, base form                               take                      VERB
           VBD        verb, past tense                              took                      VERB
           VBG        verb, gerund/present participle               taking                    VERB
           VBN        verb, past participle                         taken                     VERB
           VBP        verb, sing. present, non-3rd                  take                      VERB
           VBZ        verb, 3rd person sing. present                takes                     VERB
           WDT        wh-determiner                                 which                     PRON
           WP         wh-pronoun                                    who, what                 PRON
           WP$        possessive wh-pronoun                         whose                     PRON
           WRB        wh-adverb                                     where, when               - ?? (ADV)
                 Excerpt and examples from the Penn [67] tagset compared to the TN tagset we apply.

A previously tagged corpus is used to compute bigrams for training of
the Markov model. Currently the tagger is trained on a small manually
tagged training set. The tagger today has an accuracy between 90 and
95%. A small training set affects the accuracy on
unknown words. The tagger applies three strategies to handle unknown
words:


   1. The ending of the unknown word is matched against a list of
      suffixes from a Norwegian dictionary, which also specifies the word
      class for words with this suffix.
   2. The ending of the word is matched with the suffixes of words with
      open ended word classes found in the training corpus.
   3. The ending of the word is matched against the regular inflection
      rules for Norwegian word classes.
If all of these fail, the most frequent tag in the training corpus is
selected, i.e. considered as the most probable word class.
Input:                  Ct          a training corpus of manually tagged documents.
                        S           a set of sentences to be tagged
                        T           a set of tags (i.e. labels) for the word classes we wish to
                                    assign.
                        Lwt         list of word-tag probabilities. Calculated from training
                                    corpus.
                        Ltt         list of tag-tag sequence probabilities (bigram).
                                    Calculated from training corpus.
                        Lsuff       list of word classes for common suffixes (from lexicon).
                        Lsuff_t     list of suffixes computed from open class words in
                                    training corpus.
                        Linflect    a list of common inflection-endings.
                                  Computed from training corpus.
Function:               lookup(w, Lwt)  a lookup function that finds probable tags for a
                                        word.
                        match(w, L)     a matching function that matches word endings
                                        with known or computed suffixes and inflections.
                        For each sentence si in S
                              1.   For each word wi in si
                                       a. possible tags for wi = lookup (wi, Lwt)
                                       b. if wi not in Lwt, unknown (wi)
                                       c. handlespecialcase (wi)
                              2.   For wi in si
                                       a. Tag wi = ti such that the tag sequence t1..tn maximizes
                                          Πi P(wi | ti) * P(ti | ti-1)
                              3.   unknown(wi)
                                        a.   Tpossible = match(wi, Lsuff) || match (wi, Lsuff_t)
                                             || match(wi, Linflect)

Result              The result of the tagging process is to add an XML tag with the word class tag for each word in
                    the result document.
Example                <w pos="0" v="prosjekt">
                       <w pos="1" v="for">
                       <w pos="2" v="videreutvikling">
Dependencies:       Language detection: The part of speech tagger is language dependent

                                        Part-of-speech tagging
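To make the tag selection step concrete, the following Python sketch shows a generic Viterbi search for a bigram Hidden Markov Model together with a simple suffix fallback for unknown words. It is not the Nordgård tagger: the function names, the start symbol `<s>`, the constant `floor` used for unseen events and the longest-suffix-first matching are all assumptions of this illustration.

```python
import math


def viterbi(words, tags, p_word_given_tag, p_tag_given_prev, start="<s>", floor=1e-6):
    """Most likely tag sequence under a bigram HMM.

    p_word_given_tag[(word, tag)] and p_tag_given_prev[(tag, prev)] hold
    probabilities estimated from a tagged corpus; missing entries fall back
    to `floor`, a crude smoothing constant assumed for this sketch.
    """
    if not words:
        return []

    def emit(w, t):
        return math.log(p_word_given_tag.get((w, t), floor))

    def trans(t, prev):
        return math.log(p_tag_given_prev.get((t, prev), floor))

    # best[i][t] = (log probability of the best path ending in t, previous tag)
    best = [{t: (trans(t, start) + emit(words[0], t), None) for t in tags}]
    for w in words[1:]:
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p][0] + trans(t, p))
            column[t] = (best[-1][prev][0] + trans(t, prev) + emit(w, t), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


def guess_tags_by_suffix(word, suffix_classes):
    """Return the word classes of the longest matching suffix, or None."""
    for length in range(len(word), 0, -1):
        classes = suffix_classes.get(word[-length:])
        if classes:
            return classes
    return None
```

Working with log probabilities turns the product over the sentence into a sum and avoids numerical underflow on long sentences.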

6.2.6.   Lemmatiser
Lemmatisation is the act of transforming a term into its base form or
lemma. A lemma, according to (Jurafsky & Martin, 2000) is an
abstraction, denoting the set of lexical forms having the same stem, the
same part of speech and the same word sense. In other definitions, such
as (Merriam-Webster, 2003), lemma is defined as the dictionary entry
for a term. In order to avoid confusion, we will use the notion of base-
form to represent the dictionary entry for a given term.
By transforming a term into its base-form, lemmatisation removes the
lexical variants of a term caused by inflection or other morphological
variation. In traditional IR, the transformation of terms into a lemma is
considered a normalisation of the index, which may remove ambiguity
caused by inflection and may also reduce the size of the index
(Baeza-Yates & Ribeiro-Neto, 1999).
For Norwegian, we have implemented a dictionary based lemmatiser,
based on the NKL Norkompleks lexicon, containing approximately
250.000 forms (Nordkompleks, 2000). In the cases where a term may
have several possible base-forms, the word class determined by the part
of speech tag is used to determine the correct base form.
So far we have only implemented lemmatisation for Norwegian, as
we only have access to proper dictionaries for Norwegian. While
lemmatisation requires language-specific dictionaries, an alternative
approach is to apply available standard stemming algorithms, such as
the Porter stemmer for English (Porter, 1980). Stemming
algorithms reduce a term to its main morphological unit, the stem,
usually through a series of rewrite rules that do not rely on exhaustive
lexicons (Frakes & Baeza-Yates, 1992). The value of lemmatisation in
English IR is still a question of debate (Baeza-Yates & Ribeiro-Neto,
1999).
In our approach, multi-word noun phrases are simply lemmatised word
by word. In the general case, however, multi-word phrases introduce
additional problems for normalised matching, including permutations
and phrase head detection. Sometimes, permutation is considered a
part of the normalisation of multi-word noun phrases. Arppe (1995)
suggests a strategy for permutation of multi-part NPs around
prepositional phrases (“bilen til Ole” -> “Ole bil”, “uncertainty principle
of quantum mechanics” -> “quantum mechanics uncertainty principle”,
“exact form of the correct theory of quantum gravity” -> “exact correct
quantum gravity theory form”). Word order is clearly an issue that
affects normalised matching, but it is difficult to devise generic
strategies. Other strategies detect the phrase
head and then order and group the remaining words (“modifiers”)
according to this (Arppe, 1995; Voutilainen, 1993).

In our approach, no permutation of words is performed. Norwegian has
a higher degree of compounding than English (“informasjonssystem” vs
“information system”), so the number of multi-word NPs will be smaller,
but this is still an issue to explore.

 Input:                  L            a lexicon of word forms also containing word-classes
                                      for each word.
                         Ds           a pos-tagged document sequence of words to be
                                      lemmatised.

 Function:               lookup (word, L)               a lookup function that finds the possible
                                                        base forms for a word

                               For each word wi in Ds
                               1. Possible base forms = lookup(wi, L)
                               2. If several possible baseforms
                                       a. For each baseform bf
                                       b. Match (pos-tag (wi) , pos-tag (bf) )

 Result              The result of lemmatisation is an added XML tag for each word in the result document containing
                     the base form of the word.

 Example                 Lemmatisation of single word terms:
                         <w pos="1" v="somatiske">
                         <w pos="2" v="undersøkelser">
                         Lemmatisation of phrases (multi word terms):
                            <np pos="1" v="somatiske undersøkelser helseundersøkelser
                                  <npstem>somatisk undersøkelse helseundersøkelse

 Dependencies:       Part-of-speech tagging, in order to determine the word class in multiple stem situations
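The dictionary lookup with word class disambiguation can be sketched as follows in Python. The lexicon layout (a mapping from word form to a list of base form and word class pairs) and the fallback behaviour for unknown forms are assumptions of the sketch and do not reflect the actual format of the NKL lexicon; the entries in the toy lexicon are for illustration only.

```python
def lemmatise(tagged_words, lexicon):
    """Return the base form of each (word, pos_tag) pair.

    lexicon maps a word form to a list of (base_form, word_class) pairs.
    When several base forms are possible, the one whose word class equals
    the part-of-speech tag is chosen.
    """
    result = []
    for word, tag in tagged_words:
        candidates = lexicon.get(word.lower(), [])
        matching = [base for base, word_class in candidates if word_class == tag]
        if matching:
            result.append(matching[0])
        elif candidates:
            result.append(candidates[0][0])  # no class match: take first entry
        else:
            result.append(word.lower())      # unknown form: keep it unchanged
    return result


# Toy lexicon for illustration only.
LEXICON = {
    "somatiske": [("somatisk", "ADJ")],
    "undersøkelser": [("undersøkelse", "SUBST")],
    "sykler": [("sykle", "VERB"), ("sykkel", "SUBST")],
}
```

The form "sykler" shows why the part-of-speech tag is needed: tagged as a noun it is reduced to "sykkel", tagged as a verb to "sykle".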


6.2.7.       Reduction (Filter stop words & word-classes)
Not all words in the document are interesting for our further analysis. In
order to increase the efficiency of the later analyses, we therefore
remove unwanted words from the text. How much to remove is a
question of the required accuracy of the later analyses. In our case, we
are only interested in noun phrases as concept candidates and the verb
phrases that relate these, thus we perform a quite crude phrase
detection analysis and we may remove quite large parts of a sentence in
order to save both space and computational time. The drawback of
removal is that even if a word in the general case is uninteresting
semantically, it could in fact be a necessary part of a larger phrase – a
phrase that will no longer be detected if the word is removed. The
classic example here is Shakespeare's "to be or not to be" – which in
general reduction strategies would be reduced to the single word "not".
In our approach we remove words according to two different strategies:

     • Stop words: We remove all words that are considered stop-words.
      Stop words (e.g. Jurafsky & Martin, 2000) are high frequency words
      considered to carry little semantic weight, and thus uninteresting
      for semantic parsing or IR tasks. Separate stop word lists have to
      be maintained for each language in question. We do not currently
      remove stop-words for Norwegian, but we filter out words
      according to their specificity for a given domain (section 6.2.9).

     • Word-classes: Based on results of the tagging, we remove all
      words that are tagged according to the word-classes not
      necessary for phrase detection. Currently, tagging and phrase
      detection are only performed for Norwegian. The accuracy
      of word class based removal is naturally dependent on the tagset
      applied in the part-of-speech tagging.
For Norwegian, we currently remove: Pronouns (pron), Determiners
(det), Ordinals (ord).
We are left with words tagged with the following word classes:
Prepositions (PP), Nouns (NP), Adverbs (ADV), Adjectives (ADJ), verbs
(VP), and coordinating conjunctions (CC). Currently, we only use nouns,
adjectives and verbs in the further analysis process.

Input:                 Lsw          a list of stop words.
                       Lwc          a list of removable wordclasses

Function:              For each word wi in Ds
                           a. If ( lookup(wi, Lsw) || lookup (pos-tag(wi), Lwc) )
                                // remove the entry for this word, including its tags
                                Remove wi

Result             The result of reduction is simply a reduced or pruned XML result document where all word
                   entries that matched either a stop word or a removable word class are removed, including all
                   enclosed information for this entry (pos-tag, base-form)

Dependencies:      Dependent on part-of-speech-tagging, if word-class based removal is used. If part-of-speech
                   tagging is part of the analysis process, removal should in any case be performed after the
                   tagging step, since removal of words will alter the probability calculations used by the tagger.
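The reduction step amounts to a simple filter over the tagged words, as the following Python sketch illustrates. The function name `reduce_words` and the tag labels PRON, DET and ORD used as defaults are assumptions that follow the word classes named above; the stop word list is empty by default, in line with our current practice for Norwegian.

```python
def reduce_words(tagged_words, stop_words=frozenset(),
                 removable_classes=frozenset({"PRON", "DET", "ORD"})):
    """Drop (word, tag) pairs that are stop words or whose word class
    is listed as removable; everything else is kept in order."""
    return [
        (word, tag)
        for word, tag in tagged_words
        if word.lower() not in stop_words and tag not in removable_classes
    ]
```

Because the filter works on already tagged words, it leaves the input to the tagger untouched, in line with the dependency noted in the box.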


6.2.8.   Phrase detection
As mentioned in chapter 2, detecting concepts or phrases from text is
currently applied in several approaches to advanced IR, in KM systems
or in ontology learning approaches. In most cases this includes a
combination of statistical and linguistic techniques. Purely linguistic
phrase detection is found in several applications (Voutilainen,
1995 ; Voutilainen, 1993 ; Neumann & Schmeier, 2002 ; LePriol, 1999 ;
Schiller, 1996 ; Klavans & Muresan, 2000).
In our case, we are interested in:

   • Detection of arbitrary Noun Phrases. We define a noun phrase as
    a sequence of zero or more adjectives followed by one or more
    nouns (Adj*N+). In the general case, adverbs are also sometimes
    used in noun phrases (e.g. “very large data bases”) but we do not
    consider this. Such constructs are rarer in Norwegian, where
    the adverb is often substituted or contracted with an adjective
    (e.g. “kjempestore databaser”).

   • Detection of arbitrary Verb Phrases, possibly connecting the Noun
    Phrases. We only record verb phrases from sentences that contain
    a noun phrase. The reason for this is that our interest in verb
    phrases is limited to possible suggestions for relation names.

   • Detection of proper names (Sankt Olavs Hospital) or acronyms
    (St.Olav HF).

Since we have already performed part-of-speech tagging, detection of
noun phrases and verb phrases can be performed based on the word-
classes determined by the tagger. For detection of proper names and
their acronyms, we currently rely on manually compiled lexicons for
each domain. The phrase patterns we want to detect are specified as
regular expressions over part-of-speech tags.

The phrase detection is implemented as a finite state transducer
(FST) (Jurafsky & Martin, 2000). An FST accepts input:output pairs,
where the regular expression matching and state transitions are
calculated from the input part, while the result is composed of
the output parts of the accepted pairs. In our case, the pos tag is
the input, while the base form of the word is the output that is used
to compose the accepted phrases.

 Input:                  Phrase rules, expressed as regular languages over pos-tags:
                          NounPhrase = Adj* N+
                          VerbPhrase = (Np)+ V || V (Np+)//verb phrases are single verbs
                                                           // but are only recorded from
                                                           // sentences with a noun phrase

 Function:               acceptTag(state, tag) : a function that determines if the current
                                                 tag can be accepted in the given state

                         finalstate(state) : a function which determines whether the current
                                            state is an acceptable final state

                         finalisePhrase(phrasestring) : a function that finalises the
                                                        accepted phrases

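                         For each sentence si

                             currentstate = init
                             phrase = ""
                             For each wi in si
                               wtag = tag(wi)
                               If acceptTag(currentstate, wtag)
                                     phrase = phrase + baseform(wi)
                                     currentstate = state reached by accepting wtag
                               Else
                                     If finalstate (currentstate)
                                             finalisePhrase(phrase)
                                     currentstate = init
                                     phrase = ""
                                     // wi is then examined again from the init state
                             If finalstate (currentstate)
                                     finalisePhrase(phrase)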

 Result              Phrase detection is performed on sentence level. For each sentence, detected phrases are
                     added in a separate XML section immediately before the end-of-sentence XML tag.

 Dependencies:       Dependent on part-of-speech-tagging.

                                           Phrase detection
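For the noun phrase pattern Adj*N+, the behaviour of the transducer can be sketched with a small hand-coded state machine in Python. This is an illustration of the pattern only, not a general FST implementation: the function name `detect_noun_phrases` is hypothetical, the tags ADJ and SUBST are taken from the TN tagset, and the decision to start a new phrase when an adjective follows a noun is an assumption of the sketch.

```python
def detect_noun_phrases(tagged):
    """Return the maximal ADJ* SUBST+ sequences of one sentence.

    tagged is a list of (base_form, tag) pairs; the tag drives the state
    transitions and the base form is the output that builds the phrase.
    """
    phrases, adjectives, nouns = [], [], []

    def flush():
        if nouns:  # final state: at least one noun has been read
            phrases.append(" ".join(adjectives + nouns))
        adjectives.clear()
        nouns.clear()

    for base, tag in tagged:
        if tag == "ADJ":
            if nouns:  # an adjective after nouns starts a new phrase
                flush()
            adjectives.append(base)
        elif tag == "SUBST":
            nouns.append(base)
        else:  # any other tag ends the current phrase candidate
            flush()
    flush()
    return phrases
```

Adjectives that are not followed by a noun are discarded, since the pattern requires at least one noun before a phrase is accepted.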

6.2.9.       Frequency analysis (Counting and Selection)
At this point in the analysis process, we have detected all the words in
the domain document collection as well as all phrases, which will
inevitably result in a huge set of terms and phrases, even if reduction is
performed. In order to select the most prominent terms of the domain
specific collection, we perform a simple frequency analysis. In its
simplest form, frequency-based selection can be performed just by counting
the raw frequency of terms: after removal of the most common stop
words, all remaining high frequency words should supposedly carry
meaning. However, we adopt a technique for frequency-based selection
that calculates a measure of specificity for a term in a collection, the
so-called Ahmad's weirdness coefficient (Ahmad, 1994). This component
can be configured to perform the analysis either on individual words
or on the detected phrases.

164                                             Linguistic analysis for semantic document modelling
Input:                 Ddm         Domain specific document collection
                       Dgen        Generic language document collection – the contrast
                       I   the set of items to be counted: words, phrases or both.

                            1.    For each document Di in Ddm and Dj in Dgen respectively
                                      a. Count number of occurrences for distinct Items
                            2.    Count total number of Items in Ddm and Dgen respectively

                            3.    For each item i
                                      a. Weirdness coefficient of i =
                                                 (# occurrences of i in Ddm / # items in Ddm)
                                                 / (# occurrences of i in Dgen / # items in Dgen)
Result:              Items in a sorted list of decreasing specificity (weirdness coefficient).

Example                <rank class="weirdness"
                       <rf w="kommisjon" freq="1" nfreq="8.301993E-6">0.0058156564
                            <rf w="klinisk unders&#248;kelse" freq="20"
                            <rf w="helseorganisasjon" freq="86"
Dependencies:        Operates on phrases, not documents; dependent on the whole phrase detection process. The
                     weirdness calculations also require a pre-processed “reference” collection.

                            Computation of the weirdness coefficient
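
As a minimal sketch of the computation above, the following Python function counts items in both collections and ranks the domain items by their weirdness coefficient. The function name and the representation of a collection as a list of item lists are assumptions made for illustration. Items that never occur in the reference collection have an undefined ratio; reporting them as infinity is a choice of this sketch.

```python
from collections import Counter


def weirdness(domain_docs, reference_docs):
    """Rank items by relative frequency in the domain collection divided
    by relative frequency in the reference collection.

    Each collection is a list of documents; each document is a list of
    items (words or phrases).
    """
    dom = Counter(item for doc in domain_docs for item in doc)
    ref = Counter(item for doc in reference_docs for item in doc)
    dom_total = sum(dom.values())
    ref_total = sum(ref.values())

    scores = {}
    for item, count in dom.items():
        if ref[item] == 0:
            # Unique to the domain collection: the ratio is undefined,
            # so this sketch reports infinity (an assumption).
            scores[item] = float("inf")
        else:
            scores[item] = (count / dom_total) / (ref[item] / ref_total)
    # Most specific items first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A coefficient above 1 means that the item is relatively more frequent in the domain collection than in the reference collection.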

6.2.10. Manual interactive analysis
The automatic analysis results in a ranked list of terms, together with
linguistic information attached to all words in the documents. The
interactive part of the process allows users to query and examine the
results of the previous analysis, in order to:

   • Extract terms and term variants from the corpus

   • Examine occurrences of terms

   • Define terms

   • Create subsets of terms (in the result) for further analysis

   • Extract a list of related terms for each term.

Semantic modelling of documents                                                                      165
                                     Figure 6.4

                                Manual interactive analysis

In addition, a standard dictionary query interface (DICT, see footnote 68) allows for
extraction of definitions and word-forms from available online
dictionaries. For Norwegian, we currently support the definitions and the
full-form lexicon of Norkompleks (Norkompleks, 2000).
Term variant extraction and occurrence examination
Extraction of terms simply amounts to locating the term in the sorted
list of terms provided by the result document. A term is represented
by its base form in this list. Variant extraction is then performed
either:
       1. By extracting all full-forms that are registered for this base-form.
       2. By extracting all terms that contain this term, i.e. as a suffix,
          prefix or an infix.
       3. By looking up the term in a full-form dictionary.
Occurrences of a term are extracted by collecting all sentences that
include this term.

Related word analysis
So far, the analysis process has only produced a list of terms. Further
analysis is needed in order to identify relations between terms. Users
can choose to perform this analysis on all terms or on a subset of terms.
Our analysis of relatedness is based on co-occurrence counts, and is
conducted as follows:


   1. Compute a term – document section occurrence matrix. We count
      occurrences of terms within each document section. A document
      section is one of Document, Paragraph and Sentence, i.e. the
      structure levels detected by the extractor.
   2. Compute a similarity measure for each term – term combination.
      A row in the term-document section matrix represents the term-
      document occurrence vector for this term. The similarity measure
      is then simply based on comparing the two vectors for the two
      terms. Currently, we apply the standard cosine similarity measure
      for comparing the two vectors:
       $$\mathrm{sim}(t_1, t_2) = \frac{\sum_{i} f'(t_1, d_i)\, f'(t_2, d_i)}{\sqrt{\sum_{i} f'(t_1, d_i)^2}\;\sqrt{\sum_{i} f'(t_2, d_i)^2}}$$

       The raw occurrence frequency is dampened so that:
       $$f'(t, d_i) = 1 + \log f(t, d_i) \ \text{if } f(t, d_i) > 0, \qquad f'(t, d_i) = 0 \ \text{otherwise}$$
   The dampening function simply plays down the importance of term
   frequency, i.e. a term that occurs 3 times within a document is more
   important than one that occurs only once, but not necessarily three
   times as important (Manning & Schutze, 2000).
   3. Extract the n most “similar” terms for each term.
   4. If desired, extract all sentences that include two related terms.
Some examples of discovered relations are shown in table 6.3.

                                                     Table 6.3
      Term                       Related term                                   Similarity measure [0, 1]
      Mobbing (“bullying”)       mobbing                                        1
                                 sosial ferdighet (“social skill”)              0.538840267
                                 skole (“school”)                               0.484954924
                                 trivsel (“well-being”)                         0.432441213
                                 rusmiddel (“drugs”)                            0.42924933
                                 ernæring (“nutrition”)                         0.42354378
                                 selvmord (“suicide”)                           0.407053963
                                 ungdom (“youth”)                               0.403442849
                                 elev (“pupil”)                                 0.394998416
      Ernæring (“nutrition”)     Ernæring                                       1
                                 Helseopplysning (“Health information”)         0.574623654
                                 Kosthold (“eating habit / diet”)               0.572780457
                                 sped- (“newborn”)                              0.54625895
                                 småbarn (“infant”)                             0.544175638
                                 generell informasjon (“general information”)   0.54397925
                                 ungdom (“youth”)                               0.543597618
                                 forelder (“parent”)                            0.53765619
                                 amming (“breast feeding”)                      0.520782129
                                           Example discovered relations
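
The relatedness computation described above (occurrence counts per document section, dampened frequencies, cosine similarity, and selection of the n most similar terms) can be sketched as follows. This is an illustration only: the function names and the sparse dictionary representation of the term vectors are assumptions, and a document section is simply taken to be a list of terms.

```python
import math
from collections import Counter


def dampen(freq):
    """f'(t, d) = 1 + log(f(t, d)) for f > 0, and 0 otherwise."""
    return 1.0 + math.log(freq) if freq > 0 else 0.0


def term_vectors(sections):
    """Build term -> {section index: dampened frequency}.

    sections is a list of document sections, each a list of terms.
    """
    vectors = {}
    for idx, section in enumerate(sections):
        for term, freq in Counter(section).items():
            vectors.setdefault(term, {})[idx] = dampen(freq)
    return vectors


def cosine(v1, v2):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v2[i] for i, w in v1.items() if i in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def most_similar(term, vectors, n=3):
    """Return the n terms most similar to term, best first."""
    scored = [(other, cosine(vectors[term], vec))
              for other, vec in vectors.items() if other != term]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:n]
```

Storing only the non-zero entries keeps the vectors small when the section level is Sentence, where most terms occur in very few sections.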

The related term analysis provides the user with a list of related terms
and optionally also a list of sentences where the terms co-occur. The
user still has to select what they accept to be proper relations and what
is noise from the analysis process. This analysis does not distinguish
between different kinds of relations, such as hierarchical or binary
relations; these are left as modelling decisions for the user. The analysis
does not provide suggestions for relation names, but only lists the
sentences where the two terms co-occur. In a previous version of the
approach, we experimented with a sentence analysis approach that
would suggest relation names (appendix A), but this was discarded since
it is not applicable in cases where the two terms do not occur within the
same sentence and since it did not produce “natural” relation names,
but rather awkwardly fragmented phrases.
Other strategies for extracting relations exist (Maedche & Staab, 2001 ;
Agichstein, 2000 ; Hatem & Latiri, 2003). In general, co-occurrence
based algorithms, such as ours, can only provide unnamed relations and
at best a weight indicating the strength of this relation given the
evidence found in the analysed corpus. Strategies for discovering named
relations include:
   • Pattern searching: Byrd & Ravin (1999) define patterns of
      “cue-words” in support of the relations they want to detect.
      Examples are “<person> is the CEO of <company>”, “<person> of
      <company>” and “<person> (<company>)”. Such patterns can
      produce evidence and instances for pre-defined relations, or the
      patterns can be used as seed-relations for an analysis that detects
      similar relations within the document collection (Agichstein et al.,
   • Co-location: Co-location analysis can analyse the text that surrounds
      two co-located terms. Two terms are co-located if they occur
      within a distance of n words of each other. In this
      case, grammatical analysis or pattern analysis can detect either
      predefined relations (such as agent-of, instrument-of) or discover
      lexical patterns that commonly combine the two terms.
   • For compound noun-phrases, analysis of the compound itself can
      be used to detect relations (Rosario & Hearst, 2001), for example
      by analysing attachments and roles within a noun-compound, i.e.
      discovering how the parts of the compound attach to each other
      and the role of the attachment.
Many of these approaches require input, such as a pre-existing concept
hierarchy (Maedche & Staab, 2001 ; Srinkant & Agrawal, 1995) or
patterns. One benefit of our approach is that it requires no input at
all, other than the text collection itself. The goal of detecting relations in
our approach is to simplify the task of the modeller. If elaborate
preparations and knowledge of what relations to look for were needed
to discover relations in the corpus, it might simply be easier to define
the relation directly in the model. However, if a modeller is uncertain as
to whether or not a relation in the model holds, the strategies presented
above could be used to verify the existence of such a relation in the
document collection.

6.3. Text to model matching
In our approach, text to model matching is applied for two purposes:

   1. At classification time, the full text of a document is matched
      against the model in order to provide the user with a
      suggestion of concepts in the documents

   2. At query time, the text of the query is matched against the model
      in order to transform the query into a model-based query.

Matching is performed by matching a sliding window of words (i.e. a
sequence of words) from the text (document or query) against all the
terms in the domain model lexicon. A sequence of words is needed
since a term in the lexicon can consist of several words. If the
sequence matches a term, the corresponding concept or relation is
selected as a suggestion in the domain model. The window size is set
to the number of words in the longest multiword term in the domain
model lexicon. The matching is performed by first matching the full
window and then parts of the window:
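Process:                    ws = sequence of n words
                            dml = set of all terms in the domain model lexicon

                            forall ti in dml
                                        if match(ws, ti)
                                                    select the concept or relation of ti
                                        else forall split positions pos in ws
                                                    (head, tail) = split(ws, pos)
                                                    if (match(head, ti) || match(tail, ti))
                                                                select the concept or relation of ti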
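
A simplified Python sketch of the sliding window idea is given here. It is not the prototype's matcher: the function name, the lexicon layout (a dictionary from word tuples to concept names) and the concept names in the usage example are assumptions. The sketch tries the full window first and then ever shorter prefixes, uses exact matching only, and relies on the sliding of the window to reach terms that start later in the text. It assumes a non-empty lexicon.

```python
def match_text(words, lexicon):
    """Slide a window over words and collect suggested concepts.

    lexicon maps a term (tuple of lower-cased words) to a concept name.
    Returns a list of (position, term, concept) suggestions.
    """
    # Window size = number of words in the longest lexicon term.
    window_size = max(len(term) for term in lexicon)
    suggestions = []
    pos = 0
    while pos < len(words):
        window = tuple(w.lower() for w in words[pos:pos + window_size])
        # Try the full window first, then ever shorter prefixes of it.
        for length in range(len(window), 0, -1):
            candidate = window[:length]
            if candidate in lexicon:
                suggestions.append(
                    (pos, " ".join(candidate), lexicon[candidate]))
                pos += length  # skip the words consumed by this match
                break
        else:
            pos += 1  # nothing matched at this position
    return suggestions
```

For example, with a lexicon containing the two-word term ("klinisk", "undersøkelse") and the one-word term ("skole",), the text "En klinisk undersøkelse på skole" yields one suggestion at word position 1 and one at position 4.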

In IR systems, different matching strategies exist for matching the
text of a query with the index words or the full text of the indexed
documents. Approximate matching (Navarro, 2001; Hu, 1999;
Baeza-Yates & Ribeiro-Neto, 1999; Hall & Dowling, 1980) is applied
in IR systems in order to handle lexical variations between query and
index terms, for example to handle misspellings. A common
distinction is between detection of equivalent terms and similar terms
(Hall & Dowling, 1980). Detecting equivalent terms comprises the task
of detecting possible variants of a preferred or canonical term, such
as “database”, “data base” and “data-base”. Detecting similar terms is
concerned with detecting terms that are misspelled or are otherwise
minor lexical variants of the term in question.

Approximate matching can be performed through various
mechanisms (Hall & Dowling, 1980 ; Navarro, 2001):

   • Thesauri: The simplest approach to detecting equivalent terms is
      by using a thesaurus where equivalent terms are listed for each
      preferred term. The set of equivalent terms is used to lead the
      actual query-to-index matching to the canonical form, i.e. the
      preferred term. Soergel (2003) argues for the use of a large
      lead-in vocabulary in thesauri for IR tasks.
   • Suffix matching: Several languages have common word endings
      or suffixes that do not transform or significantly change the word
      meaning, for example match, matches, matching. Suffix matching
      can be performed either by detecting and removing the suffixes
      before matching takes place, or through transformation rules that
      transform the word into a root form before matching. In some
      systems more generic pattern matching is applied by
      matching a word as a substring of other words, thus also handling
      common prefixes and words that are infixed in other words
      (Baeza-Yates & Ribeiro-Neto, 1999).
   • Transformations: Transformation based matching is used either
      to transform each word into a root form, as in suffix matching,
      or to transform words into spelling variants in order to match
      possibly misspelled words. For the latter, syntactic based
      transformations, we have edit distance algorithms and phonetics
      based algorithms:
       -   Edit-distance algorithms can match strings by
           computing the cost of editing one string into another,
           through the edit operations insert character, delete
           character and substitute character. Each edit operation
           is assigned a cost and a threshold is set to limit how
           many operations can be allowed before the strings
           cannot be said to be similar. Common examples of such
           algorithms are the Levenshtein distance (Levenshtein,
           1996) and Hamming distance algorithms.
       -   Phonetics based algorithms are intended to handle the
           kind of spelling errors that are likely to occur when a
           word is typed “as it sounds”. Soundex (Knuth, 1973)
           and Metaphone (Phillips, 1990) are common phonetics
           based transformation algorithms.
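
To make the edit-distance mechanism concrete, the sketch below computes the Levenshtein distance with a unit cost for each of the three edit operations and applies a threshold. The function names and the default threshold of 2 are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insert, delete and substitute."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def is_similar(a, b, threshold=2):
    """Two strings count as similar if few enough edits separate them."""
    return levenshtein(a, b) <= threshold
```

Only two rows of the distance table are kept at any time, so memory use grows with the length of one string rather than with the product of both lengths.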

Transformation based algorithms can be computationally costly, and
it is generally not considered feasible to spell check every word in all
documents to be indexed. An alternative is to apply transformations
to query terms to check if a transformed word produces more hits
in the index. Transformations can be added to a query or the
transformed variant can be presented to the user, such as in the
Google web-search interface where the transformed variant is
presented as “did you mean…”.

In our approach, the domain model lexicon constitutes the equivalent
of a thesaurus containing terms that lead to a concept. All equivalent
lexical variants of a concept are the terms that are stored in the
domain model lexicon for this concept. As presented before, both
dictionaries and term-variant extraction from the domain document
collection are used in order to generate the domain model lexicon.

Approximate matching is still considered relevant in our approach,
for two reasons:

   • Even if supported by dictionaries and term-variant
     extraction, the domain model lexicon is still manually built. The
     domain model lexicon would have to be exhaustive in order to
     improve the quality of the matching. This is a tedious task.

   • The domain model lexicon does not contain spelling errors. Term
     extraction algorithms could detect all spelling variants that are
     located in the domain document collection, but not variants and
     errors that can be entered in a query.

We therefore apply approximate matching as follows:

   • In order to facilitate construction of the domain model lexicon, we
     enable the entry of matching patterns that can be used to match
     known variants of the term. Currently, matching patterns are used
     for common Norwegian word endings, such as the possessive
     marker “s”, and the common derivation suffixes “-ing”, “-else”,
     “-het” and “-sjon”.

   • In order to handle possible spelling errors in the query, query
     terms that do not match any term in the domain model lexicon
     are transformed by an edit distance algorithm and matched
In our current implementation, matching is performed using regular
expressions, which by nature support pattern matching. Our dictionaries
are stored by way of the DICT protocol, which supports generation of both
Levenshtein and Soundex variants of a word.
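
As an illustration of how such matching patterns could be expressed as regular expressions, the sketch below builds one pattern per lexicon term that accepts the term followed by one optional ending. The list of endings comes from the text above; treating them as optional suffixes appended to a stored term, as well as the function names, are assumptions of this sketch and not a description of the prototype.

```python
import re

# Endings named in the text; appending them as optional suffixes to a
# lexicon term is an assumption of this sketch.
ENDINGS = ("s", "ing", "else", "het", "sjon")


def term_pattern(term):
    """Compile a pattern matching the term with one optional ending."""
    endings = "|".join(re.escape(e) for e in ENDINGS)
    return re.compile(rf"^{re.escape(term)}(?:{endings})?$", re.IGNORECASE)


def match_query_term(word, lexicon_terms):
    """Return the first lexicon term whose pattern matches word, else None."""
    for term in lexicon_terms:
        if term_pattern(term).match(word):
            return term
    return None
```

A query word that returns None here would be a candidate for the edit distance fallback described in the list above.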

6.4. Evaluating the analysis
KITH, the Norwegian competency centre for IT in health services, is
responsible for the definition of terminology documents for the medical
domain. Their terminology documents are handcrafted, but with support
of selected linguistic analysis tools, most notably the Word-smith toolkit
(Scott, M., 1999). In this section, we compare the results produced by
our linguistic analysis with an existing terminology document,
“Definisjonskatalog for helsestasjons- og skolehelsetjenesten” (Nordhuus
& Ree, 2002).
KITH has provided us with the document collections that formed the
basis of their analysis, all prepared in ASCII text:

   • Domain specific collection from the “Helseskole” domain: A
      compilation of 51 documents, average length 7,777 words,
      total size 2.9 MB, total words: 396,600.

   • Reference collection: A compilation of 4 years of generic public
      Norwegian documents (“Norsk offentlig utredning”, NOU), 115
      documents, average length 127,000 words, total size 249 MB,
      total words: 29,500,000.
From this material, KITH has produced a terminology document defining
111 terms from the domain. The terms are defined in a table structure,
as shown in table 6.4.
                                                 Table 6.4
No.      Term      Definition                                                          Cross-reference
67       Patient   A person contacting the health services to request health care,     Departed patient 100
                   or a person that is offered health care by the health services at   Day-care patient 17
                   some time                                                           Guest 23
                                                                                       Patient ready for departure 104
                                                                                       Patient no. 70
                       Example KITH terminology document item. (our translation)

We performed an analysis process consisting of the following steps:

   • DOXML generation

   • Part-of-speech tagging

   • Lemmatisation

   • Phrase-detection

   • Word-class removal

   • Counting

   • Weirdness analysis
The weirdness analysis was performed on the domain specific collection
(“Helseskole”) using the NOU collection as reference. Our analysis
process produced 2366 terms (multi and single word) with
frequency counts and weirdness coefficients computed. Out of these,
1362 terms had a higher normalised frequency in the domain specific
collection than in the generic collection. 38 terms were unique to the
domain specific collection.
The interesting question is how many of the terms we
detect are actually included in the domain terminology document:

                                          Table 6.5
                                  exact    part   head   weirdness
                             #     49      56      49       41
                             %     44      95      87       80

Table 6.5 shows the number of terms from the domain terminology
document that were discovered exactly in our weirdness analysis. In
addition we count the number of terms in the weirdness analysis that
constitute a part of an included term (example: terminology
document: “vaksinasjonsdekning”, weirdness analysis: “vaksinasjon”,
“dekning”). The table shows a coverage of 44% if only exact matches
are counted. If we include all the part terms we detect, the percentage
rises to 95%; that is, we detect a little under half of the terms exactly
and parts of the terms for roughly the other half. If we only include the
part terms where we actually discover the head of the term, the coverage
drops to 87%. Further, if we only include the part terms that are indicated
as more specific to the domain collection (i.e. with a weirdness coefficient
>1), the coverage of detected terms (exact and part) drops to 80%.
Nevertheless, this provides an encouraging measure. Only 5 terms from
the domain model were not detected at all.
The weirdness coefficient is the only measure of specificity that we apply;
values from 1 to infinity indicate domain specific terms.
It is interesting to investigate where in this range we detect the most
terms from the terminology document, in order to check if this is an
appropriate measure. Figure 6.5 shows the weirdness range from 0 to
infinity and the number of terminology terms that are detected within
steps of this range. Most terms are detected in the lower positive
part of this range (1-50). This means that the weirdest terms are not
necessarily good modelling input. An inspection shows that the weirdest
terms are mostly particular word-forms not detected by the lemmatisation,
or acronyms and abbreviations that are probably rarely used. Some of
these are also terms in the alternate official Norwegian written language
(“nynorsk”). These findings compare well with statistics of word use,
e.g. (Manning & Schutze, 2000), where the most rarely used terms often
represent special cases or sometimes even erroneous forms.
We also performed some experiments with detecting relations between
terms (section 6.2.10). While the relations we found seem good at first
glance, such as the ones shown in table 6.3, only a few of these were
included in the domain terminology document from KITH. The reasons
for this may be twofold:
      1. The terminology document does not treat relations in any detailed
         or extensive manner. All of these relations are un-named relations
         marked as cross-references (“see also”).
      2. Our analysis was performed on the weirdest words in the
         collection, without looking at the terminology document. Hence,
         we detect relations between all terms in the weird word list, also
         relations to terms not included in this document.
Because of this, we cannot draw any conclusion about the quality of the
suggested relations. Further experiments are needed. As mentioned,
there are several possible strategies for detecting relations or
associations between terms and it is necessary to experiment with a
variety of these in order to determine the best strategy.

                                      7. The Prototype Realisation

This chapter presents the prototype realisation of our approach. The
chapter is focused on the functionality, rather than the technical details.
A small case study is used as illustration.
The implementation is of prototype quality. A main goal has been to
implement the system in a component based manner, so that relevant
components can be replaced at a later stage. In a given setting, existing
retrieval machinery and document management systems will be
available. To be efficient, our system would have to be integrated with
existing infrastructure. We have tried to enable this by having
components and protocols adhere to relevant standards.

7.1.   Components in the realisation
The system has four main parts:

   • The modelling environment: Our work is conducted within the
     information systems group, which has a tradition for developing
     modelling languages as well as methodology and tool support for
     modelling. We have not developed new modelling tools, but are
     using the current baseline of modelling support from the IS group.

   • The retrieval system: For prototyping purposes, we have
     developed a small set of tools that supports our model-based
     classification and retrieval tasks. This prototype is developed as a
     set of server-side components where we will seek to substitute the
     actual IR components with adequate retrieval machinery at a later
     stage, and simply interface the model-specific components with
     available IR systems.

   • The CnS client: This is the graphical user interface for model
     based classification and retrieval. The implementation of the CnS
     client was to a large extent done as part of a master of technology
     thesis, and is described in more detail in (Steinholm, 2001).

   • The linguistic workbench: The document analysis process is
     implemented by a collection of components that can be tailored
     and sequenced to form a complete analysis process, suitable for
     the actual document collection (Gulla et al., 2004). This is
     performed within the so-called linguistic workbench, which was
     implemented as part of another master of technology thesis, and
     is described in more detail in (Kaada, 2002).

7.2.       The modelling environment
The Referent Model Language (RML) and its corresponding tools are developed at
the Information Systems group at IDI, NTNU. The IS group has a long
tradition of research and development of modelling languages,
methodologies and tool support. RML is a recent language
that originated from the PPP (see footnote 69) integrated modelling environment
(Lindland, 1993), (Krogstie, 1995). PPP initially contained support for
several modelling languages: a Process Model Language (PrM), an
extended ER modelling language (ONE-R) and a rule modelling language
(PLD). It also comprised specifications and partial implementations of
extensive methodology support: versioning mechanisms (Andersen,
1994), view generation (Seltveit, 1994), concepts and notation for
hierarchical modelling (Sindre, 1992), prototyping and execution
(Willumsen, 1991) as well as explanation generation and translation of
models (Gulla, 1993).
Later work has refined the initial modelling languages and also added
new languages. The most recent are the RML concept modelling
language (Sølvberg, 1999) and the APM workflow modelling language
(Carlsen, 1998).
Currently, the whole portfolio of tools is being reengineered and ported
to newer technology.
The toolset we are using for the manual modelling process consists of
the following components:

   • An RML modelling editor. The editor is a standalone Windows tool
      that stores the models as XML files.

   • An XML based model repository with support for consistency
      checking and versioning (in progress).

   • The IGLOO framework for cooperative product development
      (Farshchian, 2001). IGLOO is defined as an “operating system” for
      cooperative support. IGLOO can be used for integrating already
      existing tools into a coherent cooperative environment. The model
      repository will be integrated with IGLOO to enable support for
      cooperative model development.
The cooperative effort in defining the domain model is of importance to
our approach. Cooperation in this respect does not emphasise
versioning or consistency checking of the final model. These are
important aspects and are taken care of by the model repository, but
the truly cooperative aspect of modelling is to enable discussions and
awareness of issues in the model among the modellers working
on it. IGLOO offers shared workspaces, both synchronous and
asynchronous awareness mechanisms as well as annotation
mechanisms. In our setting, these are the kind of cooperative support
mechanisms we would like to emphasise.

7.3. Architecture
An overview of the system architecture is presented in figure 7.1. The
client, denoted CnS client (Classification 'n' Search), is implemented as
a standalone Java application. The CnS client has two modes, as
indicated by the name. For classification, the client supports working
simultaneously on a set of documents that are locally managed in a user
profile. The linguistic enhancement of the model, i.e. the definition of
the domain model lexicon, is also performed using this client. In
the search mode, the client communicates with a standard Web browser
for listing and viewing documents. The CnS client is presented in the
next section.
The server side components are implemented as a variety of add-ons to
a standard Web server; mainly CGI scripts and servlets. The client
communicates with the server by executing HTTP GET/POST commands
and receiving XML encoded replies. As it is our goal to interface our
system with existing retrieval machinery or document storage systems,
we have not developed any extensive solutions to these functions; they
are best considered working prototypes that are applicable for our test
purposes. Currently all server-side storage of information is done using
XML files.

                                                        Figure 7.1
   [Figure: server side components (domain model services, document
   indexer, classification store, linguistic workbench) connected over
   TCP/IP socket communication to the Classification 'n' Search client
   (Java 1.1 application with model viewer and classification) and a Web
   browser. Labelled calls: getModel(Domain name), match(model, document),
   updateDomainLexicon(), getMetaDataAttributes(),
   storeClassification(XML file), getClassification(Doc),
   executeSearch(XML QE). Linguistic workbench components: lemmatization,
   stop-words, tagging, phrase extraction, statistical analysis.]

                                            Prototype system architecture

The main components on the server-side are:

     • Domain model services: These are cgi-scripts written in Perl that
      “deliver” the domain model and its additional files upon request.
      As mentioned, the Referent modeling editor stores models as XML
      files. The domain model lexicon is stored in a separate XML file
      that refers to the concepts and relations in the model file. This
      lexicon is updated from the CnS client.

      The domain services also include the document-model matcher
      that is used to match the text of a document against the concepts
      in the model by using the corresponding domain model lexicon.
      The matcher returns match-sets that describe, for each matched
      concept, the match count as well as the list of sentences that
      included a reference to the concept. The sentences are used to
      provide examples of concept occurrences at classification time
      (see footnote 70). Currently the matcher only accepts pure text
      and HTML documents. The domain model matcher is also used by the
      enhanced document reader, in order to mark up concepts when
      browsing a document.

      A classified document is described in a separate meta-data
      document, denoted Object Descriptor File (ODF), stored in XML.
      The entire meta-data scheme, including the contextual meta-data,
      for this particular setting or this particular domain is defined as
      an XML DTD.

     • The document manager is a simple Java servlet that accepts
      documents and descriptor files and stores these in a document
      directory, providing a unique URL for these. This component is
      only used when necessary in order to ensure that all documents
      can be accessed through a URL, the preferred point of access to
      documents from the other components. If documents to be
      classified already have a stable URL, then there is no need for
      such a component.

    • The indexer is a cgi-script written in Python. The name is
     somewhat misplaced, as no actual indexing takes place in our
     prototype; the script just accepts the final document description
     XML files created by the client and stores and manages these. The
     name indicates, however, that in a real setting this module will be
     replaced by an interface to existing indexing and retrieval
     machinery. The indexer also accepts query expressions in XML
     format, evaluates these against the stored description files and
     formulates the XML containing the query results returned to the
     client.

  70 Sentence examples to be used in the modelling process are extracted as part of the document analysis
  process performed by the linguistic workbench, and follow a slightly different algorithm.
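
As an illustration only, the document-model matcher described above can be sketched in a few lines of Python. This is not the prototype's Perl implementation: the naive sentence splitter, the function names and the lexicon layout (a mapping from concept name to attached terms) are assumptions made for the example.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def match_document(text, lexicon):
    """Return a match-set per matched concept: hit count plus example sentences.

    `lexicon` (hypothetical layout) maps a concept name to its list of terms.
    """
    match_sets = defaultdict(lambda: {"count": 0, "sentences": []})
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        for concept, terms in lexicon.items():
            # Count whole-word occurrences of every term attached to the concept.
            hits = sum(
                len(re.findall(r"\b" + re.escape(term.lower()) + r"\b", lowered))
                for term in terms
            )
            if hits:
                match_sets[concept]["count"] += hits
                match_sets[concept]["sentences"].append(sentence)
    return dict(match_sets)


# Terms in both Norwegian and English can be attached to the same concept.
lexicon = {
    "archiving": ["archive", "arkivering"],
    "coordination": ["coordination", "koordinering"],
}
text = "The archive is updated weekly. Koordinering skjer i møter. We archive all notes."
result = match_document(text, lexicon)
```

Because terms from several languages hang on the same concept, a Norwegian and an English sentence can both contribute to one match-set. A real matcher would also use the word forms from the lexicon rather than exact term strings.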

7.4.     The CnS client
In this section we present the CnS client “by example”, by walking
through a small case study, in which our system was used to classify
project documents for a Norwegian company - Company N.

7.4.1.    The example from Company N
Company N is a large-scale Norwegian company that has been
concerned with document management for several years. Company N’s
main document management system is Lotus Notes. When projects are
carried out, all documentation is stored in separate project databases in
Notes, where documents are organized according to the appropriate
project activity (or sub activity and task). Documents in this sense can
mean anything from small half-a-page notes, written directly in the Lotus
Notes text editor, to special documents like requirements specifications
that may be several hundred pages long and are added to the database
as attachments. For documents written in Notes, the system offers a
free-text search facility. The database naturally structures information
according to properties relevant for projects, such as activity names,
titles, descriptions, responsible persons, deadlines and so on. These are the
contextual meta-data attributes that facilitate browsing and retrieval
from the database.
Even if Lotus Notes provides advanced search facilities, these are
hampered by several factors: Notes search is not efficient when the
volume of documents increases, when there are large variances in
document types, and of course when the document text is not available
in Notes, but only as a “binary” attachment. A consequence of this is
that users feel they have to really know which database to search, and
sometimes even be familiar with the structure and content of the
database in order to successfully find documents.
Company N also uses a web-based Intranet built on top of Lotus Notes.
In order to improve distribution of documents and information, selected
documents are extracted from the Lotus Notes database and published
on the fixed intranet.
Our example from Company N is selected from an actual project
database in Lotus Notes. For the subject domain “collaboration
technologies”, a high-level terminology document with definitions of
central terms from the domain has formed the basis for our domain
model. These terminology definitions are part of an initial “mission
statement” for the project. However, the abstract terms found here are
rarely found in the text of the actual project documents. Thus, in order
to bridge the gap between the texts and the high-level terminology
document, we have to add more detailed concepts to the model. In
particular, we use the general abstraction mechanisms of our modelling
language to specialize the original abstract concepts. Furthermore, we
run the texts through the linguistic analysis in order to extract terms
that can be associated with the concepts in the model. It is interesting
to note that in the Company N example, the linguistic analysis discovers
the same terms in both Norwegian and English. Most of the documents
are written in both languages. Since classification and retrieval are
performed by concepts and we can attach terms in both languages to
these concepts, our approach can classify documents in both languages.

7.4.2.     Classification
Figure 7.2 shows our CnS client in classification mode with a fragment of
the particular domain model and a small set of corresponding
documents. The domain model fragment shows the hierarchy of
collaborative processes, which in the terminology definitions from
Company N are defined as an aggregation of coordination, production and
other activities. The fragment also shows the specialization of the
concepts representing coordination and production activities.

Figure 7.2
The CnS client in classification mode: summary view of matching a list of documents against the model. Labelled screen areas: view tabs, model view, document list.

The classification of a document according to the domain model
amounts to selecting the model concepts considered relevant for this
document. In the classification mode, the toolbar provides the user with
the options of getting suggestions from the server-side model-matcher
as well as quick-keys for accepting or rejecting these suggestions. The
“Fast” classification button lets the user accept whatever suggestions
the server provides without further examination – our alternative to
automatic classification. The document management function allows the
user to manage a local set of documents that are under classification;
these may be divided into folders.
The interface supports the use of models or model views, by showing
each model or model view in a separate tab. The tabs are also used to
present the text of the document to be classified, as well as for providing
an interface for filling out contextual meta-data.
While working in classification mode, a user may switch between
working with one document at a time and a selection of documents
simultaneously – a summary view. Figure 7.2 shows the summary view of
all the selected documents. In this view the user has first received
suggestions from the model-matcher and is now manually refining these
suggestions. Suggestions are marked with a green triangle. In the
document list, documents with unprocessed suggestions are marked
with a small green triangle, while documents where the user has made
actual selections (or accepted the suggestions) are marked with a filled
rectangle.
In figure 7.2, the model-view shows a summary of how all the
documents in the list match the model. Green triangles illustrate
concepts that have suggestions; the size of the triangle (along the
bottom line) illustrates the percentage of documents in which this
concept is suggested. The more suggestions (i.e. the more documents
in which the concept is located), the more the triangle grows from right to left.
Similarly, if a suggestion is accepted or a concept is manually selected,
the triangle becomes a rectangle. As with suggestions, in the summary
view the top of the rectangle grows from right to left according to the
relative number of documents in which this concept is selected. For
example, the concept of archiving is suggested in a little more than half
of the documents, while it is actually selected in roughly one third.
The concept of authoring, however, has only a small number of
suggestions and no actual selections. The blue dots in the upper left
corner of the concepts are a proportional measure of how many hits this
concept had in the last matching, and this number is unaltered by the
actual selections.
The summarizing of matches and selections for a set of documents
allows the users to view how the model-concepts reflect the text of the
documents while illustrating how well a model is able to discriminate
documents in a collection.

Figure 7.3 shows a user working on one single document. The colouring
of the concepts follows the same rules as in the summary view, but with
a single document, the suggestions are shown as a full triangle
(archiving and publishing) and actual selections become a fully coloured
green rectangle (coordination). Users may accept and reject a single
suggestion by clicking either on the green or white part of the concept
respectively, or all suggestions at once by using the toolbar buttons.
When examining classifications for one document, the user can examine
the full text of the document by clicking on the document tab.
Before saving the finished classifications, the user must also enter meta-
data attributes for the documents. For this, a simple tabular interface is
supported. As previously mentioned, two sets of meta-data attributes
are used: the domain-specific and document-related attributes (in the
case of Company N, only document authors and title are used) and the
attributes regarding the classification (classified-by and date of
classification). The classification attributes are added automatically,
while the document attributes can only be added automatically if they
are included as meta-tags in the HTML document.
The client performs a check of the selected document’s classifications
and provides warnings to the user according to some predefined rules.
Currently, the system signals if:

     • Classifications are saved with unprocessed suggestions. In this
      case, the user is given the options to accept the suggestions,
      leave out the document(s) for later examination, or cancel
      the save.

     • A document is classified according to fewer than a specified
      number of concepts. The default number is one. Classifying a
      document according to one concept only represents a minimal
      classification of this document, and is considered somewhat
      contrary to our objective, which is to enable semantically rich
      document descriptions. In this case, the user can choose to
      accept this, leave the document for later examination or cancel
      the save.

     • A set of documents are all classified according to the exact same
      concepts. This represents a non-discriminating classification that
      will retrieve all these documents for a matching query. Again, the
      user is given the same options as above (accept, leave for later
      and cancel).

Figure 7.3
The CnS client in classification mode: view of a single document.
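
The three save-time checks listed above can be sketched as follows. This is a hedged illustration in Python, not the Java client's code. The data layout, the rule names and the function name are hypothetical. Two readings of the text are assumed: the minimal-classification warning fires when a document has no more than `min_concepts` selected concepts, and the non-discrimination warning fires for any group of two or more documents with identical selections.

```python
def check_classifications(docs, min_concepts=1):
    """Return a list of (rule, document ids) warnings before saving.

    `docs` maps a document id to {"selected": set, "pending": set}, where
    "pending" holds suggestions that were neither accepted nor rejected.
    """
    warnings = []

    # Rule 1: unprocessed suggestions remain.
    pending = sorted(d for d, c in docs.items() if c["pending"])
    if pending:
        warnings.append(("unprocessed_suggestions", pending))

    # Rule 2: minimal classification (assumed threshold semantics, see lead-in).
    sparse = sorted(d for d, c in docs.items() if len(c["selected"]) <= min_concepts)
    if sparse:
        warnings.append(("minimal_classification", sparse))

    # Rule 3: several documents share exactly the same set of concepts.
    groups = {}
    for doc_id, c in docs.items():
        groups.setdefault(frozenset(c["selected"]), []).append(doc_id)
    for concepts, members in groups.items():
        if concepts and len(members) > 1:
            warnings.append(("identical_classification", sorted(members)))

    return warnings


docs = {
    "a.html": {"selected": {"archiving", "coordination"}, "pending": set()},
    "b.html": {"selected": {"archiving", "coordination"}, "pending": {"authoring"}},
    "c.html": {"selected": {"publishing"}, "pending": set()},
}
warnings = check_classifications(docs)
```

Each warning would then be presented with the options named in the text: accept, leave the document(s) for later, or cancel the save.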

7.4.3.     Retrieval
In the retrieval mode, the user may start with a string representation of
the query and then refine the query by interacting with the model.
Search results are computed continuously during model interaction.
Figure 7.4 illustrates a search, initially expressed with a single model
concept (collaboration process). By default, query expressions are
expanded along the hierarchical abstraction mechanisms of the model
(here through following the aggregation and generalization relations)
illustrated by a lighter shade of green. The default expansion may be
overridden simply by clicking on the abstraction symbol. Also in search
mode, the blue dots are used to indicate the number of hits for a
concept.

Figure 7.4
The CnS client in search mode. The initial selected concept (“Collaboration Process”) is expanded through
the abstraction constructs of the model. The result set of documents is shown to the right. Blue dots
indicate the number of documents that matched a concept.

Figure 7.5 shows a manual refinement of the initial query. Two specific
concepts are selected (coordination and archiving) and again a selection
over an abstraction mechanism is expanded by default. Blue dots are
also shown for concepts not included in the selection, in order to
illustrate which concepts will return more documents. Figure 7.5 also
illustrates the use of “not”: concepts marked in red (“coordination”) are
not desired. For these too, the default expansions are used, again marked
with a lighter colour.
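
The default expansion and the use of “not” described above can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the client's implementation. The `children` hierarchy is a made-up fragment, not the actual Company N model. Selected concepts are assumed to be combined with AND, while a concept and its expansions are combined with OR.

```python
def expand(concept, children):
    """Return the concept plus everything below it in the abstraction hierarchy."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return seen


def search(wanted, unwanted, children, classifications):
    """Return documents matching every wanted concept and no unwanted concept.

    A document matches a concept if it is classified by the concept itself or
    by any concept reached through aggregation/generalization expansion.
    """
    results = []
    for doc, concepts in classifications.items():
        if any(concepts & expand(c, children) for c in unwanted):
            continue
        if all(concepts & expand(c, children) for c in wanted):
            results.append(doc)
    return sorted(results)


# Hypothetical model fragment: parent concept -> aggregated/specialized concepts.
children = {
    "collaboration process": ["coordination", "production"],
    "production": ["archiving", "authoring", "publishing"],
}
classifications = {
    "d1": {"archiving"},
    "d2": {"coordination"},
    "d3": {"authoring", "coordination"},
    "d4": {"budgeting"},
}
broad = search(["collaboration process"], [], children, classifications)
refined = search(["production"], ["coordination"], children, classifications)
```

Overriding the default expansion for a concept would correspond to matching on the concept alone instead of on `expand(concept, children)`.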
The retrieved documents are presented as a list within the
client. Clicking on a retrieved document will present this document
within an ordinary web-browser, optionally using the enhanced document
reader.

7.5. Linguistic Workbench
The linguistic workbench is the set of tools that we have developed to
support our document analysis process. The actual document
analysis process is presented in chapter 6. As mentioned, and as
pointed out in the approaches presented in (Gomez-Perez & Macho,
2003), text analysis for automated thesaurus or ontology construction is
a hard task. Most of the state-of-the-art approaches rely on a
combination of linguistic and statistical analyses. Among the reasons
that make this difficult is the immense variation in language and
documents, both within a given domain and between domains:

     • Documents vary in length, style and structure. In our example
      from Company N, the Notes database contains documents that
      range from small, unstructured memos of half a page, to highly
      structured requirements specifications of hundreds of pages.

     • The variation in language is high, even within a specific domain.
      Some domains have highly professional languages with high
      numbers of domain-specific terms. Highly specialised terms,
      abbreviations and specific names will not be available in common
      dictionaries. If complex, they are prone to misspellings and
      variations in use. A good example of this is the medical
      domain, where the medical record for one patient will contain
      popularised diagnosis names mixed in with the proper
      medical terms and even the Latin terms. In our example from
      Company N, some documents were bilingual (Norwegian and
      English), sometimes the same document would be written in two
      versions (one for each language), while the majority of documents
      were written in Norwegian only.

Figure 7.5
Query refinement. Selecting more specific concepts (left) and non-selection (right).

Based on these, two major requirements for our analysis toolset are:

   • Flexibility: For a given setting, we need to be able to select and
    configure the adequate techniques that should be applied to the
    documents at hand. We should also be able to perform several
    different analysis steps and compare and aggregate the results.

   • Extendability: We must be able to add new techniques into the
    analysis process and effortlessly substitute components in the
Furthermore, we want the configuration and execution of analysis
jobs to be performed in a user-friendly manner.
As a result of the above, we have developed the following set of tools to
support document analysis and definition of the domain model lexicon:

   • A web-based interface and a controller for configuring analysis jobs
    as a sequence of analysis components. The interface allows users
    to select components for a job, specify the sequence of
    components and to configure each component according to a set
    of predefined parameters. The controller takes care of the actual
    execution of jobs and the data-flow between components. A
    template for defining components in order to ensure interaction
    with the controller is specified, along with an XML based data
    structure for storing the results.

   • A query interface for querying and extracting information from the
    result of an analysis job. Not all analyses can be performed within
    a sequential job and without user interaction. The query interface
    is defined to enable examination of temporary results, to
    construct a subset of the results for further analysis and to
    extract information from the result set that can be added to the
    domain model lexicon.

   • An interface for working with a domain model and defining the
    domain model lexicon.

Figure 7.6
The linguistic workbench: controller and component architecture.

The web-interface, the controller and the analysis components are
implemented and tested on a large set of documents. The interface for
defining the domain model lexicon is implemented and integrated within
the CnS client. The query interface is at a prototype stage and is still to
be further developed.

7.5.1.    Controller and component design
The architecture of the controller and components is presented in
figure 7.6. Figure 7.7 shows the web-based user interface for defining a
specific analysis process.
Users define and configure jobs in a regular web browser. The controller
stores and manages the configuration of jobs in a local database. A job
is a sequence of steps where each step will be executed by a
component. Execution of a component is performed through an XML-
RPC interface.
Components can be implemented in any programming language and
can be executed on any networked computer, as long as the XML-RPC
interface is supported. Components must support a standard interface
for describing their possible parameters, indicating dependencies on other
components (an example of this is the phrase detection, which is
dependent on the part-of-speech tagging) and receiving parameter
settings and specification of input and output. A component template
(i.e. a generic component class) is defined for this purpose. Currently,
the following components are implemented:

     • DOXML Generation (HTML extractor): Extracts meta-data from
      HTML head and meta-tags. The extractor detects document
      structure by way of heading, paragraph tags and sentence
      boundaries, and creates the DOXML result document.

     • Part-of-speech tagger: The POS tagger designates a word-class to
      each of the words in a sentence.

     • Phrase detector: The phrase detector uses the POS tags in order
      to detect noun and verb-phrases.

     • Lemmatiser: The lemmatiser uses a lexicon in order to detect
      possible lemmas for each word in a sentence.

     • Weirdness analysis: The weirdness component counts term and
      phrase occurrences and calculates a weirdness measure using
      Ahmad's weirdness coefficient (Ahmad, 1994) for a term in the
      current document collection, compared to its occurrences in the
      contrast collection.

     • Correlation analysis: The correlation analysis detects
      co-occurrences of terms and phrases within a document, paragraph
      or sentence, and for each term or phrase lists the n most likely
      terms (or phrases) that seem correlated to it, by way of a
      term-document vector comparison.
Each component is defined in detail in chapter 6 – “the document
analysis process.”
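
To make the controller and component design concrete, the following minimal Python sketch shows a job as a sequence of configured steps with a dependency check. It is an assumption-laden illustration: the `Component` class, `validate_job` and `run_job` are hypothetical names, the two toy components stand in for the real tagger and phrase detector, and components are called in-process here, whereas the actual controller invokes them through XML-RPC.

```python
class Component:
    """Hypothetical stand-in for the generic component template."""

    def __init__(self, name, run, depends_on=()):
        self.name = name
        self.run = run  # callable(data, params) -> data
        self.depends_on = tuple(depends_on)


def validate_job(steps):
    """Raise ValueError if a step is placed before a component it depends on."""
    done = set()
    for component, _params in steps:
        missing = [d for d in component.depends_on if d not in done]
        if missing:
            raise ValueError(f"{component.name} requires {missing} earlier in the job")
        done.add(component.name)


def run_job(steps, data):
    """Execute the steps in sequence, feeding each result to the next step."""
    validate_job(steps)
    for component, params in steps:
        data = component.run(data, params)
    return data


def toy_tag(words, params):
    # Toy tagger: capitalised words are tagged "noun", all others "other".
    return [(w, "noun" if w[0].isupper() else "other") for w in words]


def toy_phrases(tagged, params):
    # Toy phrase detector: keep the words carrying the configured tag.
    return [w for w, tag in tagged if tag == params["tag"]]


tagger = Component("pos_tagger", toy_tag)
phrases = Component("phrase_detector", toy_phrases, depends_on=["pos_tagger"])
job = [(tagger, {}), (phrases, {"tag": "noun"})]
output = run_job(job, ["Clinical", "examination", "of", "Patients"])
```

Keeping the dependency declaration on the component lets the web interface reject an invalid step order before any component is executed.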
The document collection(s) to be analysed are stored as regular files.
Components must have direct access to the files through the file system.
Results are stored in an XML file in a format named DOXML. The
DOXML file format is designed so that all results from each job step can be
stored sequentially and so that components can add information to this file
by building on the previous results. Figure 7.8 shows an example
DOXML file from an analysis of Norwegian documents in the medical
domain. The figure exemplifies how subsequent components either add
information within the sections created by a previous component (the
part-of-speech tagger and the lemmatiser both add information to the
sections created by the sentence and word recogniser), or add their own
sections (the phrase detection adds a new section after each sentence).

Figure 7.7
The linguistic workbench: web-based user interface. Defining an analysis job as a sequence of process
steps. Labelled controls: change order of steps, edit parameters.

The verboseness of the XML structure combined with the size of a
document collection implies that the DOXML result file becomes quite
large. As a result, all analysis of the file is performed sequentially, since
it has been impossible to keep the entire data structure of the result in
memory simultaneously.
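
One common way to process such a file sequentially is an incremental parser. The sketch below uses `iterparse` from Python's standard `xml.etree.ElementTree` module to count word forms one sentence at a time. It is an illustration only: the sample is a simplified, well-formed fragment modelled on the element and attribute names in figure 7.8, and `count_words` is a hypothetical helper, not part of the workbench.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = """<doxmlDocumentCollection>
  <document filename="a.txt">
    <s><w pos="0" v="klinisk"/><w pos="1" v="undersøkelse"/></s>
    <s><w pos="0" v="klinisk"/></s>
  </document>
</doxmlDocumentCollection>"""


def count_words(source):
    """Count word forms sentence by sentence without building the whole tree."""
    counts = Counter()
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "s":
            for word in elem.iter("w"):
                counts[word.get("v")] += 1
            elem.clear()  # drop the sentence's children to keep memory low
    return counts


counts = count_words(io.BytesIO(SAMPLE.encode("utf-8")))
```

The same function accepts a file name or an open file, so a large result file on disk can be scanned in one pass.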

7.5.2.   Result query interface
Not all analyses can be performed in an automatic sequence of job
steps. In some cases users may want to examine results and create
datasets containing subsets of the results for further work. Also, users
must be able to inspect the final results and extract information from
this to add to the model or domain model lexicon. For this, we are
defining a result set query interface.
Since the result document is stored in XML, the query interface exploits
standard XML techniques such as XPath (W3C-Xpath, 1999) based
queries for detecting patterns in the XML document and XSL
transformations, XSLT (W3C-Xslt, 1999), for transformation and
presentation of (parts of) the results. Currently, the interface simply
supports searching for terms within the result document with the
purpose of either detecting variants of the search term or a subset of
results that matches the search term.
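
As a small illustration of path-based querying over a result document, the sketch below uses the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module. The sample is a simplified, well-formed fragment that borrows the element names of figure 7.8. It is not the prototype's query interface, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<doxmlDocumentCollection>
  <document filename="a.txt">
    <s>
      <w pos="0" v="klinisk"><stem>klinisk</stem></w>
      <w pos="1" v="undersøkelse"><stem>undersøkelse</stem></w>
      <ph><np pos="0" v="klinisk undersøkelse"/></ph>
    </s>
    <s>
      <w pos="0" v="undersøkelsen"><stem>undersøkelse</stem></w>
    </s>
  </document>
</doxmlDocumentCollection>"""

root = ET.fromstring(SAMPLE)

# All noun phrases detected in any sentence of any document.
noun_phrases = [np.get("v") for np in root.findall("./document/s/ph/np")]

# All word forms whose stem is "undersøkelse", in document order.
forms = [w.get("v") for w in root.findall(".//w[stem='undersøkelse']")]
```

Querying on the stem rather than the surface form retrieves the inflected variants that the lemmatiser has already grouped.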
The detection of variants is important in order to handle morphological
variation in cases where this is not detected by the lemmatisation. In
order to detect variation, a set of matching strategies is supported:
approximate matching (by way of the Levenshtein minimum edit
distance algorithm), prefix matching and postfix matching. All these are
implemented simply through standard regular expressions. Figure 7.9
shows the simple result query architecture and our prototype interface.
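
The three matching strategies can be illustrated as follows. Note that this sketch does not follow the prototype's regular-expression implementation: it uses a plain dynamic-programming edit distance and ordinary string methods. The function names, the distance threshold and the affix length are hypothetical parameters chosen for the example.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def find_variants(term, vocabulary, max_distance=2, affix_len=5):
    """Label each candidate variant with the first strategy that matches it."""
    variants = {}
    for word in vocabulary:
        if word == term:
            continue
        if levenshtein(term, word) <= max_distance:
            variants[word] = "approximate"
        elif word.startswith(term[:affix_len]):
            variants[word] = "prefix"
        elif word.endswith(term[-affix_len:]):
            variants[word] = "postfix"
    return variants


vocabulary = ["undersøkelser", "undersøkelsen", "undersøking",
              "forundersøkelse", "klinisk"]
variants = find_variants("undersøkelse", vocabulary)
```

Prefix and postfix matching catch compounds and derivations that lie too far from the search term for a small edit distance, which is why the strategies complement each other.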

                                              Figure 7.8
<?xml version="1.0" encoding="UTF-8"?>                              DOXML header
<doxmlDocumentCollection>                                           Start of document collection
     <document filename="c:\doc\helseskole\A98-IK2611.ans">>        Start of a document file
       <s>                                                          Start of sentence
         <w pos="0" v="3">                                          Word found
           <tag>tall                                                Part of speech tag
           </tag>                                                   (in this case – a number (“Tall”)
         <w pos="1" v="begrunnelser">                               New word
           <stem>begrunnelse                                        The stem of a word
           </stem>                                                  (result of the lemmatizer)
         <w pos="3" v="klinisk">
         <w pos="4" v="unders&#248;kelse">
         <ph>                                                       Detected phrases are listed after each
             <np pos="3" v="klinisk unders&#248;kelse">             A noun phrase (Clinical examination)
               <npstem>klinisk unders&#248;kelse                    The stem of the noun phrase
               </npstem>                                            (i.e. the stem of each term in the
         </s>                                                       End of sentence
        </document>                                                 End of document
         <rank class="weirdness"                                    Start of weirdness results. Weirdness
ref_coll="file:/home/a/5/brase/weirdness/doxml/nou.xml">            analysis needs a contrast collection.
        <rf w="kommisjon" freq="1"                                  The weirdness count for a word: The
                      nfreq="8.301993E-6">0.0058156564</rf>         word “commission” has an absolute
                                                                    frequency of 1, a normalised
                                                                    frequency of 0.0000083, and a
                                                                    weirdness coefficient of 0.0058.
           <rf w="klinisk unders&#248;kelse" freq="20"              The weirdness count for the above
                          nfreq="1.6603987E-4">24.721827</rf>       detected phrase “clinical examination”
           <rf w="helseorganisasjon" freq="86"                      Weirdness coefficients with value
                        nfreq="7.1397144E-4">Infinity</rf>          “infinity” represent a term or phrase
                                                                    that does not exist in the contrast
</doxmlDocumentCollection>                                          End of document collection

                                 Example DOXML document analysis result.
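
The weirdness values in the figure are consistent with the usual formulation of the coefficient as the ratio between a term's normalised frequency in the analysed collection and its normalised frequency in the contrast collection. The Python sketch below assumes that formulation. The collection size of 120453 tokens is inferred from the freq and nfreq values in the figure, while the contrast collection counts are hypothetical, since the figure does not give them.

```python
import math


def weirdness(freq, total, contrast_freq, contrast_total):
    """Ratio of normalised frequencies; infinite if the contrast lacks the term."""
    if contrast_freq == 0:
        return math.inf
    return (freq / total) / (contrast_freq / contrast_total)


COLLECTION_SIZE = 120453  # inferred from the figure: 1 / 8.301993E-6

# Normalised frequency of "klinisk undersøkelse" (freq="20" in the figure).
nfreq = 20 / COLLECTION_SIZE

# A term absent from the contrast collection, like "helseorganisasjon".
absent = weirdness(86, COLLECTION_SIZE, 0, 1_000_000)

# Made-up counts: a term twice as frequent here as in the contrast collection.
doubled = weirdness(2, 100, 1, 100)
```

Values well above 1 indicate terms that are unusually frequent in the domain collection and are therefore candidates for the domain model lexicon.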

7.5.3.   Interface for defining the domain model lexicon
The final tool in our portfolio is the interface for defining the Domain
model lexicon. This is implemented in a separate window, integrated
within the CnS client. The interface is simply accessed by right-clicking a
concept in the domain model view. The definition of the lexicon is a
manual process, but information may be extracted from existing
thesauri and lexicons. Our interface implements the DICT protocol for
querying and matching lexical resources. This means that all lexicon
resources we need to access must be converted into a DICT-readable
format. The benefit of using DICT is that it provides standard dictionary
server implementations that include query and matching algorithms
and support a rather simple and flexible storage format for lexicon
resources. DICT implementations offer access to a set of available online
dictionaries, including the WordNet ontology for English words.
Currently, we have converted the Norkompleks Norwegian lexicon
(Norkompleks, 2000) into DICT format.
The interface supports manual pure-text definition of a concept and the
inclusion of term lists for each concept. Figure 7.10 shows the simple
architecture of the DICT based implementation and a screendump from
the CnS client. The screendump shows how DICT queries are used both
to look up definitions from online dictionaries and to extract word-forms
from a lexicon.

Figure 7.9
The Query Interface component architecture. The prototype query interface.

Figure 7.10
The DICT Interface component architecture. The CnS user interface for manual entry of definition, term lists and wordforms, as well as DICT lookup.

7.6. Summing up
We have presented the realisation of our approach. The implementation
comprises a number of components that cover most of the specified
approach. Some parts of the implementation have been through a series
of iterated implementations, in particular the main component, the CnS
client interface. However, the status of most of these components is still
at a prototype level and not considered directly applicable for large
scale or production settings.
For our purposes, the interesting issue is how to devise an evaluation of
the approach based on the existing implementation. The experiments
and quality of the linguistic analysis parts were discussed in the
previous chapter (section 6.4). The main part of the contribution is
however the model-based interface for classification and retrieval. The
evaluation of this will be based on the existing CnS client and is the
topic of the next chapter.

                                                    8. Evaluating the approach

This chapter presents the evaluation of our approach. The evaluation is
performed on a real but limited case from the Norwegian public health
care sector, selected from current terminology work at KITH. The main
research goal of this thesis was formulated as:
    Research goal: To investigate if a semantic modelling language can be used directly in a
    tool that will assist the users in their semantic classification and retrieval of documents.

We have specified a methodology for using the Referent Model Language
(RML) to define the vocabulary of a particular domain and to classify
and retrieve documents. While the definition of a domain model with
relations to a domain-specific document collection may have several
uses in itself, the main test of our contribution will be to investigate
the effect on document retrieval quality provided by the model-based
query interface.
A priori, the assumed effects of the model-based query interface are:

   • For a given information need, the domain model will assist users
     in defining precise domain-specific queries and retrieving more
     relevant documents.

   • A linguistically enhanced model will reduce the language gap
     between query and documents experienced in purely syntactic
     retrieval systems.
The evaluation is designed to evaluate the effects on query formulation
achieved by the model-based interface, not the quality of the underlying
retrieval systems.
For this we have defined a user-based comparative relevance evaluation:

   • A set of test users is presented with a number of query topics
     and asked to formulate and execute both model-based and
     regular text-only queries. Both the model-based and the regular
     queries are executed against two search engines used for comparison.

   • Then, for each query result (both model-based and regular), the
     users have to evaluate and rank the top N retrieved documents for
     relevance, according to their interpretation of the query.

   • Finally, query scores are computed and compared for each
     query and for each search strategy and engine.

8.1. Aspects of retrieval evaluation
Traditionally IR approaches are evaluated according to performance
criteria - system performance evaluations and retrieval performance
evaluations (Baeza-Yates & Ribeiro-Neto, 1999). System performance
evaluations consider metrics such as response time, index size and
storage efficiency. Retrieval performance evaluations are normally
focused on the classic precision and recall measures.
System performance criteria are not relevant for our trial. First, our
implementation is a prototype not implemented with system
performance criteria in mind. Second, our approach aims to be applied
with existing retrieval machinery, and such system aspects would rather
give an evaluation of the underlying machinery than of the approach.
Precision and recall measures are designed to measure the set of
relevant documents in the answer set compared with the set of relevant
documents in the whole collection. Two obvious problems with these
measures are that relevance is subjective and that the set of
relevant documents in a collection with respect to any given query is
generally not known. In standard retrieval tests such as the TREC
series, expert judges review the selected set of documents in order to
define the relevant ones for each predefined query.
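As a reminder of how the two classic measures are computed, the following is a minimal sketch in Python. The document identifiers and the relevance judgements are invented for illustration; in a TREC-style test the relevant set would come from the expert judges.

```python
def precision_recall(answer_set, relevant_set):
    """Return (precision, recall) for a single query.

    precision = relevant documents retrieved / documents retrieved
    recall    = relevant documents retrieved / relevant documents in the collection
    """
    answer_set, relevant_set = set(answer_set), set(relevant_set)
    hits = answer_set & relevant_set
    precision = len(hits) / len(answer_set) if answer_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical judgements for one query (not data from the trial).
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d9", "d10"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Note that the recall value depends on knowing the full relevant set, which is exactly the information that is unavailable for the public search engines used in our trial.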
Our approach to retrieval is user focused in the sense that the novelty
with respect to retrieval is the model-based interface for query
formulation, expansion and refinement. From our perspective, we have
to select an evaluation approach that records relevance from a user
point of view. User centred relevance measures are of course subjective
and hence difficult in the sense that a document considered relevant for
one user, may seem irrelevant to another for the same query.
Furthermore, one user's relevance considerations could even change
with time, as his knowledge of the domain evolves.
Following the discussion above, we define the following measures to be
applied in our trial:

      • Perceived relevance: Perceived relevance is defined as a user's
        subjective relevance measure for a document with respect to a
        given query. Users will be asked to score documents based on
        their immediate perception of their relevance to the query.

      • Comparative relevance: Comparative relevance is calculated by
        executing the same queries against multiple search engines,
        scoring the results in the same way and then comparing the
        resulting query scores.


194                                                    Evaluating the approach
Evaluation approaches based on user scoring of documents and on
perceived and comparative relevance measures are common today for
commercial search engines (Gulla, 2003; Hawking et al., 2001).

8.2. Evaluation approach
Figure 8.1 presents an overview of the evaluation approach. A real
domain with an existing domain model is selected. For the experiment,
we have selected the domain of “Helseskoletjenesten”, one of the
domains currently defined by KITH. The terminology document
produced by KITH is translated into a Referent Model and the model is
enhanced by terms detected by the linguistic analysis process as
described in section 6.4.
   • A set of test users is presented with a set of information needs,
     defined in natural language but not formulated as queries.
   • Two existing search engines containing documents from the domain
     are selected. The model-based query interface is implemented to
     post its expanded queries to these search engines and to retrieve,
     re-rank and present the results according to the domain model.
   • For both search engines, the test users have to formulate each query
     both as a regular query for that search engine and as a model-based
     query. If desired, users may refine their queries up to two times. For
     both queries, the test users have to assess the top N returned
     documents for relevance. A query score is computed per query.
   • Comparative relevance is measured by comparing the query scores
     for each query for each search engine.
   • At the end of the evaluation, the test users are interviewed in
     order to obtain general feedback on the system as well as to discuss
     the computed results.

Figure 8.1: Outline of the evaluation process.

8.2.1. Scope
The scope of this evaluation is limited to the following:
      • Referent Modelling: We do not create a Referent Model from scratch.
        As the domain is already partially modelled and the terminology
        defined by KITH, we translate the terminology document into a
        Referent Model and then enrich this model with term lists extracted
        by the linguistic analysis (Chapter 6).
      • Query execution: The prototype classification and retrieval
        machinery is deemed too incomplete and unstable for the user-based
        evaluation. For the evaluation, this machinery is replaced by an
        interface to the two retrieval engines that are used for comparison.
        The expanded model-based query is executed as a set of separate
        queries to the underlying search engines, and the total set of
        retrieved documents is re-ranked and presented according to the
        model, as specified in the original CnS System (section 5.4 and
        section 7.4.3).
      • User Query Evaluation: Based on the domain model, we construct a
        set of query topics. Users have to interpret the topics in order to
        formulate queries. For each query, users will score the 10 highest
        ranked retrieved documents on a simple point scale, and a total
        query score will be computed per user per query.
      • Query Scoring and Comparison: The same queries are formulated
        and executed 4 times [72] by each user:
        1. Using the model-based interface and applying the ODIN
           governmental search engine as the underlying retrieval system.
        2. Against the original search interface offered by the governmental
           web service (ODIN).
        3. Using the model-based query interface and applying the Alltheweb
           internet search engine as the underlying retrieval system.
        4. Against the original Alltheweb web-search interface.
        For each of the 4 executions, users will score the top 10 returned
        documents. If the initial query formulation does not produce any
        relevant or satisfactory documents, users are allowed (but not
        required) to reformulate the query up to two times. This gives
        each user the task of formulating 16 queries 2 times (text-only
        and model-based) and possibly reformulating each query 2 times
        more, giving a maximum of 96 possible query formulations and
        correspondingly 960 documents to score.
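
The upper bounds quoted above can be reproduced with a few lines of arithmetic. This is only a sketch of how the figures appear to be derived: the constant names are illustrative, and it assumes that each formulation is counted once and that its top 10 results are scored once.

```python
TOPICS = 16         # query topics in the trial
STYLES = 2          # text-only and model-based formulation
MAX_ATTEMPTS = 3    # initial formulation plus up to two reformulations
DOCS_SCORED = 10    # top 10 documents scored per formulation

min_formulations = TOPICS * STYLES                 # no reformulation at all
max_formulations = TOPICS * STYLES * MAX_ATTEMPTS  # every query refined twice
max_documents = max_formulations * DOCS_SCORED

print(min_formulations, max_formulations, max_documents)
```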

[72] Each query is not necessarily formulated 4 times: the query formulation for the two model-based queries
(1 and 3 above) and for the regular, text-based queries (2 and 4) will essentially be the same. On the other
hand, the users are allowed to refine each query (for each engine) up to 2 times.

8.2.2. Limitations
Observe that:

   • We do not test the effects of model quality on retrieval. The model
     is defined by KITH as domain experts. The goal of the evaluation
     has been to evaluate the effects of the model-based interface on a
     real model from a real domain.

   • The model is a real model from the domain, but it does not
     necessarily reflect the document collections contained in the two
     search engines. For the evaluation, we have selected two
     publicly available search engines and have no control over the
     documents they contain. As mentioned in section 6.4, KITH built
     their terminology document based on a text analysis, but of their
     own internal document collection.

   • The classification part of the system is not tested. This could have
     been performed by downloading all relevant documents from
     these two engines and then classifying them according to the
     model. This is what now takes place for each query, but we
     rely only on the automatic classification provided by our
     system, what we have previously denoted “suggestions”. No
     manual classification is performed, as this would increase the
     workload of the test users significantly. Also, by using our own
     prototype classification system, we would have risked skewing the
     results due to the differences in the quality of the indexing.

   • Using an existing model rather than a model the users have taken
     part in defining, as well as only using the model for one half of the
     classification-retrieval cycle, means that we cannot test the
     model’s ability to solve the “language problem” in retrieval.
     Furthermore, our available test users are not real users from this
     domain. The decision to use an existing model from a given
     domain that we have worked with is motivated by the desire to
     have a real-world test domain.

8.3. The Retrieval trial – set up and preparations
This section presents the actual set-up and preparations for the
retrieval trial.

8.3.1. Comparative search-engines
As comparative search engines, we have selected two publicly available
search engines: ODIN and Alltheweb.

ODIN – Offentlig Dokumentasjon i Norge (Public Documentation in
Norway) – is a governmental web portal with public documents from the
government and each governmental department. Documents published
at ODIN vary from short press releases and proposals from the departments
to longer studies and reports. The ODIN search interface has two modes,
“simple” and “advanced”, and supports a rich query language with
phrasing, Boolean search expressions, and structure and meta-data
search (e.g. title:"<title>", publisher:"<department>"). ODIN is included
in the test since it is a major source of public information in Norway.
ODIN is also the subject of experiments currently performed at KITH
with improved search based on semantic networks generated from the
terminology documents [74].
The Alltheweb web-search interface is the public search interface of
FAST (Fast Search and Transfer) [75] and is one of the biggest search
engines on the market and one of Google’s closest rivals. Alltheweb
applies a number of linguistic techniques that separate it from
traditional search engines: language identification, query normalization,
phrasing, anti-phrasing and clustering (Gulla et al., 2002). Comparative
studies have shown that Alltheweb performs well on topic relevance and
homepage finding (Hawking et al., 2001).

8.3.2. CnS Client set-up, modifications
In order to simplify the users' tasks in formulating and executing queries
to both search engines and then comparing the results, the CnS Client
search interface was slightly modified (figure 8.2). Modifications from
the original interface (section 7.4.3) are:

     • The field for entering a text-only search is now used to enter the
       regular text-only search to be sent directly to the underlying
       search engine. Thus, the original CnS-client ability to start from a
       text search and refine this in the model is removed. This also
       removes the possibility of entering free text to accompany a
       model-based search.

     • Users have to select which search engine to send their searches to
       from a drop-down menu. Furthermore, users explicitly have to
       click on the search button to have their search executed.

[74] These experiments have so far resulted in the Volven experimental search interface:
[75] During the summer of 2003, the web-search part of FAST was acquired by Overture, which subsequently
was acquired by Yahoo.

Figure 8.2: Modified CnS interface

      • The first item in the result list, marked “[Search results]”, is now a
        pointer to the filtered result list produced by executing the
        text-only search in the search engine.

      • Documents are presented from their original URL in a regular
        web browser. No enhanced document reader or marking of search
        matches (neither concepts nor words) in documents was applied.
The CnSClient search-engine interface was implemented specifically for
this trial by a master student (Fidjestøl, 2003). If both a model search
and a text search are supplied in the client, the search-engine interface
(SEI) receives both from the CnS-client, encoded in an XML file. A
model-based search is processed as follows:
       1. All concepts in the model-based search expression are expanded by
          adding all terms for each concept from the domain model lexicon
          as an OR combination. No weighting between the concept name
          and the various word forms is performed.
       2. The complete model-based search expression is translated into a
          search expression for the underlying search engine. Both the
          ODIN and Alltheweb search engines support Boolean search
          formulation, but with different syntax.
       3. The query is executed in the desired search engine, the regular
          search-engine list of results is retrieved, and the URLs of the
          result documents are extracted. Documents in formats other than
          plain text and HTML are discarded. The top 20 remaining
          documents (by its original design, the CnS-client asks for 20
          documents at a time) are downloaded and matched against the
          model by way of the domain model lexicon. Documents are then
          ranked according to the number of domain model concepts they
          contain. If several documents contain an equal number of
          concepts, their internal ranking is kept as returned from the
          search engine. Search results are then returned to the client and
          presented as in the original implementation.
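
Steps 1 and 2 above can be sketched as follows. This is a minimal illustration and not the code of the search-engine interface itself: the function names, the lexicon entries (invented Norwegian word forms) and the operator strings are assumptions, as is the choice to combine the concepts of a search with AND. The actual ODIN and Alltheweb query syntaxes are not reproduced here.

```python
def expand_concept(concept, lexicon):
    """Return the concept name followed by its lexicon terms, without duplicates."""
    unique, seen = [], set()
    for term in [concept, *lexicon.get(concept, [])]:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            unique.append(term)
    return unique


def quote(term):
    """Quote multi-word terms so that they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def to_boolean_query(concepts, lexicon, and_op=" AND ", or_op=" OR "):
    """Expand each concept into an unweighted OR group and join the groups.

    The operators are parameters because the engines differ in syntax.
    Joining the groups with AND is an assumption made for this sketch.
    """
    groups = []
    for concept in concepts:
        terms = expand_concept(concept, lexicon)
        groups.append("(" + or_op.join(quote(t) for t in terms) + ")")
    return and_op.join(groups)


# Hypothetical lexicon: concept name -> word forms found by linguistic analysis.
LEXICON = {
    "fastlege": ["fastlegen", "fastleger"],
    "konsultasjon": ["konsultasjoner", "konsultasjonen"],
}

query = to_boolean_query(["fastlege", "konsultasjon"], LEXICON)
print(query)
```

Passing different `and_op` and `or_op` strings is one simple way to target an engine with another Boolean syntax, which is the concern of step 2.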
Text-only searches submitted via the CnS-client are sent unmodified
to the underlying search engine. Again, documents in formats other
than plain text and HTML are removed from the result list, in order to
keep the searches comparable with the model-based search.
For a text-only search, a new “result page” is then computed, and a
reference to this is displayed in the CnS-client.
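
The format filtering and the model-based re-ranking described above can be illustrated with a short sketch. The record layout, the MIME types and the substring matching are assumptions made for the example; the prototype matches documents through the domain model lexicon, which is cruder here than in the real system.

```python
ALLOWED_TYPES = {"text/plain", "text/html"}  # assumed labels for "plain text and HTML"


def count_concepts(text, lexicon):
    """Count the distinct model concepts that have at least one term in the text."""
    lowered = text.lower()
    return sum(
        1
        for concept, terms in lexicon.items()
        if any(term.lower() in lowered for term in [concept, *terms])
    )


def filter_and_rerank(results, lexicon, limit=20):
    """Drop other formats, keep the top `limit` hits, rank by concept count.

    `results` is a list of dicts in the order returned by the search engine.
    sorted() is stable, also with reverse=True, so documents with an equal
    number of concepts keep the engine's internal ranking.
    """
    usable = [r for r in results if r["content_type"] in ALLOWED_TYPES]
    top = usable[:limit]
    return sorted(top, key=lambda r: count_concepts(r["text"], lexicon), reverse=True)


# Hypothetical lexicon and result list (not data from the trial).
lexicon = {"fastlege": ["fastlegen"], "konsultasjon": ["konsultasjoner"], "helsestasjon": []}
results = [
    {"url": "u1", "content_type": "text/html", "text": "Om helsestasjon"},
    {"url": "u2", "content_type": "application/pdf", "text": "fastlege og konsultasjon"},
    {"url": "u3", "content_type": "text/html", "text": "Fastlegen tilbyr konsultasjoner"},
    {"url": "u4", "content_type": "text/plain", "text": "Kontakt din fastlege"},
    {"url": "u5", "content_type": "text/html", "text": "Ingen treff her"},
]

ranked_urls = [r["url"] for r in filter_and_rerank(results, lexicon)]
print(ranked_urls)
```

In the example, the PDF document is discarded, the document matching two concepts moves to the top, and the two documents with one concept each stay in engine order.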
In order to examine experiment data, logging of queries was
implemented in the search-engine interface. Table 8.1 shows the
contents of the query log.

Table 8.1: Fields in the query log
    SID
    IP-addr
    Engine
    #doc returned from text (total)
    #doc returned from model (total)
    #documents overlap (top 10)
    #words in text search
    #concepts in model search
    #terms in expanded model search
    The text search string
    The concept names for included concepts

In order to work on comparable values, the log file was only used for the
queries which included both text and model searches. Most users
performed the test this way, but some users also experimented with the
regular Alltheweb and ODIN interfaces for the text-only queries.
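
To make the log structure concrete, the fields of Table 8.1 can be represented as a simple record type. The sketch below is hypothetical: the attribute names, the overlap helper and the sample values are ours and do not describe the actual log file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryLogEntry:
    """One row of the query log, with one attribute per field in Table 8.1."""
    sid: str
    ip_addr: str
    engine: str
    docs_from_text: int
    docs_from_model: int
    overlap_top10: int
    words_in_text_search: int
    concepts_in_model_search: int
    terms_in_expanded_search: int
    text_search: str
    concept_names: List[str] = field(default_factory=list)

    def is_comparable(self) -> bool:
        """True if the entry holds both a text search and a model search."""
        return bool(self.text_search.strip()) and bool(self.concept_names)


def top10_overlap(text_urls, model_urls):
    """Number of documents shared by the two top-10 result lists."""
    return len(set(text_urls[:10]) & set(model_urls[:10]))


# Invented sample values for illustration only.
text_urls = ["a", "b", "c"]
model_urls = ["c", "a", "x"]
entry = QueryLogEntry(
    sid="s1", ip_addr="10.0.0.1", engine="ODIN",
    docs_from_text=120, docs_from_model=85,
    overlap_top10=top10_overlap(text_urls, model_urls),
    words_in_text_search=2, concepts_in_model_search=1,
    terms_in_expanded_search=3,
    text_search="min fastlege", concept_names=["fastlege"],
)
print(entry.overlap_top10, entry.is_comparable())
```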

8.3.3. The domain model
Excerpts from the domain terminology document are shown in figure
8.3. The translation of this into the RML domain model we apply in the
experiment is shown in figure 8.4. In general, we may state that the
model is:
     • Fairly small: It only includes 27 concepts, 10 binary relations and
       4 main hierarchies (public health services, health personnel,
       journaling, contact).
     • Generic: In general, all concepts in the model are somewhat
       generic. The terminology document, while defining terms specific
       to the domain, tends to include higher level concepts from the
     • Shallow: None of the four major hierarchies in this model are
       deep, at least not compared to hierarchies found in general
       thesauri or web catalogues, such as Yahoo or the Mozilla Open
       Directory Project.
     • Few relations are included. All relations are unnamed. This is the
       way the terminology document is formed. While we have added
       some relations based on the definition texts and the “see also”
       cross references, we have not added names to the relations in the
       model.

Figure 8.3: Original domain model as developed by KITH. The KITH terminology document consists of tabular definitions
of each term, with “see-also” cross-references (top). For overview and clarity, some of the terms are also
presented in UML model fragments (bottom).

8.3.4. User selection and preparation
Evaluation users were invited from within our own research environment.
The evaluation task and the procedure were presented, demonstrated
and discussed at a common meeting for all interested users. Initially 15
potential subjects were available at the start-up meeting, but only 5
users completed the full trial.
All of the users are familiar with modelling and conduct research within
modelling and ontology settings. Some, but not all, users have had a
prior demonstration of the original CnS client; roughly half of the users
had never seen the tool before.
The only training given to users was a 1-hour presentation of the task,
followed by a general discussion and question answering session. A
walk-through demonstration of querying and ranking was given.
In addition, a short user manual with step-by-step instructions for the
evaluation task was made available on the web. Also, users were given
help in person, over telephone or through emails whenever they
needed it.

Figure 8.4: The RML version of the domain model. Translated from the model shown in figure 8.3 by connecting the
UML model fragments. The “connection” of the UML fragments was done by inferring binary relations for
some of the “see also” references in the definition texts.

None of the users are domain experts or particularly familiar with the
domain of public health services, apart from what they may have been
exposed to in private. Everybody with kids in kindergarten or elementary
school will have been in contact with the public health services,
municipal health stations or school health services. All inhabitants of
Norway are assigned to a primary doctor. Through this, all of the users
were familiar with some of the concepts in the model and thus parts of
the domain. In this sense, the users are outsiders, but are likely to know
something about the domain.
However, as the queries were defined in advance by model examination
and without user participation, the users did not share the information
need presented by a query. Apart from by chance, the queries did not
have any relevance for the users, and relevance assessments were hence
not motivated by a genuine need for this information. This represents a
breach of one of the principles for retrieval evaluation put forth by
Gordon and Pathak (1999). This principle is, however, relaxed in
Hawking et al. (2000) and Voorhees (1998).
Half of the participants are non-native Norwegian speakers, but all users
that completed the task are capable readers and speakers of Norwegian,
without any problems reading and evaluating documents for
relevance. Written language may be harder than spoken, and this may
have affected the formulation of the regular text-only queries.
No resources were available for rewarding or paying users for this
evaluation. Users were simply volunteers, and the only reward given for
participation was a free private lunch.

8.3.5. The Query set
The set of queries for our evaluation was constructed based on an
examination of the KITH terminology document that was the basis for
the domain model, and on experimental queries using the two public
search engines. A total of 16 queries were defined, after pruning
and modifying an initial set of 25 queries. The final queries were
screened for relevance by a medical doctor.
The initial experiments with query formulation, using the two reference
search engines, illustrated that the domain is sparse, i.e. that there are
few publicly available documents for the domain defined by the model.
The modification and pruning of the query set was influenced by this, in
the sense that the topic and wording of the queries were modified in
order to try to ensure that searches would actually return a sufficient set
of documents.
Each query is described in a manner similar to the TREC query
specifications (table 8.2).
                                       Table 8.2
    Property             Description
    ID                   Numerical ID of each query (for logistical purposes only)
    Topic                Short topic description of the query
    Description          A more thorough description of the query topic. Possible follow-
                         up questions and directions.
                                    TREC query attributes

Because the predefined queries are formulated as topics rather than as
finished search expressions, the task of actually formulating the query
becomes part of the evaluation.
The final queries are presented in table 8.3. Only the short topic
description is shown here; the full descriptions are given in appendix B.

Query categorisation and grouping
In order to investigate results in some detail, the queries were grouped
in two different ways, related to their generality and model coverage.
The category and group of each query is shown in table 8.3.

Table 8.3: The queries
Query    Category   Group   Topic                              Translation
1        C4         G2      Min fastlege?                      “My primary doctor”
2        C2         G2      Helsetilbud til gravide?           “Health services for pregnant women”
3        C1         G1      Hva er en helsestasjon?            “What is a health station”
4        C1         G1      Kommunehelsetjenesten              “The municipal health service”
5        C4         G1      Min Pasientjournal                 “My medical record”
6        C3         G3      Helsetjenester for barn og unge?   “Health services for children and youth”
7        C3         G3      Pasient i konsultasjon med lege    “Patient and doctor consultations”
8        C1         G1      EPJ                                “Electronic medical record (EMR)”
9        C3         G3      EPJ system                         “EMR system”
10       C4         G2      Kan jeg ha med ledsager?           “Can I bring an accompanying person to the doctor?”
11       C3         G1      Fysioterapitilbud for ungdom       “Physiotherapy for youth”
12       C2         G3      Barns helsekort                    “Children's health card”
13       C2         G2      Konsultasjoner, fastlege           “Consultations, primary doctor”
14       C2         G1      Helsestasjon for ungdom            “Youth health station”
15       C4         G2      Kontakt fastlege                   “Contact, primary doctor”
16       C1         G1      Dokumentasjonsplikt i EPJ          “Duty of documentation in EMR”

This categorisation and grouping of queries was defined prior to the
test, but was not known to the test users. In general, query categorisation
is a vague science, since it depends heavily on the user's own
perception of the information need (Hawking et al., 2000). A common
distinction, which is also applied in TREC, is to classify information needs
according to the amount of desired information: “Question Answering”,
“Homepage finding” (or single page finding) and “document selection”
(“research purposes”). However, what is deemed sufficient is always up
to the user, and it is difficult to specify in advance. As an
example, Hawking et al. (2000) use home page finding, where a user
searches for the home page of a specific company, and several
home pages exist, in different languages or in different
countries/regions. Question answering searches would normally require
only a few sentences in response; however, users might like to have an
answer verified from a reliable source, or to have several answers in
order to compare certain angles of the question. Another issue is that
ranking and relevance measures will change across such categories. For
example, if the information need is one specific page, such as an airline
timetable, there is no meaning in computing a query-relevance score for
the top 10 retrieved documents.
In our task, the queries were designed with the categorisation in mind,
but since the queries were tuned and modified before arriving at the
final query set, this original distinction was blurred. An unfortunate
effect of the query modification was that many queries became quite
similar in both style and topic/wording.
The generic-specific dimension
The queries are categorised into 4 different categories, according to the
precision level of the information need (generic-specific) and the
expected number of documents that would be considered relevant
(1-many); see figure 8.5:
   • C1: These are queries that would be considered generic within this
     domain and where the user would be likely to accept a number of
     documents as relevant for the query. In other words, this is a rather
     vague query within this domain.
   • C2: Queries considered generic in topic, but where the user is
     expected to accept only 1 or a small set of documents as relevant.
     These are queries where the information need is formulated in a
     generic and vague manner, but the user actually is looking for a
     specific answer that may be found in only one or a small set of
     documents.
   • C3: Queries where the information need is precisely specified and
     where the user is looking for one particular document.
   • C4: Queries where the information need is precisely specified but
     where the user might accept several documents as relevant.

Figure 8.5: Query categories. Queries are categorised according to the generic-specific dimension and the expected
number of relevant documents (1, many). Our assumption is that the model-based queries will perform well
for the most generic documents (C1). Modelling activities (M) may make a model more specific, and increase
its performance along the specific dimension. Linguistic enhancements (L) may close the gap between
model concepts and document text, and may enable the model to perform better for queries that are
required to find only a small number of documents.

The prior assumption is that since the concepts in the model are mostly
generic concepts from the domain, the model would prove a benefit for
the most generic information needs in category C1 (marked in the
figure). The concepts in the model will give users a starting point for
formulating a query and the model-structures and the default query
expansion will help users to make a vague query more specific. For the
specifically formulated queries, the translation into a regular text only
search string would be more direct and the model would not be needed
to a similar degree. Along the number of relevant documents axis, it is
more difficult to assume the outcome, but generally queries that seek to
find one particular document would have to contain words specific to
this document and thus the model-based interface should perform
better where a number of relevant documents are expected.
As shown in the figure, the assumed benefit of the model-based retrieval
interface can be extended from C1 in two directions by way of 1)
linguistic enhancements (L), or 2) more detailed (specific) modelling
(M):

     • M: A model can be made as generic or specific as desired. More
       specific concepts and constructs may be added to the model. In
       our case, this was not done; the original KITH model was
       used unaltered.

     • L: The connection between document text and model concepts is
       found in the linguistic enhancements. To make the model more
       specifically related to the document collection, words from the
       documents should be added. However, as our linguistic analysis
       was performed on another, only partly overlapping, document
       collection than the one queried in the test, we have no guarantee
       that this will increase performance.
Domain model coverage
Another and somewhat orthogonal grouping of the queries was found by
examining the domain model’s coverage of the information need. Such a
grouping gives us a division into 3 groups:

     Group 1: the concepts referred to in the information need are well
      defined in the model, there exists relevant relations and model
      constructs for navigation and query composition. Information
      needs defined in this group should have a straightforward
      translation into a model-based query.

   Group 2: The concepts referred to in the information need exist in
    the model, but they do not completely cover the information need.

   Group 3: Related concepts exist in the model, but no direct
    translation exists from the model to the information need.
The prior assumption is that the model will prove a benefit in the cases
where the information need is well covered in the model (groups 1 and
2), but not where the model does not cover the information need well
(group 3).

                                        Table 8.4 Scoring
                                Score          Description

                                -1             Trash
                                0              Non-relevant or Duplicate
                                1              Related
                                2              Good

                        The total query score is calculated according to the formula:

                         $$Q_s = \frac{1}{2} \sum_{i=1}^{10} PD_i \cdot PP_i$$
     where $PD_i$ is the individual score for document $D_i$, and $PP_i$ is the weighting factor for position $P_i$.

                                          Position           Weight
                                          1                  20
                                          2                  15
                                          3                  13
                                          4                  11
                                          5                  9
                                          6, 7               8
                                          8, 9               6
                                          10                 4
If a query does not return a full set of 10 documents, the "empty" positions are given a score of 0.
Following this scheme, the maximum score for a query is 100 (2/2 * (20 + 15 + 13 + 11 + 9 + 8 + 8 + 6 + 6
+ 4)), while the minimum is -50 (-1/2 * (20 + 15 + 13 + 11 + 9 + 8 + 8 + 6 + 6 + 4)). If all ten documents
returned for a query are scored as related (score = 1), the query score becomes 50. A query with no hits
scores 0.

8.3.6.      Scoring and calculations
We adopt a query scoring and calculation strategy implemented by
commercial internet search engine companies (Gulla, 2003). For each
query, the users score each of the top-ten ranked documents according
to the scoring scheme presented in table 8.4.

Position weights are used to reward high rankings of relevant documents
and penalise high rankings of irrelevant documents. Calculating a total
score for each query enables a numerical comparison between different
search engines for the same query.
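
As an illustration only, the scoring scheme of table 8.4 can be written out in a few lines of Python. This is a minimal sketch and not the code used in the evaluation: the names POSITION_WEIGHTS, VALID_SCORES and query_score are chosen here for illustration, and rejecting lists of more than ten documents is an assumption.

```python
# Position weights for result positions 1..10, as listed in table 8.4.
POSITION_WEIGHTS = [20, 15, 13, 11, 9, 8, 8, 6, 6, 4]

# Allowed document scores: trash, non-relevant/duplicate, related, good.
VALID_SCORES = {-1, 0, 1, 2}


def query_score(doc_scores):
    """Return Qs = 1/2 * sum(PD_i * PP_i) over the top-ten positions.

    doc_scores holds the user's score for each returned document in rank
    order. Positions without a document count as score 0.
    """
    if len(doc_scores) > len(POSITION_WEIGHTS):
        raise ValueError("at most 10 documents are scored per query")
    if any(score not in VALID_SCORES for score in doc_scores):
        raise ValueError("document scores must be -1, 0, 1 or 2")

    # Pad short result lists with zeros for the "empty" positions.
    missing = len(POSITION_WEIGHTS) - len(doc_scores)
    padded = list(doc_scores) + [0] * missing
    return 0.5 * sum(s * w for s, w in zip(padded, POSITION_WEIGHTS))
```

For example, a result list whose three documents are scored good, non-relevant and related, with no further hits, gives 0.5 * (2 * 20 + 0 * 15 + 1 * 13) = 26.5.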

8.4. Results
This section presents the results of our evaluation. We start by presenting
the overall results (section 8.4.1) and a discussion of uncertainty factors
(section 8.4.2) before we investigate the results in more detail (section
8.4.3).

8.4.1.      Overall results
The plots of the overall query scores are shown in figure 8.7. The plots
show the average query score for each of the 16 queries. Figure 8.7a
shows a comparison of all the 4 search strategies, while figures 8.7b and
c show comparisons for the Odin searches and the Alltheweb searches
respectively. The actual numbers behind the plots are shown in figure
8.6.
The only conclusion to be drawn from the overall picture is that the
results are fairly similar. No search strategy stands out as more
successful than the others, and likewise, no strategy seems to be
significantly poorer. The averages of all query scores for each of the 4
strategies show no significant difference at all (table 8.5).
                                               Table 8.5
             CnS-Odin                        Odin                   CnS-Alltheweb           Alltheweb
               17,66                        17,65                        16,41                 16,72
                                        Overall average query scores

                                               Figure 8.6

Overall query scores: The table shows average query scores (across all users) for each query and for each of
the 4 search strategies. For each of the search strategies, the average score for the initial query (1) and the
optional refinements (2 & 3), as well as the average over all iterations, is shown.

Two main reasons for the small variation in query scores may be noted:
   As mentioned, the domain is sparse and queries generally return
    few relevant documents. The overall scores are low (approximately
    17 points on average, out of a possible 100).
   Our model-based searches are executed “on top of” the
    comparative engine which makes a huge difference unlikely.
    Queries are expressed within a limited domain and it is possible
    to foresee that the resulting searches will be quite similar.

                                                  Figure 8.7

        [Plots a-c not reproduced. X-axis: queries 1 to 16; y-axis: average query score;
         surviving legend entries: cns-odin, atw.]

                                  Overall average query scores per query.

The scores in figures 8.7a-c are calculated from average query scores
per user per query. If a user has refined a query several times, the
average score (across all refinements) is used in this calculation. Figure
8.8 a-c shows the plots of query scores based on first-iteration queries
only, i.e. with no refinement scores included. While the actual query
scores change slightly, the internal ordering of the scores is largely
unaltered. For Odin queries, only 1 query is affected (Q7), where the
model-based search wins on the average-based calculation and the regular
search wins on the first-iteration scores. With Alltheweb, only 3 queries
change internal positions: in queries Q1 and Q7 the model-based search
scores higher in the first-iteration queries, while the regular search
scores higher in Q2. Based on this, it seems safe to conclude that this
study does not show a significant effect of the iterated refinements on
the comparative relevance query scores. The reasons for this might be:

     sparse domain, few relevant documents exist.

     refinement in this trial is not pure search within search, rather a
      “second try”. Since none of the strategies support search within
      results, a refinement is just the submission of a modified search.
The query scores are calculated with position weights, and thus the
ranking of documents in the result set has an effect on the computed
query scores. With only a few documents returned, their internal ranking
is not always a relevant factor. In addition, our model-based ranking is
based on a simple matching of documents against the model and is not
computed by way of traditional indexing measures. It is therefore
interesting to look at how the scores would be computed without the
position weighting. Figure 8.9 shows the overall comparison, based on
calculations without the position weights, and for first-iteration
queries only. Again, removing the position weights does not alter the
internal ordering of the scores. In fact, only one query (Q1, Odin)
changes internal positioning compared to the original first-iteration
results.
It is difficult to conclude why ranking has so little effect on the
comparison in our results, that is, whether all ranking schemes are
equally good or whether the small number of relevant documents makes
ranking irrelevant in this case. Since we apply our ranking strategy to
the documents returned from the search engines, which are already ranked,
it is again unlikely that a major difference is possible. The small
number of returned documents would reduce the difference further. In any
case, we will use the initial query scores, which are based on ranking and
position weights, for the further examination in this chapter.

                                                 Figure 8.8

        [Plots a-c not reproduced. X-axis: queries 1 to 16; y-axis: average query score.]

               Overall average query scores per query. 1st iteration query scores only.

                                                          Figure 8.9

        [Plots not reproduced. X-axis: queries 1 to 16; y-axis: query score without position weights;
         surviving legend entries: cns-odin, cns-atw.]

       Overall query scores, no position weighting applied and hence no effect of ranking is visible in the
                                         computation of query scores.

8.4.2.       Uncertainty factors
Some possible uncertainty factors with our results require attention:
     Sparse domain
     User variation
     Effects of evaluation set-up and procedure

Sparse domain
The selected domain has few publicly available documents. In the log
sample of 330 combined searches, an average of 1012,18 and 7786,34
documents were retrieved from the model-based and regular searches
respectively (see note 77 below). The model-based search has a median
value as low as 0 returned documents (table 8.6), which means that at
least half of the searches do not return any documents.

 Note 77: These are the total numbers of documents reported by the two search engines. Cut-off values of
 20 and 50 were used to produce our desired result set of 10 documents, which means that not all of these
 were retrieved and reranked for relevance for each query.

                                             Table 8.6
                          Overall                      Alltheweb              Odin
                     Model         Regular    Model         Regular    Model    Regular
          Average    1012,18       7786,34    1711,26       13603,48   94,0     145,9
          Median     0             92         0             112,5      15       43
                                   Number of returned documents

The low number of returned documents, and hence the low query scores,
give very little variation in the results and make it difficult to
distinguish significantly between the scores from the various search
strategies. However, we see that the model-based searches in general
return fewer documents than the regular searches. This is explained by
the default query expansion performed in the model-based searches. Where
the overall average number of concepts in a model-based search is 2,91,
the average number of words in the expanded search, which is sent to the
search engine, is 26,54. This is a considerably larger number than the
average number of words in a regular search (2,4). Not all the words in
an expanded search represent a specialization of the query expression, as
this depends on the and/or combination in use, but the numbers indicate
that the model-based searches are more specific than the text-only
searches. In the general case, the more specific a search, the fewer
documents are returned. In our sparse domain this results in several
queries with very few returned documents and, given the query-score
formula, a lower attainable query score.
Another question is whether there is a significant overlap among
documents returned from the text and model searches. Table 8.7 shows
the number of documents in overlap between the top 10 documents for
each query (the result lists presented to the users). By overlap, we mean
the number of documents that exist in both lists, regardless of their
position in the lists.
                                             Table 8.7
                          Overall             Alltheweb                Odin
           Average           0,55                 0,13                 1,12
           Median              0                   0                    0
                                   Document overlap in result sets
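
The overlap measure behind table 8.7 can be sketched as a set intersection over the two top-10 lists. This is a minimal illustration and not the code used in the evaluation: the function name result_overlap is hypothetical, documents are assumed to be identified by a string such as a URL, and rank positions are assumed to be ignored.

```python
def result_overlap(model_results, text_results, cutoff=10):
    """Count the documents that occur in both top-`cutoff` result lists.

    Each argument is a ranked list of document identifiers. Only membership
    matters here; the positions within the lists are ignored.
    """
    top_model = set(model_results[:cutoff])
    top_text = set(text_results[:cutoff])
    return len(top_model & top_text)
```

With hypothetical identifiers, result_overlap(["a", "b", "c"], ["c", "d"]) counts one shared document, and an empty model result list always gives an overlap of 0, which is consistent with the median of 0 in the table when many model-based searches return nothing.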

An inspection of the combined searches shows, however, that there is
a high overlap between the words in a text search and the names of the
concepts used. This would mean that either it is easy to find query
terms in the query descriptions that are similar to the model
concepts, or users have learned the model and apply the concept names
in their text searches as well.

User variation
Users' perceived relevance is the key measure in our test. Variations in
user “performance”, that is, domain knowledge, accuracy of query
formulation and not least perceived document relevance, will therefore
represent a likely source of error.
Figure 8.10 shows plots of the user score for each of the 4 query
strategies. As the figure shows, user variation is generally high. This
holds for all of the queries and for all of the search strategies.
From the query log, we see that there is small variation in the number of
words and concepts the users have applied in their formulated queries.
The average number of concepts in a model search ranges from 2,3 to 3,3
across users, while the average number of words in a regular search
varies from 1,9 to 3,3. The minimum and maximum numbers of
concepts illustrate some difference, where one user has applied a
maximum of 4 concepts in his/her model-based queries and another as
many as 10 concepts in a query. However, we lack the knowledge of
which particular query and what exact concepts these numbers stem
from, and cannot draw any specific conclusions from this.
The largest variation in user performance visible in the log file is the
number of documents returned from regular searches, where the
average ranges from 13,79 to 43,2. This indicates a significant variation
in the actual words of the query formulation. For model searches, the
variation in the number of returned documents is much lower, and not
significantly different (from 1,3 to 2,33).

     User motivation
      The test represented a huge task for the users. Potentially, each user
      had to formulate 96 queries and correspondingly rank 960
      documents. While the actual number of reported queries was only
      slightly more than half of that (approximately 52 query formulations
      per user on average), this is still a formidable task. The task was
      initially given with a deadline of 1,5 weeks. While one user actually
      delivered on that deadline, the last set of results was not delivered
      until 4 weeks later. No explicit measure was taken to discover
      variations in motivation and applied effort for query formulation and
      document relevance inspection; this is difficult to measure, and it was
      important for the users not to feel “on trial”. All users were familiar
      with regular web search and the Referent Model Language.
      A more important issue is that the queries do not represent an actual
      information need for these users (Gordon & Pathak, 1999; Hawking
      et al., 2001). Hence, the users will have differing views as to what
      is relevant for a query, which affects their ability to determine
      whether a document is relevant or not.

   User Domain knowledge
    None of the users were domain experts. Half of the users were foreign
    citizens, with low previous exposure to Norwegian public health
    services. It is safe to assume that the domain knowledge varied.
    However, most of the query topics were formulated as “beginners”
    questions, that is, general information about the domain, and hence
    the variation in background knowledge should not significantly
    hamper the relevance judgments.

   Variance in perceived relevance
    Perhaps more interesting than the pure numbers of the query logs
    are the variations in query scores per user for a given search
    strategy. Figure 8.10 shows the plots for each of the users that
    completed the test, one plot for each search strategy. While the plots
    show that a user in general follows his/her own path, significant
    variations exist. For each of the search strategies, 2-3 queries have a
    difference in query score of more than 50 points. Most suspicious are
    the extreme values where one user has scored a query with 0 points and
    another has given a high score (e.g. query 8 in the CnS-Odin search
    strategy). Another variation that is difficult to understand is the case
    where a user has generally scored low, and then suddenly scores top
    for one or two queries. It is difficult to conclude anything from the
    material we have, other than to state that there are in some cases high
    variations in user perceptions of what is relevant.
    However, this is the case in web search as well, and this is the
    motivation for applying an evaluation based on perceived user
    relevance. What is considered relevant, and the perception of the
    purpose of a given query, is highly user-dependent (Hawking et al.,
    2000; Gordon & Pathak, 1999; Loose & Paris, 1999).

Side effects caused by the evaluation procedure
We can foresee unwanted effects due to the design of the test
procedure, but we do not have log data or result data that can verify
that such effects exist. Two general effects should however be
noted: the learning effect and the carry-over effect.
The learning effect refers to how a user gradually becomes familiar with
the task, and thus adjusts the performance throughout the evaluation. In
our case, we may assume that the users have both familiarized
themselves with the domain throughout the test and to some extent also
learned how to improve their searches. After formulating and evaluating
52 queries, it would be safe to assume some sort of learning among the
users.

                                                 Figure 8.10

        [Plots not reproduced: one panel per search strategy. X-axis: queries q1 to q16;
         y-axis: query score; legend: user1 to user5.]

                                            Variations in user scores

To counter the learning effect, the sequence of the queries should have
been randomized among the subjects. This was not done in our test. In
fact, the design of the evaluation form – a stapled document with 1
query per page and where the queries appeared sequentially from 1 to
16 – would encourage most users to run the test from 1 to 16.
However, the result-data do not show any increasing trend in query
scores (as can be seen from figure 8.10). The only visible change over
time is the number of iterations users were willing to try. While all users
reformulated query 1 two times (the maximum allowed), only 2 users
did the same for queries 2-8 (different users for each of the queries) and
for queries 9 to 16 no user reformulated any query 2 times.
The carry-over effect in our case refers to the cases where users might
remember the model concepts when they are formulating text searches.
As mentioned, there is a degree of overlap between the wording of the
text searches and the concept names in the model. The way the CnS-client
was set up for the test actually encourages this, since the model
was visible to users both when entering their text searches and when
entering their model searches. This was done in order to facilitate the
users' task of formulating and posting a query repeatedly. However, this
was pointed out in the presentation of the task to the users, and the
rule of thumb was to hide the model when formulating the text search. In
any case, after the first few searches, the users would have been familiar
enough with the model to gain an effect from it when formulating their
regular text searches. As discussed above, however, the scores do not
show an increasing trend.
Buggy implementation
The implementation is at a prototype level, and some bugs are still
present. The test was monitored and bugs were corrected upon
discovery/request. Two kinds of problems may have affected the
results:

   Odin interface problems: Odin truncates long search strings. Our
    expanded search strings (average 26 terms, where a term might
    be multiword) tend to generate long strings. Some, but not all
    searches were cut off. That means an initial OR combination
    would in some cases change into an AND combination, and we
    end up with a much stricter query than initially intended.

   Slow response/java exceptions. In some cases, a search was slow,
    especially if a lot of documents had to be scanned, i.e. were not in the
    local cache. Depending on the users' configuration of their Java runtime
    environment, they would not detect exceptions from failed
    searches. (This happened initially due to encoding problems and
    variable document formats.) Some users might have lost patience
    or misinterpreted a slow or failed search as a zero-result search.

      These bugs were corrected when discovered, but some users that
      completed the evaluation early did not report problems with this.
      Either they did not have problems, or they incorrectly marked a
      search as having zero hits.

8.4.3.     Results in detail
Since it is difficult to draw any conclusions from the overall results, we
have to investigate the results in some more detail. In particular, we
examine:

     Query scores with respect to the 4 query categories

     Query scores with respect to the 3 query groups

     Individual user scores

     What are the characteristics of the queries where we win
      compared to the ones we lose?
Figure 8.11 shows the average query scores and the plots for each of the
4 search strategies and for each of the 4 categories. Our prior
assumption was that we would win in the most generic category (C1).
The average query scores show that we win by 7 points over ODIN and 4
points over Alltheweb. The statistical plots show a small difference in
medians compared to both search engines, but statistical analysis yields
no significant difference. For C1, the Wilcoxon signed rank test
produces a P-value of 0.1250 for the ODIN results and a P-value of
0.625 for the Alltheweb results. Especially for ODIN, the value can be
said to show a tendency for this category, but it is not significant. As
is shown in the plots, the variance in the query scores is too large to
expect a significant value.
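
The P-values reported above can be put in context with a small calculation. The sketch below enumerates all sign patterns of an exact two-sided Wilcoxon signed rank test. It is an illustration and not the statistical procedure used in the evaluation: the function name wilcoxon_exact_p and the example differences are hypothetical, and it assumes that a category is tested on its 4 paired query scores, with no zero or tied differences. Under that assumption the smallest attainable two-sided P-value is 2/16 = 0.125, so no outcome on 4 pairs can fall below 0.05.

```python
from itertools import product


def wilcoxon_exact_p(differences):
    """Exact two-sided P-value of the Wilcoxon signed rank test.

    Assumes no zero differences and no ties among the absolute
    differences, so that the ranks 1..n are unambiguous.
    """
    n = len(differences)
    magnitudes = sorted(abs(d) for d in differences)
    if 0 in magnitudes or len(set(magnitudes)) != n:
        raise ValueError("zero or tied differences are not handled in this sketch")

    rank_of = {m: r for r, m in enumerate(magnitudes, start=1)}
    total = n * (n + 1) // 2
    w_plus = sum(rank_of[abs(d)] for d in differences if d > 0)
    observed = min(w_plus, total - w_plus)

    # Under the null hypothesis every assignment of signs to the ranks is
    # equally likely, so count the assignments that are at least as extreme.
    extreme = 0
    for signs in product((False, True), repeat=n):
        w = sum(rank for rank, positive in zip(range(1, n + 1), signs) if positive)
        if min(w, total - w) <= observed:
            extreme += 1
    return extreme / 2 ** n
```

Four hypothetical score differences that all favour the same strategy, e.g. wilcoxon_exact_p([7.1, 3.0, 1.5, 9.9]), give 0.125, while a mixed pattern such as wilcoxon_exact_p([5.0, -1.0, -2.0, 7.0]) gives 0.625.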
For the other categories, there are no significant differences. On average,
regular Alltheweb performs best in C2, model-based queries using
Alltheweb (CnS-ATW) in C3 and regular ODIN in C4. An interesting issue
is the fact that we win over regular Alltheweb in C3, i.e. the most specific
queries. Again, this is not a significant result (P-value of 0.1250). A
possible explanation for this would be that even if these queries are
specific in nature, it could be difficult to find the appropriate words to
formulate a regular text-only query.
As mentioned, categorisation of queries is a difficult task. In particular,
how many documents a user would accept in a response is difficult to
guess in advance. An analysis of a conflated categorisation, where the
number-of-expected-documents dimension is removed and the queries
are simply divided into generic and specific, did however not reveal any
new findings.
                                             Figure 8.11

                                   CnS ODIN   ODIN    CnS ATW   ATW     Average
                   C1 (G/N):       25,85      18,75   22,15     18,03   21,19
                   C2 (G/1):       11,08      13,88   10,68     21,43   14,26
                   C3 (S/1):        8,50      12,30   15,05      8,98   11,21
                   C4 (S/N):       18,28      20,30   12,02     11,98   15,64

        [Plots not reproduced: four panels, one for each of the Category 1 to Category 4 queries.
         X-axis: search strategy (CnS ODIN, ODIN, CnS ATW, ATW); y-axis: query score, from -10 to 50.]

Results per query category. a) Average query score per category for each of the 4 search strategies. b) Plots
of the 4 queries in each category for each of the search strategies. The lines show the median value.
Categories are ordered in the same sequence as they were specified (figure 8.5).

Figure 8.12 shows the plots of the query scores for each of the 3 query
groups. The prior assumption here was that we would perform well in
groups 1 and 2 (where the topics of the queries are well covered in the
model), but would lose in group 3 (poor model coverage).
As the figure shows, no specific difference is visible throughout the
various groups. Again, surprisingly, we perform slightly better than
Alltheweb in the group 3 queries (P-value = 0.1250). In principle, the
model should provide little or no support for these queries. This is
similar to the effect we noticed in category C3. In fact, 3 of the 4 queries
are shared between group 3 and C3, and the explanation could be similar.

                                                    Figure 8.12

        [Plots not reproduced: three panels, one for each of the Group 1 to Group 3 queries.
         X-axis: search strategy (cns-odin, odin, cns-atw, atw); y-axis: query score, from -10 to 50.]

 Results per query group. The figures show the plots of the average query score for each of the queries in
 the group and for each of the 4 search strategies. Lines show the median value.

User variation
As we have seen, the variation in query score is high, which makes it
difficult to get any significant results from the statistical analyses. The
high variation is a result of the variation in user scores. It is therefore
interesting to investigate the individual user results for the
generic-specific dimension, in order to try to understand these results in
more detail.
Individual user scores exhibit high variance, even within a single user's
results. Figure 8.14 shows the computed variance over individual user
scores for each query, separated into generic and specific queries. One
general remark can be made: the user variation is generally lower for
the specific queries than for the generic queries. From the data, only
one user shows a large variation for the specific queries.
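
A per-query variance of the kind plotted in figure 8.14 could be computed along the following lines. This is a sketch under stated assumptions and not the calculation used for the figure: the function name variance_per_query and the example scores are hypothetical, every user is assumed to have a score for every query, and the sample variance is used because the text does not say whether sample or population variance was applied.

```python
from statistics import variance


def variance_per_query(scores_by_user):
    """Return the sample variance of the user scores for each query.

    scores_by_user maps a user name to that user's query scores, listed in
    the same query order for every user. At least two users are required.
    """
    # zip(*...) regroups the per-user lists into one tuple of scores per query.
    per_query = zip(*scores_by_user.values())
    return [variance(scores) for scores in per_query]
```

With three hypothetical users who score two queries as 10/20/30 and 40/0/20, the function returns variances of 100 and 400: the second query has the same mean score of 20 but a much wider spread across users.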

                                             Table 8.8

          Q2         Q3         Q4         Q8         Q12        Q13        Q14        Q16        Total
          O    A     O    A     O    A     O    A     O    A     O    A     O    A     O    A     O    A
  U1      W    W     W    W     L    -     W    L     L    -     -    L     L    L     W    W     4-3  3-3
  U2      -    -     W    W     W    W     L    -     W    -     L    L     L    L     W    W     4-3  3-2
  U3      L    L     -    -     -    W     L    -     L    W     -    L     L    L     W    L     1-4  2-4
  U4      -    -     L    L     W    -     W    L     -    W     W    -     W    L     W    -     5-1  1-3
  U5      -    -     L    -     -    w     -    L     l    -     -    -     -    l     l    w     0-3  2-2
  Total   1-1  1-1   2-2  2-1   2-1  3-0   2-2  0-3   1-3  2-0   1-1  0-3   1-3  0-5   4-1  3-1

                         Individual user evaluation of the generic queries

                                                                        Figure 8.14

     [Plots not reproduced. Panel a) series: generic cns-odin, generic odin, specific cns-odin, specific odin.
      Panel b) series: generic cns-atw, generic atw, specific cns-atw, specific atw.]

                            Variance in user scores generic vs. specific queries: a) Odin b) Alltheweb

     Semantic modelling of documents                                                                                                                          221
For the generic queries, we may summarise the results as wins and
losses (table 8.8). A win or a loss is calculated simply from the
difference in query score between the model-based search and the
regular search, for each search engine. We only record a win or a loss for
queries where the difference in query score is at least 10 points. The
total columns show how many wins and losses we record, with the
result of the model-based search shown first. For example, a 4-3
result means the model-based queries won 4 times and lost 3 times.
Again, the variation is visible, both for the individual users and, more
importantly, for each query. Only for a few queries can we single out a
clear trend in a positive or negative direction, and then only for one of
the search engines.

Model topics
So far, we have not analysed the results in connection with the actual
domain model: which parts of the model performed well and which parts
performed poorly? Which concepts and model structures were included in
queries that produced a higher, or lower, query score for model-based
queries compared to regular queries?
To examine this, we need to divide the queries into “winning”
and “losing” queries, by calculating the difference between model-based
scores and regular scores as follows:
  Diff(q_i) = (QS_cns-odin - QS_odin) + (QS_cns-atw - QS_atw), where
  QS_<search-strategy> is the average first-iteration query score for query i for that
  search strategy. A positive difference means the model-based strategies in
  total had a higher query score than the regular search strategies, i.e. a
  winning query, and a negative difference indicates a losing query.

  We can then mark each query as a winner or a loser, such that:
  Clear win     for queries with Diff >= +10
  Win           for queries with +5 < Diff < +10
  Lose          for queries with -10 < Diff < -5
  Clear lose    for queries with Diff <= -10
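
The marking rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `diff` and `mark` are hypothetical, and the abbreviations cw, w, l and cl are those used in Figure 8.15a. Values with |Diff| <= 5 receive no mark.

```python
def diff(qs_cns_odin, qs_odin, qs_cns_atw, qs_atw):
    """Diff(q_i): summed advantage of the model-based search over both engines."""
    return (qs_cns_odin - qs_odin) + (qs_cns_atw - qs_atw)


def mark(d):
    """Map a Diff value to cw / w / l / cl, or '-' when no mark applies."""
    if d >= 10:
        return "cw"   # clear win
    if 5 < d < 10:
        return "w"    # win
    if -10 < d < -5:
        return "l"    # lose
    if d <= -10:
        return "cl"   # clear lose
    return "-"        # |Diff| <= 5: neither a win nor a loss


# A few Diff values as listed in Figure 8.15a.
for d in (23.0, 6.2, 2.1, -8.7, -41.8):
    print(d, mark(d))
```

Note that the bands are open at +5 and -5, so a query with a Diff of exactly 5 is left unmarked under this reading of the rules.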

Figure 8.15a shows the differences and the division of queries into
winners and losers. Figure 8.15b shows a marking of the concepts in the
model according to whether they are included in a clear-win, win, lose or
clear-lose query. If a query refers to several concepts, all of them receive
a mark. From figure 8.15b we see a tendency for queries referring to
leaf-nodes in the hierarchical structures to be among the losing
queries, while queries that refer to the more general concepts in the
same hierarchy are among the winning queries. Table 8.9 shows the
counts and percentages for different kinds of concepts in the winning
and losing queries.

                                                           Table 8.9
                               Leaf-concepts                 Generic concepts                Non-hierarchic
          Clear-win            0                             4 (80%)                         1 (20%)
          Win                  2 (67%)                       1 (33%)                         0
          Lose                 0                             1 (50%)                         1 (50%)
          Clear-lose           5 (100%)                      0                               0
                            Types of concepts included in winning and losing queries

The default behaviour of the CnSClient when a generic concept is
selected is to expand all its sub-concepts and add them to the query. However,
we do not know whether this expansion was accepted by the users
or deselected before the query was executed. In any case,
it seems that the model-based queries are able to expand a
generic and vague query with more precise concepts to produce relevant
hits. For leaf-nodes, no such model-based expansion is performed.
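
As a minimal sketch of this kind of expansion, assume the hierarchy is available as a mapping from a concept to its direct sub-concepts. The hierarchy fragment, the function names `expand` and `to_query` and the OR-based query notation are all hypothetical and chosen for illustration only; they are not taken from the CnS client.

```python
from collections import deque

# Hypothetical fragment of a concept hierarchy: concept -> direct sub-concepts.
SUB_CONCEPTS = {
    "health service": ["primary care", "hospital care"],
    "primary care": ["primary doctor", "health station"],
}


def expand(concept, hierarchy):
    """Return the concept followed by all of its sub-concepts, breadth-first."""
    expanded, queue = [], deque([concept])
    while queue:
        current = queue.popleft()
        if current not in expanded:
            expanded.append(current)
            queue.extend(hierarchy.get(current, []))
    return expanded


def to_query(concepts):
    """Join the concepts into a disjunctive text query (assumed notation)."""
    return " OR ".join(f'"{c}"' for c in concepts)


print(to_query(expand("health service", SUB_CONCEPTS)))
print(to_query(expand("primary doctor", SUB_CONCEPTS)))
```

A leaf concept such as "primary doctor" has no entry in the mapping, so its query is returned unchanged, which mirrors the observation that no expansion takes place for leaf-nodes.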

                                                       Figure 8.15

     Query         1       2       3       4       5       6       7       8       9      10      11      12      13      14      15      16
     Diff      -6,20   -1,50   23,00   15,00   -8,70   -8,80    2,10   -9,90   11,50    0,77    4,30    4,80  -15,70  -41,80    6,20   16,80
     w/l          cl       -      cw      cw       l       l       -       l      cw       -       -       w      cl      cl       w      cw

     (cw = clear win, w = win, l = lose, cl = clear lose, - = unmarked)

 Winning and losing queries wrt model constructs: a) winning and losing queries, Diff values.
 b) concepts included in winning and losing queries [part b) not reproduced]

The log has shown that model-based queries retrieve substantially fewer
documents than a regular query. The model-based and linguistic
expansions generally narrow down a search. For queries initially generic,
the results give an indication that the model is able to pick out the
relevant documents from the collections more precisely; while for the
initially specific queries the model-search becomes too limited and
narrow, and either misses the target or retrieves too few documents.
The test users commented that, for specific queries, the model
often did not contain the specific words they were looking for, and it was
thus easier simply to type these into a regular text search.
In sum, the analysis of winning and losing queries with respect to model
topics underpins the trend we have seen so far: the model-based
queries perform better for queries that are more generic. Our
understanding of this is that the model helps us specify generic queries
in cases where users are not familiar with the detailed domain
terminology. An additional conclusion is that the
quality and granularity of the domain model are important and influence
the query results.

8.4.4.      User feedback
After the users had completed the evaluation task, all of them were
given the option to comment on the evaluation procedure, the
CnS-system and the results. There was no particular structure to these
“interviews”; users were free to give comments.
The users' comments can be summarised as:

     The model-based retrieval system provides a quicker and more
      direct formulation of a query, if relevant concepts are included in
      the model, with its direct selection of concepts, rather than
      typing. Some users felt it also gave a more confident query, if they
      were uncertain about the domain language. Some users, perhaps
      due to being non-native Norwegian speakers, in particular felt
      more comfortable selecting concepts than typing long domain
      specific words.

     Almost all users noted the lack of possibility to augment a model
      search with additional information. Users would prefer to be able
      to combine text and model queries, not only in a generic text-field,
      but also as instances of model concepts, in cases where a
      relevant class concept exists.[78] In particular this was true for
      query topics that would tend to cover specific locations (e.g.
      Trondheim or Norway) or time (today, in the future), which are not
      covered in the model at all. One user suggested that such generic
      and somewhat orthogonal concepts could be included in another
      model and perhaps inherited from existing ontologies.

  [78] In the general specification of our approach (chapter 5) these options are included. However, they were
 not part of the implementation that was used for the retrieval trial. The general text field was used to enter
 the regular queries, while the instantiation of concepts is not fully implemented.

     Also, the concepts in the model are mostly nouns and none of the
     relations are named, so if a user searches for an action (e.g.
     “bytte fastlege” – change primary doctor), the model can find
     documents about a primary doctor, but not about changing this.

    The quality of the linguistic expansion (the domain model lexicon)
     appears to vary. Even when users used similar words in the
     model and text queries, results could in some cases differ
     considerably. In particular the concepts “EPJ” (electronic medical
     record), “helsekort” (health card – a particular medical
     documentation that is used in schools) and “helsestasjon for
     ungdom” (health station for youth) performed poorly.

    One user suggested that the lexicon could be used to “profile” the
     search according to the expected language or audience. As an
     example, the concept of patient is in the model represented as
     “user” (Bruker). This is highly official language. For most of the
     public audience, “patient” would be the preferred term. If the
     model should be used to retrieve documents relevant for the
     general public, one lexicon could be used, and a different lexicon
     could be used if the target audience would be official health

    Precision: Three users made the remark that the model based
     queries did seem to produce fewer hits than the regular text-only
     queries, but almost the same number of relevant documents. This
     is reflected in the log-files and the query scores.

8.5. Summing up
In general, the results show no big difference between the different
search strategies, for either of the two comparative search engines. No
statistically significant claims can be made based on these results. The
two main reasons for the result are:

    Few relevant documents are found in the domain. In general,
     query scores are low for all queries across all search strategies.

    Large variance in user-reported query scores.

  Again, the possibility to name un-named relations is specified in the general approach, but this is
 implemented for pre-classified documents only. The applied search-engine interface does not support this.

The positive aspects from our point of view are:

     Win in generic queries: A tendency can be shown that we perform
      well for generic queries. This would fit with the prior assumptions,
      i.e. that the model would help users formulate a more precise
      query from an information need that is initially unspecified. The
      results that underpin this assumption are in particular the results
      from the generic category queries (C1) and the placement of the
      “clear win” queries in the model. The fact that we also win in the
      group 3 queries (poor coverage of query topic in model), also
      indicates that we help the users in formulating rather precise
      queries for topics that are not initially clearly formulated.

     More precise result-sets: A tendency that has been indicated by
      the users and which is visible in the data is that the model-based
      search strategy results in fewer returned documents per query
      but gives roughly the same number of relevant documents.
      This is backed by the log-file and the overall average query scores.
      The number of returned documents is substantially smaller for
      model-based queries and the overall query scores are roughly
      equal. This may indicate that model-based queries are able to
      select the relevant documents more precisely. An interesting
      point is that there is very little overlap between the result-sets of
      model-based queries and regular queries.

The negative aspects of the evaluation are:

     The lack of a clear-cut result. While this does not prove the
      approach useless, neither does it help us verify the assumptions
      regarding the benefits of the approach.

     In addition, the mentioned limitations (section 8.2.3) and the
      listed uncertainty factors (section 8.4.2) dampen the suggested
      positives above.
In sum, we conclude that the system seems to perform well in that it is
able to narrow down a search and helps the users to apply more specific
terms. In particular this seems to be the case when the users do not
know the complete and detailed domain terminology beforehand.
However, this only works if the quality of the domain model is high and,
further, if the domain model lexicon accurately reflects the words in
the document text. The latter has not been the case in our trial and
seems to hamper the result, especially in combination with the sparse
population of documents in the domain. With few relevant documents
available, the actual wording of a query is important in order for a
retrieval system to locate these.

Suggestions for further evaluations:
     “… user-based evaluation would seem to be much preferable over
       system evaluation: it is a much more direct measure of the
       overall goal. However, user-based evaluation is extremely
       expensive and difficult to do correctly.”
                                                      (Vorhees, 2001)
Due to time and resource limitations, further evaluation has not been
possible thus far. However, based on the discussion in this chapter, we
suggest the following points to be considered for the next round of
testing of the system:

   Improved use of CnS-Client functionality; some of the functionality
    the test users asked for actually exists in the client, but was
    restricted in this evaluation, for different reasons.

   Select a highly populated domain, in particular ensure that all
    queries potentially could retrieve a full set of relevant documents,
    i.e. ensure enough relevant documents to be ranked for relevance.

   More careful construction of domain model lexicon, in particular
    to use the actual document collection that later will be queried.

   To facilitate users' relevance considerations, select a domain and a
    query set where they are more likely to have their own ideas about

   Define a query set with more clear-cut categories, in particular
    along the generic-specific dimension, in order to verify whether or
    not the difference we have observed is real.

   Randomise tasks (queries and search strategies) among subjects,
    in order to avoid learning and carry-over effects.
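
The last point could be realised along the following lines. This is a sketch under assumptions: every subject performs every combination of query and search strategy, and only the order is randomised per subject. The function name `randomise_tasks` and the `seed` parameter are hypothetical; a balanced design such as a Latin square would be an alternative to plain shuffling.

```python
import random


def randomise_tasks(subjects, queries, strategies, seed=None):
    """Give each subject all (query, strategy) tasks in an independently shuffled order."""
    rng = random.Random(seed)  # a fixed seed makes the plan reproducible
    tasks = [(q, s) for q in queries for s in strategies]
    plan = {}
    for subject in subjects:
        order = tasks[:]       # copy, so each subject gets an independent shuffle
        rng.shuffle(order)
        plan[subject] = order
    return plan


plan = randomise_tasks(
    subjects=["U1", "U2", "U3"],
    queries=["Q2", "Q3", "Q4"],
    strategies=["cns-odin", "odin", "cns-atw", "atw"],
    seed=2004,
)
for subject, order in plan.items():
    print(subject, order[:3])
```

Storing the seed together with the evaluation log would allow the task order to be reconstructed when the results are analysed.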

                                             9. Concluding remarks

We have presented an approach to semantic document description and
retrieval that is based on conceptual modelling and linguistic analysis.
The main contributions of this work are:

   The specification and prototype realisation of the approach, where
    the Referent Model Language is used to create a model of the
    domain referred to by the documents. This domain model is then
    used directly in a user tool for document classification and
    retrieval. Documents are classified using - possibly instantiated -
    fragments of the domain model. Querying is performed by
    mapping an initial text-only query into the domain model, before
    model-based query execution and refinement are performed
    interactively by the user.

   The use of RML for the tasks of document description and
    retrieval is formalised through the definition of referential
    semantics of RML, the definition of a model-based boolean query
    strategy and a corresponding model-based query notation.

   In order to enhance our approach and to facilitate the users'
    interaction with the system, we have developed a way of
    supporting our approach using lexical analysis:

     -   A linguistic analysis process in support of constructing
         the domain model from a domain specific document
         collection is specified and realised.

     -   The domain model is augmented with a domain model
         lexicon, which supports automated matching of
         document text against the model. Based on this
         automatic matching, the system presents the user with
         an initial suggestion for the description of the document
         at hand.

     -   The domain model lexicon also enables the mapping of
         natural language queries into a model-based query
         expression that serves as a starting point for
         model-based query execution and refinement.

   A comparative perceived relevance evaluation has been performed
    with a set of test users on a limited but real-world case. The results
    show a tendency, though without adequate statistical significance, that
    the model-based retrieval performs better for queries generic in
    nature and in general produces more specific results than the
    regular search.
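
To illustrate the role of the domain model lexicon in the list above, the matching of document text against the model can be sketched as a simple lookup of word forms. The lexicon entries, the function name `suggest_concepts` and the counting scheme are assumptions made for this example; the actual lexicon is built through the linguistic analysis process and may be considerably richer.

```python
import re
from collections import Counter

# Hypothetical lexicon: model concept -> word forms that signal it in the text.
LEXICON = {
    "primary doctor": {"fastlege", "fastlegen", "fastleger"},
    "patient": {"pasient", "pasienten", "bruker", "brukeren"},
    "health card": {"helsekort", "helsekortet"},
}


def suggest_concepts(text, lexicon, min_hits=1):
    """Rank concepts by the number of text tokens that match their lexicon entries."""
    tokens = re.findall(r"\w+", text.lower())
    hits = Counter()
    for concept, forms in lexicon.items():
        count = sum(1 for token in tokens if token in forms)
        if count >= min_hits:
            hits[concept] = count
    return hits.most_common()


document = "Pasienten kan bytte fastlege. Ny fastlege velges her."
print(suggest_concepts(document, LEXICON))
```

The ranked list is only an initial suggestion; in the approach described here the user still decides which concepts actually describe the document.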

In chapter 1, we listed the following objectives for our work:
      1. To explore and understand the requirements for document
         management in such intranet settings as those outlined above.
      2. To investigate how a given semantic modelling language can be
         used to represent the semantics of documents.
       and then:
      3. To investigate if the semantic modelling language can be used
         directly in a tool that will assist the users in their semantic
         classification and retrieval of documents.

In the sequel, we discuss our contributions according to these objectives.

1. Understand requirements for cooperative document management settings
The literature survey in chapter 2 explored the theoretical background and
investigated several different approaches, from state-of-the-art IR
techniques, semantic web and ontology-based approaches to knowledge
management approaches. Based on this, we outlined the following
principles for a model-based document classification and retrieval
       Understandable to humans
       Constitute an explicit representation
       Interface standard IR machinery
       Support interoperability
We have applied the Referent Model Language (RML) - a conceptual modelling language based
on set theory and with a simple graphical notation. Like any conceptual modelling language, this
language is intended for presentation of domain semantics among the stakeholders and domain-
users. The modelling language is a tool for cooperative construction of visual, explicit
representations of domain semantics. The application of such a language implies a focus on a
human readable domain model. In the approaches we have seen from the semantic web,
application of semantic networks in IR and ontology-based approaches, this is not always the
case. For different reasons and depending on their objectives, many ontology-based approaches
have a focus on machine-readable domain representations.

Our goal has not been to develop new IR machinery, but to be able to
interface with existing machinery. While we have implemented our own
components for classification and retrieval in our prototype realisation,
we have specified a mapping from RML-based representations into
representations in the Resource Description Framework (RDF). For the
evaluation of our approach, we devised an interface from model-based
query formulations into the regular query notations used in two publicly
available search engines. The latter only covers one half of our
approach, i.e. the retrieval task. The RDF mapping and the search
engine interface represent the first steps towards realising our
approach using standard IR machinery.
In the connected world of intranets and the web, interoperability is a
necessity, either between different views of the same domain, related
domain representations or between different domains. The Referent
Model Language includes mechanisms to handle semantic
interoperability between models and model views and the CnS client
provides the ability to apply several models for both classification and
retrieval. This is, however, an issue we have not investigated in detail and
the interoperability aspects of our approach have not been specified or
evaluated in this thesis.

2. Applying the semantic modelling language to represent semantics of documents
In our approach, the semantics of documents is represented by way of a
Referent Model fragment. We have defined the referential semantics of
RML, which defines the semantics of a domain model by way of its
references to documents in the domain document collection. The
semantics of a document is given by its interpretation as a domain model
fragment. In our approach, we expect the user to be able to perform this
interpretation manually. By connecting the model to a domain model
lexicon, the system can provide a user with suggestions based on term
occurrences in the document text, but the user interpretation means
that there is not a direct mapping between terms, their document
occurrences and the domain model concepts. A query is represented as
a domain model fragment. Document retrieval is performed by
collecting the set of references from the model fragment that represents
the query.
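
A minimal sketch of this set-based view of retrieval is given below. It assumes that each concept refers to a set of document identifiers and that a query fragment is either a single concept or a boolean combination of fragments. The reference data, the tuple notation and the function name `retrieve` are hypothetical; the referential semantics of RML itself is defined elsewhere in this work and is not reproduced here.

```python
# Hypothetical references: concept -> identifiers of documents classified with it.
REFERENCES = {
    "primary doctor": {"d1", "d2", "d4"},
    "patient": {"d2", "d3", "d4"},
    "health card": {"d5"},
}


def retrieve(fragment, references):
    """Evaluate a fragment: a concept name or a nested ("AND" | "OR", operand, ...) tuple."""
    if isinstance(fragment, str):
        return set(references.get(fragment, set()))
    operator, *operands = fragment
    if not operands:
        raise ValueError("a boolean fragment needs at least one operand")
    sets = [retrieve(operand, references) for operand in operands]
    if operator == "AND":
        return set.intersection(*sets)
    if operator == "OR":
        return set.union(*sets)
    raise ValueError(f"unknown operator: {operator}")


query = ("OR", ("AND", "primary doctor", "patient"), "health card")
print(sorted(retrieve(query, REFERENCES)))
```

A concept without any references yields the empty set, so a conjunction that includes it retrieves nothing.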
To support the users' tasks, we have devised a model-based visualisation
of documents. The visualisation is important in our approach as it
enables user-interaction with the model for document description and
retrieval and it helps the user understand the semantics of a domain
model in terms of the document collection at hand. For retrieval, we
have realised a model-based query strategy and notation.

3. Applicable in a tool to assist users in semantic classification and retrieval
The CnS client represents our attempt at the realisation of such a tool.
The evaluation of our approach was a user oriented retrieval experiment
in order to investigate the effects of model-based query formulation in
terms of perceived retrieval quality.
The positive outcomes of the evaluation that support the applicability
of our approach were:

   Users expressed satisfaction with model-based query formulation;
    they found it satisfactory to have the domain specific terminology
    available for query formulation and they found the model
    interaction somewhat more direct and “quicker” than formulating
    complete text queries. This is encouraging, since it suggests
    that we have realised some of the main principles, even though the
    end-users in our evaluation are all trained modellers and experienced
    users of conceptual modelling.

     From the evaluation results, we see a trend that the model
      performs well for queries that are generic in nature. In addition,
      the results indicate that the model-based queries produce more
      specific results than free text query formulation. This supports the
      overall idea that having the domain terminology available helps
      the user to formulate a domain specific query. In particular, this
      seems to be the case when the information need is vague and the
      user does not have a clear idea of a query formulation.

     We also found that the system performed well for the queries that
      referred to the generic parts of the model. This is encouraging,
      since it indicates that:

      a. The model-based query expansion helps to modify an initially
         generic query into a more specific and relevant query. We have
         defined a default query expansion along the relations and
         abstraction mechanisms of our modelling language. The fact
         that we scored higher for the queries where this expansion
         could take place shows that adequate application of the
         semantic constructs of the modelling language can affect
         retrieval quality.

      b. In general, the quality of the domain model is important for the
         system to work. This implies that the domain model must be
         carefully constructed in order to appropriately reflect both the
         document collection at hand and the information needs of the
         users. We have specified the use of a cooperative modelling
         process, supported by adequate modelling tools, in order for
         the resulting model to become a shared representation of the
         domain among the domain stakeholders. Further, we have
         specified the use of linguistic analysis to define a domain
         model lexicon intended to bridge the gap between the domain
         model concepts and the document texts.
On the other hand, the evaluation also showed that the approach
performs less well in some cases:

     For specific queries where users are familiar with the terminology
      and know how to phrase a text query. If the user already has
      detailed knowledge of the specific terminology, there is no need
      for the model to guide the way.

   For queries with very few relevant documents. If the model-based
    queries return no documents, the model does not give the users
    any more options for experimenting. The only option for this in the
    model-based approach would be to move up in the hierarchical
    parts of the model to create queries that are more generic. For
    text only queries, it is easier to simply experiment with variant or
    alternate words. This was visible in our evaluation where the
    domain was sparse.

   Where the quality of the domain model and domain model lexicon
    is not good enough. In the evaluation, we chose to use an existing
    domain model and a domain model lexicon that was not
    constructed from the document collection that was queried. This
    was done since we wanted to use a real-world domain model and
    since we did not have control over the document collection in the
    comparative search engines. The results and user-feedback
    indicate that in some cases the domain model did not reflect the
    information need and some of the lexicon entries produced
    unwanted documents.
Future work
There is a limit to how much of the specified approach we have been
able to implement and evaluate within the boundaries of this work.
Further, there are related ideas that we would like to explore in the future:

   Further evaluation of the approach is needed. First, in order to
    verify the tendency we detected in the performed evaluation
    experiment. Second, in order to be able to state more accurately
    in what settings the approach will provide a benefit: for what kind of
    users is this a benefit and what level/kind of domain model is
    needed? The results have indicated that the model performs well
    in cases where users do not know the detailed domain
    terminology. It could also be that the approach would prove a
    benefit for other users as well, but that this would need a different
    level of detail in the domain model or a domain model with a
    different focus.

   Our evaluation focused on retrieval quality where we used the
    model as a query formulation interface only. One of the
    assumptions behind our approach was that the model could be
    used in a cooperative setting where the users would describe the
    semantics of documents for the benefit of each other, an
    assumption similar to one of the foundations of the Semantic
    Web. The description or classification part of our system has not
    been tested. Naturally, it would be interesting to measure the
    effect on retrieval quality if documents were also described by the
    users and by way of the domain model, rather than only querying
    an existing index.

     It seems from the results that the domain model quality has an
      impact on retrieval quality. Since we simply applied an existing
      domain model, we did not thoroughly investigate the impact of
      model and modelling language quality on the retrieval task. It
      would be interesting to perform an experiment with various
      domain models, possibly in different modelling languages, in
      order to properly understand the effect of the applied modelling
      languages and model constructs.

     We have not defined applications for all of the RML constructs in
      our approach. Would the classification and retrieval tasks have
      benefited from a more thorough exploitation of all RML constructs?

   In particular, we have not achieved full exploitation of model
     relations. We have based our approach on “free-hand” relation
     naming, with lexicon entries for variant names. For querying,
     users can select among stored relation names. This is the
     “normal” use of relations in domain modelling. However, we have
     not investigated how relations can be exploited for the specific
     purpose of formulating a query that reflects an information need,
     or how relations are best applied in order to reflect the
     documents in the collection. Our linguistic analysis does not
     provide adequate input for relations, neither statistical nor lexical.
     Several advanced strategies exist for detecting relations or term-
     associations from text. We would need to understand how
     relations should best be applied for retrieval and devise a
     linguistic analysis that supports this.

     Realise a tool in support of the cooperative modelling process. As
      mentioned, the evaluation has indicated that the model must
      accurately reflect the needs of the users, for the approach to be
      beneficial. We have not realised such a modelling tool. Other work
      within the Information systems group has investigated such tools
      and work is under way for its realisation.

     • Investigate other areas of application than traditional document
      retrieval. The thesis work has been kept within document
      classification and retrieval. The value of domain modelling and
      ontology building is recognised in a multitude of settings, for
      different application areas. It would be interesting to investigate
      different scenarios for applying an approach where a domain model
      is connected to a domain document collection through a
      corresponding domain model lexicon. As an example from the medical
      domain: new regulations in Norway state that a patient shall get
      access to all documentation that describes his/her condition.
      While several sources for such documentation exist, ranging from
      electronic medical records and medical procedure books to medical
      encyclopaedias, the regulation further states: “the patient is
      entitled to an explanation of the terminology of the provided
      documentation”.
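
As a minimal sketch of the “free-hand” relation naming with lexicon entries for variant names discussed in the items above, the following Python fragment maps variant relation names to one stored name and lets a query be phrased with any of them. The class name, the function name and the medical example triples are hypothetical illustrations; they are not taken from the thesis prototype.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# A model relation as a (concept, relation name, concept) triple.
Triple = Tuple[str, str, str]


@dataclass
class RelationLexicon:
    """Hypothetical lexicon: maps variant relation names to a stored name."""
    entries: Dict[str, str] = field(default_factory=dict)

    def add(self, stored_name: str, *variants: str) -> None:
        # Register the stored name itself and all of its variants.
        for name in (stored_name, *variants):
            self.entries[name.strip().lower()] = stored_name

    def lookup(self, name: str) -> Optional[str]:
        # Return the stored name for a variant, or None if it is unknown.
        return self.entries.get(name.strip().lower())

    def stored_names(self) -> List[str]:
        # The names a user could select among when formulating a query.
        return sorted(set(self.entries.values()))


def find_relations(model: List[Triple], lexicon: RelationLexicon,
                   relation_name: str) -> List[Triple]:
    """Return all triples whose relation matches the name or one of its variants."""
    wanted = lexicon.lookup(relation_name)
    if wanted is None:
        return []
    return [t for t in model if lexicon.lookup(t[1]) == wanted]


# Hypothetical example data from a medical domain model.
lexicon = RelationLexicon()
lexicon.add("treats", "is treatment for", "cures")
lexicon.add("describes", "documents")

model = [
    ("medication", "treats", "condition"),
    ("procedure", "is treatment for", "condition"),
    ("medical record", "describes", "patient"),
]

matches = find_relations(model, lexicon, "cures")
```

In this sketch a query for “cures” matches both the “treats” and the “is treatment for” relations, since the lexicon maps all three names to the same stored relation. How such matches should be weighted or used for retrieval is the open question raised above.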

(Abecker et. al, 2000) Andreas Abecker, Ansgar Bernardi, Ludger van Elst, Rudi Herterich, Christian Houy,
  Stephan Muller, Spyros Dioudis, Gregory Mentzas, Maria Legal, “Workflow-Embedded Organizational
  Memory Access - The DECOR Project”, available at http://serv-4100.dfki.uni- Accessed July 2003.

(Aberer & Read, 1998) Karl Aberer & Brian Read, “11th ERCIM Database Research Group Workshop on
  Metadata for Web Databases”, Sankt Augustin, Germany 25-26 May 1998, Proceedings available at:

(Ackerman, 1994) M.S. Ackerman, “Definitional and Contextual Issues in Organisational and Group
  Memories”, Proceedings of HICSS, 1994.

(Ackerman & McDonald, 1996) Mark S. Ackerman and David W. McDonald. “Answer Garden 2: Merging
  organizational memory with collaborative help.” In Proceedings of the ACM CSCW 1996

(Agichstein et. al, 2000) Eugene Agichtein and Eleazar Eskin and Luis Gravano, “Combining Strategies for
  Extracting Relations from Text Collections”, Proceedings of the 2000 ACM SIGMOD Workshop on Research
  Issues in Data Mining and Knowledge Discovery (DMKD 2000)

(Ahmad, 1994) Ahmad, K. 1994. “Language Engineering and the Processing of Specialist Terminology”, accessed: January 2004

(Aitchison et. al., 2000) Aitchison, J, Gilchrist, A & Bawden, D, “Thesaurus construction and use: a
  practical manual.” London : Aslib. 218p, 2000.

(Alexaki et. al., 2001) S. Alexaki, V. Christophides, G. Karvounarakis, D. Plexousakis, K. Tolle, “The
  ICS-FORTH RDFSuite: Managing Voluminous RDF Description Bases”, 2nd International Workshop on the
  Semantic Web (SemWeb'01), in conjunction with Tenth International World Wide Web Conference
  (WWW10), pp. 1-13, Hong Kong, May 1, 2001.

(Allan, 2000) James Allan “NLP for IR”, Tutorial presented at the NAACL/ANLP language technology joint
  conference in Seattle, Washington, 2000. Slides available at:, accessed: March 2004

(Almo, 2003) Marit Almo, “BIBSYS – Emnefelter” (“Handbook of BIBSYS subject fields” – in Norwegian),
  Available at: (Accessed: April, 2004)

(Amento et al, 2003) Amento, B., Terveen, L., Hill, W., Hix, D., and Schulman, R. “Experiments in Social
  Data Mining: The TopicShop System”, in ACM Transactions on Computer-Human Interaction, 10, 1
  (March 2003)

(Andersen, 1994) Rudolf Andersen, “A Configuration Management Approach for Supporting Cooperative
  Information System Development”, PhD thesis, IDT, NTH, 1994

(Ando et. al, 2000) R. Ando, B. Boguraev, R. Byrd, and M. Neff. “Multi-document summarization by
  visualizing topical contents”, ANLP-NAACL Workshop Automatic Summarization, 2000.

(Angele et. al, 2000) Jürgen Angele, Hans-Peter Schnurr, Steffen Staab, Rudi Studer. “The Times They Are
  A-Changin' - The Corporate History Analyzer.” In: D. Mahling & U. Reimer, Proceedings of the Third
  International Conference on Practical Aspects of Knowledge Management. Basel, Switzerland, October
  30-31, 2000

(Arampatzis et. al., 2001) Arampatzis, A.T., T.P. van der Weide, P. van Bommel and C.H.A. Koster,
  "Linguistically Motivated Information Retrieval", Technical Report, CSI-R9918, University of Nijmegen,
  Holland, 2001.

(Arms, 1995) Arms, W. Y., “Key concepts in the architecture of the digital library”, D-Lib Magazine, July

(Arppe, 1995) Antti Arppe, “Term extraction from unrestricted text”, Proceedings of NODALIDA-95, Article
  available at:, Accessed April 2004

(Auxilio & Nieto, 2003) Auxilio, M. and Nieto, M. “An Overview of Ontologies”, Universidad de las Americas
  Puebla Technical Report, 2003

(Aas and Eikvil, 1999) K. Aas and L. Eikvil. “Text categorisation: A survey.”, Technical report, Norwegian
  Computing Center, June 1999.

(Baca, 1998) Baca, M., “Introduction to Metadata: Pathways to digital information”, Los Angeles, Getty

(Bahl & Mercer, 1976) Bahl, L.R. and Mercer, R.L. “Part of speech assignment by a statistical decision
 algorithm”, in proceedings of IEEE International symposium on information theory, pp. 88-89, 1976.
 Described in (Jurafsky & Martin, 2000)

(Baeza-Yates R. and Ribeiro-Neto, B., 1999) Ricardo Baeza-Yates and Berthier Ribeiro-Neto, “Modern
 Information Retrieval”, Addison-Wesley, 1999.

(Bannon and Bødker, 1997) Bannon, L. and S. Bødker. "Constructing Common Information Spaces" in 5th
  European Conference on CSCW. 1997. Lancaster, UK: Kluwer Academic Publishers

(Bechhofer, 2002) Bechhofer, S. (ed) “ontology language standardization efforts”, Ontoweb Deliverable 4.0,

(Berners-Lee et al, 2001) Tim Berners-Lee, James Hendler, Ora Lassila, “The Semantic Web”, Scientific
 American, May 2001

(Berners-Lee & Miller, 2002) Tim Berners-Lee and Eric Miller “The semantic web lifts off”, ERCIM
 News No. 51, October 2002.

(Borghoff and Pareschi, 1998) U.M. Borghoff and R. Pareschi (eds) “Information Technology for Knowledge
  Management”, Springer Verlag, 1998

(Borgmann, 2000) Christine Borgman, “From Gutenberg to the Global Information Infrastructure: Access to
  Information in the Networked World”, MIT Press, 2000.

(Bowers & Delcambre, 2000) Shawn Bowers and Lois M. Delcambre, “Representing and Transforming
 Model-Based Information”, ECDL 2000 Workshop on the Semantic Web, 21 September 2000, Lisbon,

(Brasethvik, 1998) Brasethvik, T., "A semantic modeling approach to meta-data". Journal of Internet
 Research, 1998. Vol. 8 (No. 5): p. 377-386.

(Brasethvik & Gulla, 1999) Brasethvik, T. and J.A. Gulla. "Semantically accessing documents using
 conceptual model descriptions", in World Wide Web & Conceptual Modeling (WebCM 99). Paris, France,19

(Brasethvik & Gulla, 2002) Brasethvik, T. and J.A. Gulla “A conceptual modelling approach to semantic
 document retrieval”, in proceedings of Advanced Information Systems Engineering (CAiSE), Toronto,
 Canada, 2002.

(Breese et. al , 1998) Breese, J.S., Heckerman, D., & Kadie, C. (1998): “Empirical analysis of predictive
 algorithms for collaborative filtering”, in Proceedings of the Fourteenth Conference on Uncertainty in
 Artificial Intelligence, San Francisco, CA, 1998, pp. 43-52.

(Brickley & Guha, 2002) Dan Brickley and R.V. Guha, “RDF Vocabulary Description Language 1.0: RDF
 Schema”, (accessed: June 2003)

(McBrien et al, 1992) McBrien, P.J., Seltveit, A.H., and Wangler, B. "An Entity Relationship Model extended
 to describe Historical Information", in Proceedings of CISMOD'92. Bangalore, India, 1992

(Brin & Page, 1998) Sergey Brin & Lawrence Page, "The Anatomy of a large-scale hypertextual web search
 engine.", in Proceedings of the Seventh WWW conference, 1998, available at: http://www-

(Brill, 1995) Brill, E.: “Transformation-Based Error-Driven Learning and Natural Language Processing: A
 Case Study in Part of speech Tagging.”, in Computational Linguistics. 21:4. (1995)

(BSCW, 1999) BSCW, "Basic Support for Cooperative Work on the WWW", (Accessed:
 May 1999)

(Bubenko et. al, 1997) Bubenko, J., Boman, Johannesson and B. Wangler, "Conceptual Modeling", Prentice
  Hall, 1997.

(Buckley et al, 1998) C. Buckley, J. Walz, M. Mitra, and C. Cardie. “Using Clustering and SuperConcepts
 Within SMART: TREC-6”. In Proceedings of TREC-6, pages 107-124, 1998

(Butler, 2003) Mark H. Butler, “Is the semantic web hype”, presentation at Manchester Metropolitan
 University, 2003, (accessed July

(Bunge, 1998) Mario Bunge, “The philosophy of science”, Transaction publishers, USA, 1998, ISBN:

(Byrd & Ravin, 1999) Byrd, R. & Ravin, Y. “Identifying and extracting relations from text.” In. NLDB'99, 4th
  International conference on applications of natural language to information systems, 1999

(Carlsen, 1998) Steinar Carlsen, “Conceptual Modelling and Composition of flexible workflow models”, PhD
  Thesis, IDI, NTNU, 1998

(Conklin, 1996) E. J. Conklin “Capturing Organisational Memory”, Group Decision Support Systems, Accessed August, 1999.

(Corcho & Gomez-Perez, 2000a) Oscar Corcho & Asunción Gómez-Pérez “A roadmap to Ontology
 Specification Languages” in proceedings of EKAW, 2000.

(Corcho & Gomez-Perez, 2000b) O. Corcho and A. Gomez Perez. “Evaluating knowledge representation and
  reasoning capabilities of ontology specification languages.” In Proc. of ECAI-00 Workshop on Applications
  of Ontologies and Problem-Solving Methods, 2000.

(Cranefield & Purvis, 1999) S. Cranefield and M. Purvis. “UML as an ontology modelling language.” In
  Proceedings of the Workshop on Intelligent Information Integration, 16th International Joint Conference on
  Artificial Intelligence (IJCAI-99), 1999. Available at http://sunsite.informatik.rwth- CEUR-WS/Vol-23/cranefield-ijcai99-iii.pdf.

(Denny, 2002) Michael Denny “Ontology building: A survey of editing tools”, XML.COM magazine, November, 2002.

(Dingsøyr, 2002) Torgeir Dingsøyr, “Knowledge Management in Medium-Sized Software Consulting
 Companies”, PhD Thesis, IDI, NTNU, 2002.

(Dominque, 1998) Domingue, J. "Tadzebao and WebOnto: Discussing, Browsing, and Editing Ontologies on
 the Web." in 11th Banff Knowledge Acquisition for Knowledge-based systems Workshop. Banff, Canada,

(Eberhart, 2002) Eberhart, A. “A survey of RDF data on the web”, Technical Report, International University
  of Germany, available at:

(Encarta, 1999) Encarta “World English Dictionary”, Developed for Microsoft by Bloomsbury Publishing

(Engels, 2001) Robert Engels, “Ontology Extraction Tool”, Ontoknowledge Project Deliverable, Accessed July 2003.

(Farshchian, 1998) Farshchian, B.A. "ICE: An object-oriented toolkit for tailoring collaborative
  Web-applications" in IFIP WG8.1 Conference on Information Systems in the WWW Environment. Beijing, China,

(Farshchian, 2001) Farshchian, B.A. “A framework for supporting shared interaction in distributed product
  development projects”, Ph.D. thesis, IDI, NTNU, 2001:38

(Farquhar et. al, 1996) A. Farquhar, R. Fikes, W. Pratt, and J. Rice. “The Ontolingua Server: a Tool for
  Collaborative Ontology Construction.” In proceedings of the 10th Knowledge Acquisition Workshop,
  KAW'96, Banff, Canada, November 1996. Also available as KSL-TR-96-26.

(Faure & Nedellec, 1998) Faure D. and Nédellec C. “A Corpus-based Conceptual Clustering Method for
  Verb Frames and Ontology Acquisition.” In LREC workshop on adapting lexical and corpus resources to
  sublanguages and applications, Granada, Spain, 1998.

(Faure & Poibeau, 2000) D. Faure and T. Poibeau. “First experiments of using semantic knowledge learned
  by asium for information extraction task using intex.” In Proceedings of the ECAI 2000.

(Fellbaum, 1998) Fellbaum, Christiane, ed., “WordNet: An Electronic Lexical Database,” MIT Press, May

(Fensel et. al, 1998) Fensel, D., S. Decker, M. Erdmann and R. Studer. "Ontobroker: How to make the web
  intelligent" in 11th Banff Knowledge Acquisition for Knowledge-based systems Workshop. Banff, Canada,

(Fensel et. al, 1999a) Fensel, D., J. Angele, S. Decker, M. Erdmann and H.-P. Schnurr, "On2Broker:
  Improving access to information sources at the WWW",
  broker/o2/o2.pdf, (Accessed: May, 1999 )

(Fensel et. al, 1999b) D. Fensel, J. Angele, S. Decker, M. Erdmann, H.-P. Schnurr, S. Staab, R. Studer, and
  Andreas Witt. “On2broker: Semantic-based access to information sources at the WWW”. In Proceedings of
  the World Conference on the WWW and Internet (WebNet 99), Honolulu, Hawaii, USA, 1999.

(Fensel et. al, 2000) Fensel, D., Horrocks, I., van Harmelen, F., Decker, S., Erdmann, M. and Klein, M. “OIL
  in a nutshell”, In (Ding et. al (eds))

(Fox, C, 1992) “Lexical Analysis and stop-lists”, in Frakes, W., and Baeza-Yates, R. “Information
  Retrieval: Data structures and algorithms”, Prentice Hall, 1992.

(Fensel & Gomez-Perez, 2002) Dieter Fensel and Asunción Gómez-Pérez (eds.) “OntoWeb Deliverable 1.3 -
  A survey on ontology tools”, Ontoweb deliverable 1.3, http://ontoweb.aifb.uni-, 2002

(Fidjestøl, 2003) Arne Dag Fidjestøl, “OntoSearch”, TDT 4215 Project report, IDI, NTNU, 2003

(FirstClass, 1999) FirstClass, "FirstClass Collaborative Classroom", (Accessed:
  May, 1999 )

(Fliedl et. al, 1997) Fliedl, G., C. Kop, W. Mayerthaler, H.C. May and C. Winkler. "NTS-based derivation of
  KCPM Perspective Determiners" in 3rd Int. workshop on Applications of Natural Language to Information
  Systems (NLDB'97). 1997. Vancouver, Ca.

(Frakes & Baeza-Yates, 1992) W.B. Frakes and R. Baeza-Yates, “Information Retrieval: Data structures and
  algorithms”, Prentice Hall, NJ, USA, 1992

(Furnas et al, 1988) G.W. Furnas, S. Deerwester, S.T. Dumais, T.K. Landauer, R.A. Harshman, L.A. Streeter
  and K.E. Lochbaum, "Information Retrieval using a Singular Value Decomposition of a Latent Semantic
  Structure", in "Proceedings of the 11th annual ACM SIGIR conference on Research and Development in
  Information Retrieval", pp. 465-480, 1988

(Furnas et. al, 1987) Furnas, G W, Landauer, T K, Gomez, L M & Dumais, S T. “The vocabulary problem in
  human-system communications.” In: Communications of the ACM, 30 (11), pp. 964-971, 1987.

(Fox, 1992) C. Fox, “Lexical analysis and stop-lists”, in “Information retrieval: data structures and
  algorithms”, pp 102-130, Prentice Hall, USA, 1992

(Galen, 2000) Galen, "Why Galen - The need for Integrated medical systems", http://www.galen-, (Accessed: March 2000)

(Gaizauskas, 2002) Robert Gaizauskas, "An Information extraction perspective on text mining", Euromap
 Text Mining Seminar, 2002

(Gjersvik, 1993) R. Gjersvik, “The construction of information systems in organisation: An action research
 project on technology, organisational closure, reflection and change”, PhD Thesis, ORAL, NTH, 1993

(Glance et. al, 1998) Glance, N., Arregui, D. and Dardenne, M. “Knowledge Pump: Supporting the flow and
 use of knowledge”, in (Borghoff & Pareschi, 1998)

(Goldberg et al, 2000) Goldberg, K., Roeder, T., Gupta, D., & Perkins, C. (2000): “Eigentaste: A Constant
 Time Collaborative Filtering Algorithm”, UCB ERL Technical Report M00/41, August 2000

(Goldstein et. al, 1999) J. Goldstein, M. Kantrowitz, V. Mittal, and J. Carbonell. “Summarizing text
 documents: Sentence selection and evaluation metrics”. In Proceedings of SIGIR, pages 121-128, 1999.

(Gomez-Perez & Macho, 2003) Asunción Gómez-Pérez, David Manzano-Macho. (eds) “Deliverable 1.5: A
 survey of ontology learning methods and techniques” Ontoweb Project Deliverable 1.5,

(Google, 2003) “Google Web Search Features”,

(Gordon & Pathak, 1999) Michael Gordon and Praveen Pathak, “Finding information on the world wide web:
 the retrieval effectiveness of search engines”, Information processing and management, 35(2),
 pp. 141-180, March 1999

(Gruber, 1995) Gruber, T.R., "Towards Principles for the Design of Ontologies used for Knowledge Sharing".
 Human and Computer Studies, 1995. Vol. 43 (No. 5/6): p. 907-928.

(Gruber, 1993) Gruber, T.R., “A translation approach to portable ontology specification”. Knowledge
 Acquisition. #5: 199-220. 1993.

(Gruber, 1991) Gruber, T.R., "Ontolingua - A mechanism to support portable ontologies", Technical Report,
 KSL91-66, Knowledge Systems Lab, Stanford University,

(Gross et. al, 2003) Gross, T., Wirsam, W. and Graether, W. “AwarenessMaps: Visualising Awareness in
 Shared Workspaces.” In Extended Abstracts of the Conference on Human Factors in Computing Systems
 CHI 2003, Fort Lauderdale, Florida, available at:

(Grudin, J, 1994) Grudin J. “Groupware and social dynamics – eight challenges for developers”,
 Communications of the ACM, Vol. 37, no 1, 1994

(Guarino, 1995) Guarino, N., "Ontologies and Knowledge Bases". 1995, IOS Press, Amsterdam.

(Guarino, 1998) Nicola Guarino (ed) “Formal ontology in information systems”, IOS Press, Amsterdam,
 Berlin, Oxford, 1998.

(Guarino et al., 1999) Guarino, N., Masolo, C., and Vetere, G. “OntoSeek: Content-Based Access to the
 Web.” IEEE Intelligent Systems May/June (1999), 70-80.

(McGuinnes, 2003) Deborah L. McGuinness. "Knowledge Representation for Question Answering". In
 Proceedings of the American Association for Artificial Intelligence Spring Symposium Workshop on New
 Directions for Question Answering. Stanford University, Stanford, CA. pages 75-77, AAAI Press, March

(Gundersen, 2000) Dag Gundersen, “Norske synonymer, blå ordbok”, Kunnskapsforlaget, 2003

(Gulla, 1993) Jon Atle Gulla, “Explanation Generation in Information Systems Engineering”, PhD Thesis,
 IDT, NTH, 1993

(Gulla et. al., 2002) Gulla, J. A., Auran, P.G. & Risvik, K.M. “Using Linguistics in a large-scale web search
 engine”, Proceedings of NLDB-2002, Stockholm, Sweden.

(Gulla, 2003) Gulla, Jon Atle, “IR Systems Evaluation”, Course material TDT4215, IDI, NTNU, H2003

(Gulla et. al., 2004) Gulla, J.A., Brasethvik, T., Kaada, H., “A flexible workbench for document analysis and
 text mining.”, accepted for NLDB-2004, Manchester, England.

(Gutwin et. al, 1996) Gutwin, C., Greenberg, S. and Roseman, M. “Supporting awareness of others in
 groupware”. In ACM CHI’96 (short paper suite), 1996.

(Haase et. al., 2004) Peter Haase, Jeen Broekstra, Andreas Eberhart, Raphael Volz, “A Comparison of RDF
 Query Languages”, available at: , accessed: April

(Haddad & Latiri, 2003) H. Haddad, C. Ch. Latiri, “Query Expansion using Crisp Association Rules between
 Terms”, The Second International Workshop on Advanced Computation for Engineering Applications
 (ACEA'2003). Cairo, Egypt, December 21-22, 2003

(Hall & Dowling, 1980) P.A. Hall and G.R. Dowling. “Approximate string matching.” ACM Comput. Surveys,
 (12):pp 381-402, 1980.

(Halpin, 1995) Halpin, T., "Conceptual Schema and relational database design". 2nd. ed. 1995, Sydney,
 Australia: Prentice Hall.

(Harmze and Kirzc, 1998) Harmze, F. A. P. and J. Kirzc. “Form and content in the electronic age.”
 IEEE-ADL'98 Advances in Digital Libraries, Santa Barbara, CA, USA, IEEE, 1998

(Haustein & Pleumann, 2002) Stefan Haustein and Jörg Pleumann, “Is Participation in the Semantic Web
 Too Difficult?”, In The Semantic Web - First International Semantic Web Conference, I. Horrocks and J.
 Hendler (ed.), LNCS, 2342, Springer, Heidelberg, 2002.

(Hawking et. al., 2001) D. Hawking, N. Craswell, P. Bailey and K. Griffiths. “Measuring search engine
 quality.”, Journal of information retrieval, 2001.

(Hearst, 1999) Marti Hearst, “Untangling Text Datamining”, In Proceedings of ACL'99: the 37th Annual
 Meeting of the Association for Computational Linguistics, 1999.

(Heggland, 2002) Jon Heggland: “OntoLog: Temporal Annotation Using Ad Hoc Ontologies and Application
 Profiles”, presented at the European Conference on Digital Libraries, Rome, Italy, September 16-18,

(van Heijst et al, 1998) G. van Heijst et. al, 1998, “The Lessons Learned Cycle”, In Borghoff et al,
  “Information technology for Knowledge Management”, Springer Verlag, 1998

(Heikkila, 1995) Heikkilä, Juha. “ENGTWOL English lexicon: solutions and problems.” In Karlsson, Fred,
 Atro Voutilainen, Juha Heikkilä, and Arto Anttila (editors). Constraint Grammar: a language-independent
 system for parsing unrestricted text, volume 4 of Natural Language Processing. Mouton de Gruyter, Berlin
 and New York, 1995

(Helfin & Hendler, 2000) J. Heflin and J. Hendler, “Searching the Web with SHOE.”, in: Artificial Intelligence
 for Web Search. Papers from the AAAI Workshop. WS-00-01 (AAAI Press, Menlo Park, CA, 2000) 35-40

(Hoffman & Herrmann, 2002) Marcel Hoffmann, Thomas Herrmann “PRomiseE2 - Gathering and Providing
 Situated Process Information in Knowledge Management Applications”, Proceedings of I-Know, 2002.

(Horrocks et al, 2002) Horrocks, Patel Schneider, van Harmelen, 2002: “Layering on the semantic web”, (Accessed July 2003).

(Hu, 1999) Wen-Chen Hu "ApproxSeek: Web Document Search Using Approximate Matching", Paper
 available at, Accessed: April 2004

(Hudon, 1998) Hudon, Michèle, “An assessment of the usefulness of standardized definitions in a
 thesaurus through interindexer terminological consistency measures”, PhD dissertation, University of
 Toronto, Canada, 1998

(Hull and King, 1986) Hull, R. and R. King, "Semantic Database Modeling; Survey, Applications and
 Research Issues". ACM Computing Surveys, 1986. Vol. 19 (No. 3, Sept.).

(Inxight, 2002) Inxight White Paper: “Linguistics adding value to e-publishing and e-content”, (Accessed April 2004)

(Jardine and van Rijsbergen, 1971) Jardine, N., & Van Rijsbergen, C. J. (1971). “The use of hierarchical
  clustering in information retrieval”. Information Storage and Retrieval, 7, 217-240.

(Jin, 2003) Yuhui Jin, Stefan Decker, Gio Wiederhold, “OntoWebber: Building Web Sites Using
  Semantic Web Technologies.” Twelfth International World Wide Web Conference, 20-24 May 2003,

(Jin, 2001) Yuhui Jin, Stefan Decker, Gio Wiederhold, “OntoWebber: Model-Driven Ontology-Based Web
  Site Management.” The 1st International Semantic Web Working Symposium (SWWS'01), Stanford
  University, Stanford, CA, July 29-Aug 1, 2001.

(Jurafski & Martin, 2000) Daniel Jurafsky & James H. Martin “Speech and Language Processing”, Prentice
  Hall, 2000.

(Karlsson, 1995) Karlsson, F., A. Voutilainen, J. Heikkilä and A. Anttila, "Constraint Grammar. A
  language-independent system for parsing unrestricted text". 1995, Berlin, New York: Mouton de Gruyter.

(Karvounarakis et al., 2000) Karvounarakis, G., V. Christophides and D. Plexousakis, "Querying
 semistructured (meta)data and schemas on the web: The case of RDF and RDFS", (Accessed: September 2000)

(Karvounarakis et. al. 2002) G. Karvounarakis, A. Magkanaraki, S. Alexaki, V. Christophides, D. Plexousakis,
  M. Scholl, K. Tolle, “RQL: A Functional Query Language for RDF”, at The Functional Approach to Data
  Management: Modelling, Analyzing and Integrating Heterogeneous Data, P.M.D.Gray, L.Kerschberg,
  P.J.H.King, A.Poulovassilis (eds.), LNCS Series, Springer-Verlag

(Kaada, 2002) Harald Kaada, “Linguistic Workbench for Document Analysis and Text Data Mining”, Masters
  Thesis, IDI, NTNU, June, 2002.

(Katz, 1997) Katz, B. "From Sentence Processing to Information Access on the World Wide Web". in AAAI
  Spring Symposium on Natural Language Processing for the World Wide Web. 1997. Stanford University,
  Stanford CA.

(Klavans & Muresan, 2001) J. Klavans and S. Muresan “Evaluation of the DEFINDER System for Fully
  Automatic Glossary Construction”, in Proceedings of the AMIA Symposium 2001. Washington DC, USA.

(Knuth, 1973) Donald Knuth, “The art of computer programming: Vol 3. Searching and sorting”, Addison
  Wesley, 1973.

(Krogstie, 1995) John Krogstie, “Conceptual Modelling for Computerized Information Systems Support in
  Organizations”, PhD thesis, IDT, NTH, 1995

(Kwok and Chan 1998) Kwok, K.L. and Chan, M. “Improving two stage ad hoc retrieval for short queries”, in
  proceedings of SIGIR, pp. 250-256, 1998

(Lenat, 1995) Lenat, D. B. "Cyc: A Large-Scale Investment in Knowledge Infrastructure." Communications of
  the ACM 38, no. 11 (November 1995)

(LePriol, 1999) Le Priol F, “A data processing sequence (ECoRSe) to extract terms and semantics relations
  between terms”, Human Centered Processes (HCP'99) - 10th Mini EURO Conference, Brest, 22-24
  September 1999

(Levenshtein, 1966) Levenshtein, V.I., “Binary codes capable of correcting deletions, insertions and
  reversals.” Cybernetics and control theory, 10(8), pp. 707-710. Described in (Jurafski & Martin, 2000)

(Lindland, 1993) Odd Ivar Lindland, “A Prototyping Approach to Validation of Conceptual Models in
  Information Systems Engineering”, PhD thesis, IDT, NTH, 1993

(Lingsoft, 2000a) Lingsoft, "Lingsoft Indexing and Retrieval - Morphological Analysis", (Accessed: March 2000)

(Lingsoft, 2000b) Lingsoft, "NORTHES Norwegian Thesauri",
  (Accessed: March 2000)

(Loose and Paris, 1999) Robert M. Losee and Lee Anne Paris, “Measuring search engine quality and query
  difficulty: ranking with target and freestyle”, Journal of the American Society for Information Science,
  50(10), pp. 882-889, 1999

(Lykke-Nielsen, 2002) Marianne Lykke Nielsen, “The word association method: a gateway to work-task
  based retrieval.”, Åbo Akademi University Press. Doctoral dissertation

(Maedche & Staab, 2000) Maedche, A. and Staab, S. “Discovering Conceptual Relations from Text.” In:
 W. Horn (ed.): ECAI 2000. Proceedings of the 14th European Conference on Artificial Intelligence, Berlin,
 August. IOS Press, Amsterdam, 2000, http://www.aifb.uni-

(Magkanaraki et. al, 2002) A. Magkanaraki, G. Karvounarakis, V. Christophides, D. Plexousakis, T. Anh
 “Ontology Storage and Querying”, Technical Report No 308, April 2002.

(Manning & Schutze, 1999) Christopher D. Manning and Hinrich Schutze, “Foundations of Statistical
 Natural Language Processing”, MIT press, 1999.

(Marcus et. al. , 1993) M. Marcus, Beatrice Santorini and M.A. Marcinkiewicz: “Building a large annotated
 corpus of English: The Penn Treebank”. In Computational Linguistics, volume 19, number 2, pp. 313-330.

(Merriam-Webster, 2002) “Merriam-Webster online Dictionary”, Merriam-Webster Inc., Available at:

(Merriam-Webster, 2003) “Merriam-Webster online Dictionary”, Merriam Webster Inc. Available at:

(Miller, 1995) Miller, George A. “WordNet: a lexical database for English.” In: Communications of the ACM
 38 (11), November 1995, pp. 39 - 41.

(Missikof et al, 2002) Michele Missikoff, Roberto Navigli, and Paola Velardi. “The usable ontology: An
 environment for building and assessing a domain ontology.” In I. Horrocks and J. Hendler, editors, The
 Semantic Web — ISWC 2002. Proceedings of the First International Semantic Web Conference, volume
 2342 of Lecture Notes in Computer Science, pages 39–53. Springer-Verlag: Heidelberg, Germany, June

(Mooers, 1950) Mooers, C. N. (1950). “Information retrieval viewed as temporal signaling.”, In Proceedings
 of the International Congress of Mathematicians, Volume 1, pp. 572–573., recited in (Savino & Sebastiani,

(Nakata et al, 1998) K. Nakata, A. Voss, M. Juncke and T. Kreifelts, “Concept Index – social construction
 from documents”, ERCIM News no 35, October 1998, available at

(Navarro, 2001) Gonzalo Navarro, “A guided tour to approximate string matching” ACM Computing Surveys
 Volume 33, Issue 1 (March 2001), pp. 31-88

(Navigli et. al., 2003) Navigli R., Velardi P., Gangemi A. “Ontology Learning and its application to
 automated terminology translation”, IEEE Intelligent Systems, vol. 18:1, January/February 2003.

(Neumann & Schmeier, 2002) G. Neumann and S. Schmeier “Shallow Natural Language Technology and
 Text Mining”, In: Künstliche Intelligenz (KI), the German Journal on Artificial Intelligence, Special Issue on
 Text Mining, T. Joachims, E. Leopold (Eds.), ISSN: 0933-1875, 16 (2002) 2

(Nordgård, 2002) Torbjørn Nordgård, “Part of speech tagger for Norwegian”, Perl program code, Dept. of
 Linguistics, NTNU, Trondheim, 2002.

(Nordkompleks, 2000) “Norwegian Computational Lexicon”, Project home page, (accessed: April 2004)

(Nordhuus & Ree, 2002) Iver Nordhuus, Arnt Ole Ree, “Definisjonskatalog for helsestasjons og
 skolehelsetjenesten”, KITH Rapport 15/02 ISBN 82-7846-140-6, 2002. (In Norwegian)

(Noy & Hafner, 1997) Noy, N. F. and Hafner, C.D. “The State of the Art in Ontology Design”, AI Magazine

(Noy & McGuinnes, 2001) Natalya F. Noy and Deborah L. McGuinness. “Ontology Development 101: A
 Guide to Creating Your First Ontology”. Stanford Knowledge Systems Laboratory Technical Report
 KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, March 2001

(O’Connor et. al, 2001) O'Connor, M., Cosley, D., Konstan, J. A., & Riedl, J. (2001). “PolyLens: A
 Recommender System for Groups of Users.” In Proceedings of ECSCW 2001, Bonn, Germany,
 pp. 199-218. Available at:

(Ogden & Richards, 1923) Ogden C.K. and Richards, I.A. “The meaning of meaning”, Cambridge, 1923.
 Reissue edition: Harvest Books (June 26, 1989) ISBN: 0156584468

(Pan & Horrocks, 2001) Jeff Z. Pan and Ian Horrocks, “Metamodeling Architecture of Web Ontology
 Languages”, Proceedings of the First Semantic Web Working Symposium (SWWS01), pages 131-149.
 CEUR, July, 2001

(Park & Chon, K. 1995) Park, T. and Chon, K., “Collaborative Indexing over Networked Information
 Resources by Distributed Agents.”, In: Distributed Systems Engineering Journal, December, 1995.

(Pepper & Moore, 2001) Steve Pepper & Graham Moore, “XML Topic Maps, XTM 1.0, Specification”, August
  2001, (Accessed July 2003)

(Phillips, 1990) Lawrence Phillips, “The meta-phone algorithm, originally in Computer Programming.”
 December 1990, Described with implementations at:,
 Accessed: April 2004

(Porter, 1980) Porter, M.F, “An algorithm for suffix stripping”, Program 14(3), pp. 130-137. Algorithm
 repeated in (Jurafski & Martin, 2000), Appendix B.

(Radev et. al, 2000) D. Radev, H. Jing, and M. Budzikowska. “Summarization of Multiple Documents:
 Clustering, Sentence Extraction, and Evaluation.”, In Proceedings ANLP/NAACL Workshop on
 Summarization, 2000

(Ribiere & Charlton, 2001) Ribiere, M and Charlton, P. “Ontology Overview”, Motorola Labs Technical
 Report, 2001.

(Robertson and Sparck Jones, 1976) S.E. Robertson and Karen Sparck-Jones, “Relevance weighting of
  search terms”, Journal of the American society for information sciences, 27(3):129-146, 1976

(Rocchio, 1971) J.J. Rocchio “Relevance feedback in information retrieval”, In (Salton, 1971)

(Rosario & Hearst, 2001) Rosario, B. and Hearst, M., “Classifying the Semantic Relations in Noun
 Compounds via a Domain-Specific Lexical Hierarchy”, in the Proceedings of EMNLP '01, Pittsburgh, PA,
 June 2001

(Roseman and Greenberg, 1997) Roseman, M. and Greenberg, S. (1997). “A Tour of TeamRooms”, ACM CHI
  '97 Video Proceedings. Available at, Accessed July 2003.

(Roux et. al., 2000) C. Roux, D. Proux, F. Rechermann, and L. Julliard. “An ontology enrichment method for
  a pragmatic information extraction system gathering data on genetic interactions.”, position paper in
  Proceedings of the ECAI2000 Workshop on Ontology Learning(OL2000), Berlin, Germany. August 2000

(Salton & Buckley, 1988) G. Salton and C. Buckley “Term weighting approaches in automatic retrieval”,
  Information Processing and Management, 24(5):513-523, 1988

(Salton 1971) G. Salton “The SMART Retrieval system – experiments in automated document processing”,
  Prentice Hall, Englewood Cliffs, NJ, 1971

(Sarwar et. al 2000) Sarwar, B.M., Karypis, G., Konstan, J.A., & Riedl, J. (2000): “Application of
 dimensionality reduction in recommender systems - a case study”, in ACM WebKDD 2000 Web Mining for
 E-Commerce Workshop, Boston, MA, 2000.

(Savino & Sebastiani, 1998) Savino, P. and F. Sebastiani “Essential bibliography on multimedia information
  retrieval, categorisation and filtering.”, In Slides of the 2nd European Digital Libraries Conference Tutorial
  on Multimedia Information Retrieval, 1998.

(Sebastiani, 2002) Sebastiani, F., “Automated text categorisation: Tools, Techniques and Applications”.
  Text indexing seminar, Centre National de Recherche Technologique, Rennes, France, 2002.

(Schiller, 1996) A. Schiller. “Multilingual Finite-State Noun Phrase Extraction”. In Proceedings of the
  ECAI'96, Budapest, Hungary, 1996.

(Schmidt and Bannon, 1992) Schmidt, K. and L. Bannon, "Taking CSCW seriously". CSCW, 1992. Vol. 1 (No.
  1-2): p. 7-40.

(Schneiderman et. al, 1997) Schneiderman, B., D. Byrd and W. Bruce Croft, "Clarifying Search: A
  User-Interface Framework for Text Searches". D-Lib Magazine, January 1997.

(Schwartz, 1998) D.G. Schwartz “Shared semantics and the use of organisational memories for e-mail
  communication”, Journal of internet research, 8 (5), 1998.

244                                                                                                 References
(Scott, 1999) Scott, M., "WordSmith Tools, version 3", Oxford University press, ISBN 0-19-459289-8, (Accessed: March 2004)

(Seltveit, 1994) Seltveit, Anne Helga, "Complexity Reduction in Information Systems Modelling", PhD thesis,
  IDT-Report 1994:8, IDT, NTH, Trondheim, 1994.

(Shimizu et. al., 1997) Susumu Shimizu, Takashi Kambayashi, Shin-ya Sato, Paul Francis “A Framework for
  Multilingual Searching and Meta-information Extraction.”, in proceedings of Internet Society's seventh
  annual conference, INET'97, article available at:,
  accessed April 2004

(Sindre, 1992) Guttorm Sindre, “HICONS: A General Diagrammatic Framework for Hierarchical Modelling”,
  PhD Thesis, IDT, NTH, 1992

(Smith, 2003) Barry Smith, “Ontology and information systems”, Draft published at the Buffalo Ontology
  Site, University of New York at Buffalo, Dept. of philosophy. Available at:, (Accessed: April 2004)

(Soamares et. al, 1999) Soamares de Lima, L., A.H.F. Laender and B.A. Ribeiro-Neto. "A Hierarchical
  Approach to the Automatic Categorization of Medical Documents". in CIKM*98. 1998. Bethesda, USA:

(Soergel, 1996) Dagobert Soergel, “SemWeb: proposal for an open, multifunctional, multilingual system for
  integrated access to knowledge about concepts and terminology.” In: Green, R (ed.). Proceedings of the
  Fourth International ISKO Conference, 15-18 July, 1996, Washington, DC, USA. Frankfurt/Main: Indeks
  Verlag, 1996. 165 – 173.

(Soergel, 2003) Dagobert Soergel, “Thesauri and Ontologies in Digital Libraries” 2 part tutorial, at
  European Conference of Digital Libraries, ECDL, 2003

(Sonnenreich, 1999) Wes Sonnenreich, “A History of Search Engines”, Wiley & Sons, Available at:, Accessed April 2004

(Sparck-Jones, 1999) Sparck-Jones, K., "What is The Role of NLP in Information Retrieval?", in Natural
  Language Information Retrieval, T. Strzalkowski, Editor. 1999, Kluwer Academic Publisher.

(Sparck Jones & Endres-Niggermeyer, 1995) Karen Sparck Jones, Brigitte Endres-Niggemeyer: “Automatic
  Summarizing”. Inf. Process. Manage. 31(5): 625-630 (1995)

(Spek and Spijkervet 1997) B. R. Van der Spek and A.L. Spijkervet “Knowledge Management, Dealing
  intelligently with knowledge”, Kenniscentrum CIBIT, 1997

(Srikant & Agrawal, 1995) Ramakrishnan Srikant and Rakesh Agrawal, “Mining Generalized Association
  Rules”, in Proc. of the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland, September 1995.

(Strzalkowski et. al, 1998) Strzalkowski, T., G. Stein, G. Bowden-Wise, J. Perez-Caballo, P. Tapanainen, T.
  Jarvinen, A. Voutilainen and J. Karlgren. "Natural Language Information Retrieval - TREC-7 report". in
  TREC-7. 1998.

(Stahl, 2000) Gerry Stahl, "WebGuide: Encouraging & Supporting Collaborative Knowledge-Building" poster
  & presentation at AERA 2000 Conference, New Orleans, April 2000

(Steinholm, 2001) Steinholm, A., "Automatisk Graf-Utlegg (Automatic layout-generation of graphs - in
  norwegian)", Spring Project Report, IDI NTNU,

(Steinholm, 2001) Steinholm, A., “Utvikling av brukergrensesnitt for modellbasert dokumentsøkesystem
  (Development of a user interface for a model-based document retrieval system - in Norwegian)”,
  Masters thesis report, IDI, NTNU, 2001

(Stolz et al. 1965) Stolz, W.S., Tannenbaum, P.H. and Carstensen, F.V. “A stochastic approach to the
  grammatical coding of English”, Communications of the ACM 8(6), pp 399-405, 1965

(Storey et. al, 2002) Margaret-Anne Storey, Casey Best, Jeff Michaud, Derek Rayside, Marin Litoiu, and
  Mark Musen, “SHriMP Views: An Interactive Environment for Information Visualization and Navigation”
  Proceedings of ACM CHI conference on human factors in computing, 2003.

(Strzalkowski, 1999) Strzalkowski, T., "Natural Language Information Retrieval". 1999: Kluwer Academic

(Strzalkowski et. al., 1997) Strzalkowski, T., F. Lin and J. Perez-Carballo. "Natural Language Information
  Retrieval TREC-6 Report". in 6th Text Retrieval Conference, TREC-6. 1997. Gaithersburg, November, 1997.

(Su et. al., 2002) Su, X., Hakkarainen, S. E. and Ilebrekk, L. “Evaluating Ontology Engineering Languages
  and Systems”, in press:

Semantic modelling of documents                                                                        245
(Sullivan, 2001) Dan Sullivan, “Document warehousing and text mining”, Wiley Computer Publishing, ISBN
  0-471-39959-0, 2001

(Sullivan, 2001b) Danny Sullivan “Search feature chart – for searchers”, SearchEngineWatch, Article
  available at:, (accessed: April 2004)

(Sure et. al. , 2002) York Sure, Michael Erdmann, Juergen Angele, Steffen Staab, Rudi Studer, Dirk Wenke,
  “OntoEdit: Collaborative Ontology Development for the Semantic Web” Proceedings of the first
  International Semantic Web Conference 2002 (ISWC 2002), June 9-12 2002

(Svenonius, 2000) Elaine Svenonius, “The intellectual foundation for organising information”, Cambridge,
  Mass. MIT Press, 2000.

(Swartout et. al, 1996) Swartout, B., R. Patil, K. Knight and T. Russ. "Ontosaurus: A tool for browsing and
  editing ontologies". in 9th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop. 1996.
  Banff, Canada.

(Sølvberg, 1999) Sølvberg, A. "Data and what they refer to". in Conceptual modeling: Historical perspectives
  and future trends. 1998. In conjunction with 16th Int. Conf. on Conceptual modeling, Los Angeles, CA,

(Sølvberg, 2002) Arne Sølvberg, “Introduction to concept modelling for information systems”, Course
  material, SIF8035/TDT4175, IDI, NTNU, 2002/2004

(Sølvberg, 1991) Ingeborg Sølvberg, Inge Nordbø, Agnar Aamodt. “Knowledge-based information
  retrieval”. Future Generation Computer Systems 7 (1991/1992), pp. 379-390.

(TeamWave, 1999) TeamWave, "TeamWave WorkPlace Overview",, (Accessed:
  May, 1999)

(Telcordia, 2003) Telcordia 2003: " Telcordia Latent Semantic Indexing Software (LSI): Beyond Keyword
  Retrieval",, accessed April 2003

(Terveen & Hill, 2001) Terveen, L.G., and Hill, W.C. “Beyond Recommender Systems: Helping People Help
  Each Other”, in Carroll, J. (ed.), HCI In The New Millennium, Addison-Wesley, 2001.

(Tjoa & Berger, 1993) Tjoa, A.M. and L. Berger. "Transformation of Requirement Specifications Expressed
  in Natural Language into EER Model". in 12th Int. conference on Entity-Relationship Approach. 1993.

(Ungar & Foster, 1998) Ungar, L.H., & Foster, D.P. “Clustering Methods for Collaborative Filtering”, in AAAI
 Workshop on Recommendation Systems, Menlo Park, CA, 1998.

(Uschold, 1996) Uschold, M. "Building Ontologies: Towards a unified methodology". in The 16th annual
 conference of the British Computer Society Specialist Group on Expert Systems. 1996. Cambridge (UK).

(Uschold & Gruninger, 1996) Mike Uschold & Michael Gruninger “Ontologies: Principles, Methods and
 Applications”, Knowledge Engineering Review; Volume 11 Number 2, June 1996

(Vintsyuk, 1968) Vintsyuk, T.K. “Speech discrimination by dynamic programming”, Cybernetics 4(1), pp52-
  57, Described in (Jurafski & Martin, 2000)

(Von Krogh et. al. 2000) Von Krogh, Georg, Kazuo Ichijo and Ikujiro Nonaka. “Enabling Knowledge
  Creation.” Oxford: Oxford University Press, 2000

(Voutilainen, 1995) Voutilainen, A., "A short introduction to the NP Tool",, (Accessed: March 2000)

(Voutilainen 1993) Atro Voutilainen. “NPtool, a detector of English noun phrases.” In Proceedings of the
  Workshop on Very Large Corpora, pages 48--57, 1993

(Voutilainen & Heikkila, 1993) Atro Voutilainen and Juha Heikkilä, “An English Constraint Grammar
  (ENGCG) - a surface-syntactic parser of English”,Udo Fries, Gunnel Tottie and Peter Schneider, Eds.
  (1993), Creating and using English language corpora. Rodopi: Amsterdam and Atlanta. Pages 189-199.

(Vorhees, 2001) E.M. Vorhees, “The philosophy of retrieval evaluation”, ERCIM CLEF2001: Workshop of the
  Cross-Language Evaluation Forum, 2001

(Vorhees, 1998) E.M. Vorhees, “Variations in relevance judgements and the measurement of retrieval
  effectiveness”, Proceedings of SIGIR’98, pp315-323. Melbourne, Australia, 1998

(Vorhees & Tice, 1999) Ellen M. Vorhees and Dawn M. Tice, “The Trec-8 question answering track
  evaluation”, NIST special publication, available at:

(Voss et. al, 1999a) Voss, A., K. Nakata, M. Juhnke and T. Schardt. "Collaborative information management
  using concepts". in 2nd International Workshop IIIS-99. 1999. Copenhagen, DK: Postproceedings
  published by IGP

(Voss et al, 1999b) A. Voss, K. Nakata & M. Juhnke, “Concept Indexes: Sharing knowledge from
 documents”, in Schwartz, Divitini & Brasethvik (eds) “Internet-based organisational memory and knowledge
 management”, IDEA Group publishing, 1999 (from proceedings of 2nd workshop on internet based
 information systems IIS99).

(Warmer & Kleppe, 1998) Jos Warmer and Anneke Kleppe, “The Object Constraint Language: Precise
 Modeling with UML”, Addison-Wesley, 1998

(Weibel et. al, 1995) Stuart Weibel, Jean Godby, Eric Miller, Ron Daniel “OCLC/NCSA Metadata Workshop
 Report” 1st Dublin Core report, (accessed july 2003)

(Weibel , 1995) Weibel, S. “Metadata: The foundations of resource descriptions”, D-Lib Magazine, July

(Winograd & Flores, 1986) T. Winograd & F. Flores “Understanding Computers and cognition”, Addison
 Wesley, 1986.

(Willett, 1988) Peter Willett, “Recent trends in hierarchic document clustering: a critical review.”, in
 Information Processing and Management: an International Journal, v.24 n.5, p.577-597, 1988

(Willumsen, 1991) Geir Willumsen, “Executable Conceptual Models in Information Systems Engineering“,
 PhD Thesis, IDT, NTH, 1991

(W3C-DAML+OIL, 2001) Dan Connolly, Frank van Harmelen, Ian Horrocks, Deborah L. McGuinness, Peter F.
 Patel-Schneider and Lynn Andrea Stein, “DAML+OIL Reference Description”, W3C-Note,, accessed July 2003.

(W3C-IsaViz, 2003) W3C-RDF development, “IsaViz: A Visual Authoring Tool for RDF”, available at:, accessed: April: 2004

(W3C-Metadata, 1997) “W3C Activity Group of Metadata and Resource Description.”,, Superseded by the W3C Semantic Web activity.

(W3C-OWL, 2003) Deborah McGuinness and Frank van Harmelen, “OWL Web Ontology Language Overview”,
 W3C working draft, Available at :, Accessed july 2003

(W3C-OWLREQ, 2003) Mike Dean & Guus Schreiber (eds) “OWL Web Ontology Language Reference”,, Accessed July 2003.

(W3C-RDF, 1999) Ora Lassila & Ralph Swick, “Resource Description Framework (RDF) Model and Syntax
 Specification”, W3C Recommendation 22 February 1999,

(W3C-RDFS, 2002) Dan Brickley & R.V. Guha, “RDF Vocabulary Description Language 1.0: RDF Schema”,
 W3C Working Draft 30 April 2002,

(W3C-SW, 2003) W3C Semantic Web Activity,, Accessed July 2003.

(W3C-XML, 1996) W3C XML, “Extensible Markup Language (XML)”,, Accessed:
 April 2004

(W3C-XPATH, 1999) James Clark & Steve DeRose (eds) “XML Path Language (XPath) Version 1.0”, W3C
 Recommendation 16 November 1999, available at:, accessed april 2004.

(W3C-XSLT, 1999) James Clark (ed) “XSL Transformations (XSLT) Version 1.0”, W3C Recommendation 16
 November 1999, available at:, accessed April 2004.

(Yang, 1993), Mingwei Yang, “COMIS: A conceptual model for information systems” Phd thesis. IDT, NTH,

(Yeh et. al, 2003) Iwei Yeh , Peter D. Karp , Natalya F. Noy and Russ B. Altman, “Knowledge acquisition,
  consistency checking and concurrency control for Gene Ontology (GO)“, Bioinformatics Vol. 19 no. 2,
  Pages 241-248, 2003. Article available at:, (Accessed: April 2004)

(Zhdanova, 2002) Anna V. Zhdanova “Automatic Identification of European Languages.” In proceedings of
  NLDB 2002, Stockholm, Sweden, pp76-84

(van Zwol, 2002) Roelof van Zwol, “Modelling and searching web-based document collections”. PhD Thesis
  Dutch Graduate School for Information and Knowledge Systems (SIKS). SIKS Dissertation Series No.

                             Appendix A. Detecting relation names

Our approach to retrieval uses relation names in order to add
semantics to the binary relations of the domain model. Currently, we
allow users to enter relation names free-hand at document
classification time. In order to offer the users suggestions for
relation names, and to obtain relation names that more accurately
reflect the document texts, we specified an approach to automatic
detection of relation names. This approach is not implemented in the
tool today, as it proved to produce awkward names and to be inaccurate
in practice. The sentence analysis was previously described in
(Brasethvik & Gulla, 2001).
The input to this analysis is the output of the linguistic analysis
process described in chapter 6 and of the document-to-model matching.
In particular, we need:
   • An extraction of the document sentences that refer to at least
     two model concepts.
   • The sentences must be tagged by a part-of-speech tagger.
For a relation between two given concepts, the system would extract all
tagged sentences that refer to these two concepts. Our analysis would
then apply a set of semantic sentence rules to extract suggested names
for this relation from the set of sentences.
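
The sentence selection step described above can be sketched as follows. This is a minimal illustration, not the implementation from the tool: the token shape (word form, lemma, POS tag), the function name sentences_for_relation and the toy input are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical token shape for illustration: (word form, lemma, POS tag).
Token = Tuple[str, str, str]
Sentence = List[Token]


def sentences_for_relation(
    sentences: List[Sentence], concept_a: str, concept_b: str
) -> List[Sentence]:
    """Keep only the tagged sentences whose lemmas mention both concepts."""
    selected = []
    for sentence in sentences:
        lemmas = {lemma for _, lemma, _ in sentence}
        if concept_a in lemmas and concept_b in lemmas:
            selected.append(sentence)
    return selected


# Toy input: two already-tagged sentences (the tags are placeholders).
tagged = [
    [("helsetjenesten", "helsetjeneste", "N"), ("hjelpe", "hjelpe", "V"),
     ("pasientene", "pasient", "N")],
    [("helsetjenesten", "helsetjeneste", "N"), ("bedre", "bedre", "V"),
     ("rutinene", "rutine", "N")],
]
candidates = sentences_for_relation(tagged, "helsetjeneste", "pasient")
```

Matching on lemmas rather than word forms is a choice made here so that inflected forms such as “pasientene” still count as a reference to the concept “pasient”.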
As an example, we use the domain model we experimented with in
(Brasethvik & Gulla, 2001) and one of its corresponding documents.
There are 11 sentences in that document that contain the domain
concepts “helsetjeneste” (health service) and “pasient” (patient). One of
the sentences is
     “Medarbeidere i helsetjenesten må stadig utvikle nye muligheter for
       å fremme helse, hjelpe pasientene og bedre rutinene”
     (The employees of the health service must steadily develop new
       methods to promote health, help the patients and improve the
       routines)
and the result of tagging this sentence is shown in figure A.1.
On top of the POS-tagging, some syntactic roles are also added to
the tagged sentence. The relation name is formed from the part of the
sentence linking the two concepts that remains after irrelevant
sentence constituents have been removed. Constituents to remove are,
for example, attributive adjectives (“new”), modal verbs (“must”),
non-sentential adverbs, and most types of sentential adverbs
(“steadily”). We also remove prepositional phrases and paratactic
constructions that do not contain either of the two concepts in
question (“promote health” and “improve the routines”).

Figure A.1. Example sentence. POS tags and syntactic roles are attached before removal.

Phrase-level labels shown above the sentence: NP, VP, VP, VP, VP

Word             POS   Lemma               Tags
medarbeidere     N     "med|arbeider"      <*> N HAN PL UBEST NOM
i                P     "i"                 U <prep>
helsetjenesten   N     "helse\#tjeneste"   <upl> N HAN SG BEST NOM
må               V     "måtte"             <modal> V PRS AKT
stadig           ADV   "stadig"            AE POS UK SG UBEST NOM
utvikle          V     "ut|vikle"          V INF AKT
nye              ADJ   "ny"                AE POS UK PL UB NOM
muligheter       N     "mulighet"          N HUN PL UBEST NOM
for              CC                        U <cc>
å                I     "å"                 U <inf>
fremme           V     "fremme"            V INF AKT
helse            N     "helse"             <upl> N HUN SG UBEST NOM
hjelpe           V     "hjelpe"            V INF AKT
pasientene       N     "pasient"           N HAN PL BEST NOM
og               CC    "og"                U <cc>
bedre            V     "bedre"             V INF AKT
rutinene         N     "rutine"            N HAN PL BEST NOM

For the sentence above, the system proposes to use the following words
to form a relation between the concepts (proposed words in bold face):
       “The employees of the health service must steadily develop new
         methods to promote health, help the patients and improve the
         routines”
The rules we applied for sentence part removal are:

     • Removal of attributive adjectives.

     • Removal of prepositional phrases that do not include a
       reference to one of the model concepts.

     • Removal of parallel constructs that do not include a reference to
       one of the model concepts. Example: Patients and old people
       need the health services. (underlined part removed)

     • Removal of adverb constructs, except those containing truth
       values (such as negations). Examples:

         - Health personnel write the journal 2 times a week.

         - Health personnel do not write the journal. (not removed)

     • Removal of all auxiliary and modal verbs.

     • Removal of all subordinate sentences that do not include a
       reference to both concepts. Subordinate sentences that include a
       reference to both concepts are treated separately.
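
The token-level rules above (adjectives, auxiliary and modal verbs, and adverbs other than negations) can be sketched as a simple filter over tagged words. This is only an illustration under stated assumptions: the tag names ADJ, AUX, MODAL and ADV, the NEGATIONS set and the function strip_tokens are hypothetical, and the phrase-level rules (prepositional phrases, parallel constructs, subordinate sentences) are left out because they need syntactic chunking that a flat filter cannot provide.

```python
from typing import List, Tuple

# Hypothetical tag names and negation list, chosen for this illustration only.
REMOVED_TAGS = {"ADJ", "AUX", "MODAL"}
NEGATIONS = {"not", "never", "ikke"}


def strip_tokens(tagged: List[Tuple[str, str]]) -> List[str]:
    """Drop adjectives, auxiliary/modal verbs and non-negating adverbs."""
    kept = []
    for word, pos in tagged:
        if pos in REMOVED_TAGS:
            continue  # attributive adjectives, auxiliary and modal verbs
        if pos == "ADV" and word.lower() not in NEGATIONS:
            continue  # adverbs are removed unless they carry a truth value
        kept.append(word)
    return kept


# English glosses of the examples in the text, with placeholder tags.
first = strip_tokens([
    ("employees", "N"), ("must", "MODAL"), ("steadily", "ADV"),
    ("develop", "V"), ("new", "ADJ"), ("methods", "N"),
])
second = strip_tokens([
    ("personnel", "N"), ("do", "AUX"), ("not", "ADV"),
    ("write", "V"), ("journal", "N"),
])
```

In this sketch the negation survives while the auxiliary “do” is dropped, which mirrors the “(not removed)” example in the list of rules.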
In addition to the example given above, there were 10 other sentences
containing a reference to these concepts. The resulting suggestions are
shown in table A.1. The English “translations” are made simply word by
word and without altering the word sequence, in order to facilitate
the comparison with the original Norwegian sentence structure, which
is the basis for the analysis. As the table shows, most sentences are
discarded, since they do not contain any complete subordinate sentence
that references both concepts. The remaining suggestions are simply
awkward fragments, and represent neither good relation names nor
useful suggestions to present to a user.

250                                                                                                                                                                    Detecting relation names
Table A.1. Examples of suggested relation names

   #    Suggestion
   1    I [helsetjenesten] kan ikke [pasienten] på samme måte gis rett
        (In [the health service] can not [the patient] in the same way be (stated) correct)
   2    Discarded
   3    Derfor er også [helsetjenesten] best tjent med å sette [pasienten] først
        (Therefore are also [health service] best served by putting [patient] first)
   4    Discarded
   5    det blir lettere å rekruttere leger til stillinger i den offentlige [HT] der de høyest prioriterte [P] tas hand om
        (it becomes easier to recruit doctors for positions in the public [health service] where the highest prioritised
        [patients] are taken care of)
   6    Discarded
   7    Discarded
   8    klarer ikke den offentlige [HT] å innfri folks forventninger vil [P] søke andre løsninger
        (can not the public [health service] satisfy people’s expectations will [patient] seek other solutions)
   9    Discarded
   10   Discarded

            Appendix B. Evaluation: Query topic descriptions

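Information need      Min fastlege? (My regular GP?)
Description           Hvem er min fastlege? Hvordan finner jeg ut hvem som
                      er min fastlege? Hvordan får jeg opplysninger om min
                      fastlege? Hvordan kan jeg bytte fastlege?
                      (Who is my regular GP? How do I find out who my
                      regular GP is? How do I get information about my
                      regular GP? How can I change my regular GP?)
Information need      Helsetilbud til gravide? (Health services for pregnant women?)
Description           Hva slags helsetilbud tilbys gravide? Hvilke spesielle
                      tjenester finnes for gravide innen en kommune?
                      (What kind of health services are offered to pregnant
                      women? Which special services exist for pregnant women
                      within a municipality?)
Information need      Hva er en helsestasjon? (What is a health clinic?)
Description           Hvilke tjenester utfører en helsestasjon? Hva er
                      forskjellen på en helsestasjon og et legesenter? Hva er
                      forskjellen på kommunehelsetjenesten og
                      helsestasjonstjenesten? Hvilke oppgaver har
                      helsestasjonen som kommunehelsetjenesten ikke har?
                      (Which services does a health clinic provide? What is
                      the difference between a health clinic and a medical
                      centre? What is the difference between the municipal
                      health service and the health clinic service? Which
                      tasks does the health clinic have that the municipal
                      health service does not have?)
Information need      Kommunehelsetjenesten (The municipal health service)
Description           Hva slags tjenestetilbud tilbyr kommunehelsetjenesten?
                      (What kind of services does the municipal health
                      service offer?)

Information need      Min Pasientjournal (My patient record)
Description           Hva slags informasjon inneholder min pasientjournal?
                      Hvilke opplysninger finnes der? Hvilke rettigheter har jeg
                      som pasient ift. min pasientjournal, f.eks. for å få innsyn i
                      (What kind of information does my patient record
                      contain? What information is found there? What rights
                      do I have as a patient with regard to my patient
                      record, e.g. to get access to)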

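Information need   Helsetjenester for barn og unge? (Health services for children and young people?)
Description        Hva er formålet med en helsestasjon for barn og ungdom
                   spesielt? Hva har vi i Norge i dag, hva må det satses
                   mere på?
                   (What is the purpose of a health clinic for children and
                   youth in particular? What do we have in Norway today,
                   and what needs more investment?)
Information need   Pasient i konsultasjon med lege (Patient in consultation with a doctor)
Description        Hvordan er maktforholdet mellom pasient og lege? Hva
                   kan pasienten kreve i en konsultasjon med legen?
                   (What is the balance of power between patient and
                   doctor? What can the patient demand in a consultation
                   with the doctor?)
Information need   EPJ (EPR, electronic patient record)
Description        Hvilke dokumenter inneholder en elektronisk
                   pasientjournal? Hva slags data foreligger i dag
                   (Which documents does an electronic patient record
                   contain? What kind of data exists today)
Information need   EPJ system (EPR system)
Description        Hvilke leverandører av EPJ systemer benytter norske
                   (Which suppliers of EPR systems do Norwegian)
Information need   Kan jeg ha med ledsager? (May I bring a companion?)
Description        Hvilke pasienter har rett til å ha med ledsager ved
                   konsultasjon hos fastlege?
                   (Which patients have the right to bring a companion to
                   a consultation with their regular GP?)
Information need   Fysioterapitilbud for ungdom (Physiotherapy services for young people)
Description        Har helsestasjoner i Trondheim tilbud om fysioterapi til
                   ungdom, eller i hvor stor grad er det vanlig med slikt
                   (Do health clinics in Trondheim offer physiotherapy to
                   young people, or to what extent is it common with such)
Information need   Barns helsekort (Children's health card)
Description        Hva slags informasjon finnes på helsekortet til barn og
                   (What kind of information is found on the health card
                   of children and)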

254                                               Evaluation: Query topic descriptions
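Information need      Konsultasjoner, fastlege (Consultations, regular GP)
Description           Hvordan påvirker fastlegeordningen antall
                      konsultasjoner hos lege?
                      (How does the regular GP scheme affect the number of
                      consultations with a doctor?)
Information need      Helsestasjon for ungdom (Health clinic for young people)
Description           Hva er en helsestasjon for ungdom? Er dette det samme
                      som en helsestasjon? Tilbyr en HfU noe spesielt?
                      (What is a health clinic for young people? Is this the
                      same as a health clinic? Does an HfU offer anything
                      special?)
Information need      Kontakt fastlege (Contacting the regular GP)
Description           Kan man bestille time til fastlege over internett i
                      (Can one book an appointment with the regular GP over
                      the Internet in)
Information need      Dokumentasjonsplikt i EPJ (Documentation duty in the EPR)
Description           Hvem har dokumentasjonsplikt i forhold til
                      (Who has a documentation duty in relation to)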

                                       Appendix C. Evaluation: Scoring sheet

The evaluation users were given a printed scoring sheet for recording
the different query scores. The scoring sheet was laid out as shown
below, and included the query topic and query description for each
query. The sheets were printed and stapled with one query per page, in
sequence from 1 to 16.
For each of the 4 query strategies, users were to record their scores.
The form allows for recording up to 3 tries per query for each
strategy. In addition, the users could score the included definitions
of the domain concepts provided by the model. In some cases, the
definition would actually answer the query, but no users returned a
score for any

Scoring Sheet
Q#                  Query number
Information need    Query topic – short description
Description         Query topic – long description

        CnSClient (ODIN)    ODIN           CnSClient (ATW)    AllTheWeb
Doc     1    2    3         1    2    3    1    2    3        1    2    3


Score   Meaning
 2      Good, relevant hit
 1      Somewhat relevant
 0      Unrelated, duplicates
-1      Broken link, trash
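
As an illustration of how one completed sheet could be tallied, the sketch below computes a mean score per query strategy. The aggregation by mean, the function name mean_score_per_strategy and the sample numbers are assumptions made for this example; the sheet itself only defines the four strategies, the three score columns and the score scale.

```python
from statistics import mean
from typing import Dict, List, Optional

# Score scale and strategy names as printed on the scoring sheet.
VALID_SCORES = {2, 1, 0, -1}
STRATEGIES = ("CnSClient (ODIN)", "ODIN", "CnSClient (ATW)", "AllTheWeb")


def mean_score_per_strategy(
    sheet: Dict[str, List[int]]
) -> Dict[str, Optional[float]]:
    """Average the recorded scores for each strategy on one sheet.

    Assumption: a strategy has at most three recorded scores, one per
    column on the sheet; strategies without scores map to None.
    """
    result: Dict[str, Optional[float]] = {}
    for strategy in STRATEGIES:
        scores = sheet.get(strategy, [])
        if len(scores) > 3:
            raise ValueError(f"more than 3 scores for {strategy}")
        if any(score not in VALID_SCORES for score in scores):
            raise ValueError(f"score outside the scale for {strategy}")
        result[strategy] = mean(scores) if scores else None
    return result


# Made-up scores for a single query, for illustration only.
example_sheet = {
    "CnSClient (ODIN)": [2, 1, 2],
    "ODIN": [1, 0],
    "CnSClient (ATW)": [2],
    "AllTheWeb": [-1, 0, 1],
}
summary = mean_score_per_strategy(example_sheet)
```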
