                          Authoring: In and Out of the Real World
                                 MesMuses 2003, Florence

                                       Andy Dingley

                                       11 April 2010

                 “For magic consists in this, the true naming of a thing.”
                                           - Ursula Le Guin

This paper discusses some issues arising in the field of metadata, as applied to the web
publishing of “exhibits” in on-line museums or concrete museums with on-line catalogues.
Much work has been done on how to do this, yet there is still a lack of interoperability
and of overall advances in performance for the typical user of Google.

Metadata handling for large collections is now well established. The need is identified, as
are transport protocols and useful sets of properties to manage. Techniques such as
faceted metadata and application profiles allow the management of sets of such
properties, so that rich resources may be manipulated by simple-minded cataloguing
tools. I do not discuss the technical issues of encapsulating and transporting metadata by
means such as RDF or Dublin Core. These are assumed to be familiar.

In this far-from-exhausted field, one of the next problems to study is that of the values
used for these properties. Currently this is moving towards re-visiting some complex
issues in AI.

The problem of nominals [3] is increasingly recognised in the fields of categorisation and
metadata description.

In the “classical” XML design, a schema is produced before the data is captured. This
defines the structure for all items, and may also contain a vocabulary of cataloguing
terms: fish, mammals, birds, etc. This schema is then stable throughout the life of the
project. When a record is produced, it is in terms of references to items in this schema.

The problem of nominals begins when we encounter a record that must be catalogued
from outside this schema. Typically this is the first encounter of a category that was not
foreseen when the schema was created.
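
As a minimal sketch of where the problem first shows up, assume the schema's vocabulary
is a fixed set of terms and that each incoming record carries a single category term. The
names `VOCABULARY` and `find_nominals`, and the sample records, are hypothetical and
chosen only for illustration.

```python
# Hypothetical fixed vocabulary, agreed before any data was captured.
VOCABULARY = frozenset({"fish", "mammals", "birds"})


def find_nominals(records):
    """Return the category terms used by records but absent from the schema.

    Each record is assumed to be a dict with a "category" key. Terms are
    returned in order of first appearance, without duplicates.
    """
    seen = []
    for record in records:
        term = record["category"]
        if term not in VOCABULARY and term not in seen:
            seen.append(term)
    return seen


# Made-up records: the second one uses a category the schema never foresaw.
records = [
    {"name": "Barbastelle bat", "category": "mammals"},
    {"name": "An orchid", "category": "orchids"},
    {"name": "Basking shark", "category": "fish"},
]
nominals = find_nominals(records)
```

Here `nominals` holds the single unforeseen term, which the project must then either
reject, add to the schema, or carry as a nominal.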

Three Levels of Solution
Solutions to the issue of nominals are at three levels:
     Recognising the problem
     Techniques for dealing with nominals within a project
     Publishing outside the project.

The first level is now widely recognised, as may be seen from some of the initial titles for
this conference.

The second level is that of developing tools to manage complex descriptions, within the
boundaries of a project. I described some of our approaches here for ARKive in [2].

The third level consists of extending these techniques so that producers or consumers of
metadata outside the project boundary may also benefit from them.

Dependency on nominals
The simplest approach is to forbid nominals. A rigid schema is produced and then that’s
used for all descriptions. Where this is possible, it is a simple solution. Is this schema static ?
Technically it’s easy to amend it, except for the obvious version-management problems.
Where authoring happens within a single system, an expanding schema is workable.
For distributed authors, synchronisation is difficult.

It’s an interesting approach to have no initial vocabulary at all. The system then becomes
entirely self-populating on the nominals, with obvious application for automatic spidering
of new territories.

Combining the two approaches is typical, usually for systems that set out along the first
route, then encounter essential, yet uncategorizable instances. Given the current state of
the art, this is the worst of both worlds.

Three Use Cases
A “use case” is a software design term, used to describe one instance of an interaction
between a system and its user (which may be another program).

It is important for an on-line repository to distinguish between the capabilities of the
system itself and the use cases with the outside world. It is still rare for museum systems
to take any consideration of external interfaces, rather than just simply slapping a
publishing interface on afterwards.

The short-term limits of the system are defined by these use cases, but the long-term
limits are constrained by the internal design. As we shall see for input, this can give rise
to conflicts.

In, out and authoring
A rich metadata repository has three use cases of interaction with the outside world.
These are input, output and authoring.

Output is the most obvious. A content and metadata repository is constructed; it may
either respond to specific requests for metadata [8], or the metadata may be included with
the published content. Publication can be either of the “polished” content to the public, or
for various house-keeping roles for the staff.

Input is less essential, but equally obvious in function. Many new systems will be
constructed that load large pre-existing data sets from other systems. This varies in
importance, depending on the availability of suitable datasets. For ARKive, this was a
significant potential source of data. Groups such as the WCMC (World Conservation
Monitoring Centre) had many lists available that were broad, if shallow.

Authoring is less obvious. The need itself is obvious, but its separation through an inter-
system interface is less so.

Issues in publishing
Output is usually done by embedding within the published output, but the metadata may
also be accessed independently. Harvesters read this in a manner akin to metadata-only
web crawlers.

There may also be future developments in remote querying. Protocols for distributed
querying (e.g. Z39.50) have existed for years, yet their implementation has been rare in
the web era. Recent interest in XML-based web services and agent systems may
encourage this.

Supporting harvesters is a very low-cost addition to a system, compared to support for

Inevitability of nominals in publishing
Publishing to the public audience has an impact on the nominals problem - effectively
everything there becomes a nominal. If the schema is unavailable to a consumer from
outside the system (or the consumer simply hasn’t troubled to obtain it) then there’s
effectively no difference between references to a vocabulary or to nominals.

For the foreseeable future, sharing of schemas or ontologies (on a per-access basis) is
unlikely. Only if the schemas referenced are public and well-known (typically those like
MESH or Getty TGN, well known from the DC good-practice guides) is there any
practical hope of them being used or usable.

By and large, published metadata from the current practice of large and complex systems
is no more than unreadable, un-processable “metadada”1.

Difficulties with ARKive’s published metadata
Metadata is not a publishing problem, it is a communication problem. Unless it is
communicable, i.e. it can be understood by other systems outside the project, then it is


1 From the Dadaist art movement of the early 20th century, and their deliberate choice of
a meaningless title.

This is a sample of metadata published by the ARKive project [6].
<title>ARKive species | Barbastelle bat | Barbastella barbastellus | Overview</title>

  <META   name="WPC"   content="@CHAPTER_UK">
  <META   name="WPC"   content="@SPEC_Animalia">
  <META   name="WPC"   content="@SPEC_Chordata">
  <META   name="WPC"   content="@SPEC_Mammalia">
  <META   name="WPC"   content="@SPEC_Chiroptera">
  <META   name="WPC"   content="@SPEC_Vespertilionidae">
  <META   name="WPC"   content="@LOC_Europe">
  <META   name="WPC"   content="@HAB_Broadleaved">
  <META   name="WPC"   content="@BEHAV_Carnivorous">
  <META   name="WPC"   content="@BEHAV_Flying">
  <META   name="WPC"   content="@STATUS_I-V">
  <META   name="WPC"   content="@STATUS_CITES II">
  <META   name="WPC"   content="@STATUS_UKBAP Priority">
  <META   name="WPC"   content="@STATUS_WCA">
  <META   name="WPC"   content="@STATUS_Habitats Directive">

<META NAME="DC.Publisher" CONTENT="ARKive">
<META NAME="DC.Language" CONTENT="en-UK">
<META NAME="DC.Format" CONTENT="text/html">
<META NAME="DC.Date" CONTENT="Thu May 08 16:51:24 BST 2003">
<META NAME="DC.Identifier"
<META NAME="DC.Subject" CONTENT="Barbastella barbastellus">
<META NAME="DC.Title" CONTENT="ARKive species | Barbastelle bat | Barbastella
barbastellus | Overview">
<META NAME="DC.Description" CONTENT="ARKive species | Barbastelle bat | Barbastella
barbastellus | Overview">

We immediately note that the metadata is published in two obvious groups.

One is fairly standard Dublin Core. Much of this is useful, but trivial. It does not vary
between species pages.

The second is entirely “proprietary”. It is effectively invisible to any consumers outside
ARKive. Not only is it obscure, but it is also camouflaged by a prefix (and an abbreviated
prefix at that). This renders it even less visible to simple-minded free-text search

In this example from ARKive, we see the original version:
<META name="WPC" content="@SPEC_Chiroptera">
<META name="WPC" content="@SPEC_Vespertilionidae">
<META name="WPC" content="@LOC_Europe">

A minor technical reformatting could improve this. It still meets ARKive’s goals, but also
has rather more public accessibility:
<META name="taxonomy" scheme="DCMIDCSV" content="order=Chiroptera">
<META name="taxonomy" scheme="DCMIDCSV" content="species=Vespertilionidae">
<META name="DC.Coverage.Spatial" scheme="TGN" content="Europe">

“Taxonomy” as a name is still not machine-processable, but at least it’s human readable.
The use of DCSV and “species=” is intended to support the Dublin Core “dumbing down”
principle. An abbreviation like “SPEC” (presumably derived from species taxonomy)
saves only a handful of bytes yet makes the property meaningless to outsiders. Near-
term practice will continue to depend on some human-read tables of properties and
values. Maintaining readable terms in distinct columns, such as “taxonomy, species,
Vespertilionidae”, appears trivial, yet is worthwhile.
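
The reformatting described above is mechanical, and could be sketched roughly as follows.
This is not ARKive's code: the prefix table `PREFIXES`, the rank lookup `RANKS` and the
function `readable_meta` are all hypothetical, and only the two prefixes used in the
example are covered. The rank lookup is needed because the original “@SPEC_” value does
not carry the rank itself.

```python
import re

# Hypothetical mapping from abbreviated prefixes to readable property
# names and schemes; only the two used in the example are listed.
PREFIXES = {
    "SPEC": ("taxonomy", "DCMIDCSV"),
    "LOC": ("DC.Coverage.Spatial", "TGN"),
}

# Hypothetical lookup of taxonomic rank for a taxon name.
RANKS = {"Chiroptera": "order"}

TAG = re.compile(r"^@([A-Z]+)_(.+)$")


def readable_meta(wpc_content):
    """Rewrite one '@PREFIX_Value' string as a more readable META element.

    Values are assumed to be plain words needing no HTML escaping.
    An unknown prefix raises KeyError.
    """
    match = TAG.match(wpc_content)
    if match is None:
        raise ValueError(f"unrecognised WPC value: {wpc_content!r}")
    prefix, value = match.groups()
    name, scheme = PREFIXES[prefix]
    if name == "taxonomy" and value in RANKS:
        value = f"{RANKS[value]}={value}"  # DCSV-style name=value pair
    return f'<META name="{name}" scheme="{scheme}" content="{value}">'


order_tag = readable_meta("@SPEC_Chiroptera")
place_tag = readable_meta("@LOC_Europe")
```

The cost to the owning system is one small lookup table; the published result no longer
depends on an outsider knowing what “SPEC” or “LOC” stand for.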

“DC.Coverage.Spatial” is used because it’s a common public standard and there’s simply
no excuse to go inventing new terms.

An alternative format might have been to combine all the taxonomy entries into one.
This is easy enough with DCSV, but it adds nothing and it’s a change from the original

Yet again we see that minimal forethought gives a solution that’s near-identical, has
identical cost to the “owning” system, yet is more open towards other systems.

Issues in Import
There are two issues in importing: how to import and what to import. “How” involves
mainly the same technical issues as for authoring, discussed below.

When considering the set of records to import, we must consider whether to populate our
own repository with only a few relevant records, or a large number, even if most are
never used any further.

Sparse Databases
Choosing how much to import raises a significant issue for on-line descriptions of existing
artefacts. There will be a small set of “prestige” exhibits for which there is a rich on-line
description. In an efficient technical system, the number of these should become limited
by the authoring costs alone, not system limits or technical implementation costs. There
will also be a large number of exhibits for which the choice must be made to either show a
very sparse description, or to show nothing. In this case, “nothing” excludes even the
existence of the artefact.

In many cases, it is a relatively simple operation to import very large lists of sparse data
(minimal information on each item). There may be existing species databases that can be
imported, or a gallery may already have an asset tracking database with just a bare
accession number and physical location.

This raises a conflict between the use cases and the internal design of the system. It’s a
commonplace and reasonable constraint that the public interface should not show these
sparse exhibits. The only entries visible at all are those at the very highest quality. This
is not, though, any constraint on the internal system, or the internal data to be stored. If
some of this sparse data is not to be made visible, then this is a trivial operation for the
publishing use case to implement, not the central repository. Limiting the repository, or
the import, limits the long-term usefulness of the system.

Most systems will have at least two user groups querying the repository for output;
“public” and “staff”. Even the most sparse data might be of use to this staff use case and
so it is needlessly limiting to exclude it. It is characteristic of databases that adding new
features is very expensive, adding new properties (facts about a record) is expensive, but
adding a great many new rows of data is almost free.

Consider a database of endangered species, and a reasonably sophisticated site
publishing engine [1]. The user interface should show species for which there is an
“adequate” record, i.e. the species name is listed and there is at least either a text
description or an image. The initial target list is a few dozen species, of different families.
A potential data source is then identified, perhaps a list of 600 sea snails and their
distribution as a GIS dataset. Only two of these species are relevant to the target list. In
this case, it’s preferable to simply import the whole list. The logic needed to ignore the
598 irrelevant snails is already part of the publishing system. The publishing
system selects according to a rule of {is a creature, on target list, has description} and this
is sufficient to exclude either un-endangered snails, vendor contact details, or dictionary
entries that also inhabit the database.
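
The selection rule can be sketched as a simple filter applied at publishing time, leaving
the repository itself unrestricted. The `Record` type, its field names and the sample
entries are assumptions made for this illustration; “has description” is read here as the
“adequate” record defined above (a text description or an image).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    name: str
    kind: str                          # e.g. "creature", "vendor", "dictionary"
    description: Optional[str] = None
    image: Optional[str] = None


def publishable(record, target_list):
    """Apply the rule {is a creature, on target list, has description}."""
    has_content = bool(record.description) or bool(record.image)
    return (
        record.kind == "creature"
        and record.name in target_list
        and has_content
    )


def public_view(repository, target_list):
    """Names of the records that the public interface would show."""
    return [r.name for r in repository if publishable(r, target_list)]


# Made-up repository contents: everything is imported, little is shown.
repository = [
    Record("Snail A", "creature", description="An endangered sea snail."),
    Record("Snail B", "creature"),                       # on target list, but sparse
    Record("Snail C", "creature", description="Common."),  # not on target list
    Record("Acme Imaging", "vendor", description="Contact details."),
]
shown = public_view(repository, target_list={"Snail A", "Snail B"})
```

A staff-facing interface could use the same repository with a looser rule, which is the
point of keeping the filter out of the import step.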

The alternative is to trim the dataset on import. This complicates the management and
reification of it. It’s also a reasonable solution for a one-purpose system but unworkable
when this same database has multiple publishing interfaces on it. What should happen if
there is a list of endangered species (2 snails) and a list of the snails of New Guinea (20
snails) ? Should 2 or 20 be imported ? Or 22, with duplication into two separate lists ?!

Querying a sparse database
The “sparse database” approach has risks that must be understood. Consider the species
database again. An interesting query might be “Show me the local rare species”. What
should be returned ? A sensible limit could be to show a maximum of 20 species, from a
100 mile radius. But which species ? Good user experience design might trim this as a
maximum of 5 from each genus to give a more representative view of local biodiversity,
rather than simply taking the nearest, rarest, or alphabetically earliest.
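
One possible reading of these limits is sketched below: filter by radius, cap each genus,
then cap the total. The function name, the tuple layout of `sightings` and the sample data
are hypothetical. Within a genus this sketch still prefers the nearest entries, which is
only one of several defensible choices.

```python
from collections import defaultdict


def local_rare_species(sightings, radius_miles=100, per_genus=5, limit=20):
    """Pick nearby species, capped per genus, then capped overall.

    `sightings` is assumed to be an iterable of
    (species, genus, distance_miles) tuples.
    """
    nearby = sorted(
        (s for s in sightings if s[2] <= radius_miles),
        key=lambda s: s[2],
    )
    taken = defaultdict(int)
    chosen = []
    for species, genus, _distance in nearby:
        if taken[genus] >= per_genus:
            continue  # this genus already has its share
        taken[genus] += 1
        chosen.append(species)
        if len(chosen) == limit:
            break
    return chosen


# Made-up data: eight nearby orchids would otherwise crowd out the bats.
sightings = [(f"orchid-{i}", "Orchis", float(i)) for i in range(1, 9)]
sightings += [
    ("bat-1", "Myotis", 50.0),
    ("bat-2", "Myotis", 60.0),
    ("bat-3", "Myotis", 150.0),   # outside the radius
]
result = local_rare_species(sightings)
```

With the per-genus cap removed, the same query would return eight orchids before any bat
appeared.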

If the database is populated with both a list of significant endangered animals and
another list of rare orchids, then the situation becomes even worse. Unless query design
is done carefully, almost any generalised query is likely to be swamped under a mass of

A query of “Show me the local rare species” makes implicit assumptions. A fuller
statement of it might be “Show me { a small set of, even though this is obviously a
subset } the local rare species { chosen to demonstrate variety at the level of taxonomy }”.
Skewing with too many orchids unbalances the query at a level that is significant. In
contrast, a surfeit of red orchids within the few orchids selected would not affect the
validity of this query.

The key is to realise that the database is fundamentally skewed, and must always be
assumed to be so. Any query that claims to be “representative” must accept this and code
around it. This is not only a problem for our contrived “sparse database” case, but it
becomes significant for any shared database with multiple access operations upon it. It is
especially significant for “statistical” queries – those that offer results such as “90% of
garden bird species are endangered” are obviously impossible to build on a dataset that
only includes a species because it is under threat. This type of database is a catalogue of
more detailed records, not a statistical survey. Querying exists to support navigation, not

Opinions vary as to whether it is possible to remove the effects of skew when coding the
queries, by examining the query alone, or whether it also requires knowledge of how the
database is skewed.

Issues in Authoring
Authoring can take place in a number of contexts; from on-line systems within the project
to distributed editing tools with multiple authors and no access to shared schema etc.

The sophistication of the descriptions produced is constrained by those that the authoring
tools support. All too often these tools are just built around a classical tree structure,
because that is what is best understood and most easily implemented. Authoring is
complex, especially when done to complex vocabularies.

The ARKive experience is that of spending three times as long (elapsed) on authoring
tools, compared to publishing. For a large project, with long time-scales and large
budgets, this may be supportable.

If capable tools are to be available to smaller projects, then they must be external to the
project, either commercial products or open-sourced. The likelihood is that one of these
will use proprietary standards and the other a more open standard, with obvious

Problems of Classical Taxonomy
We have discussed schemas, controlled vocabularies and even nominals. But all of these
come from the same classical tree-structured taxonomy. It is a familiar model to IT
people, and it’s reinforced by the Linnaean taxonomies used by the zoologists.

However, it’s a poor model of the real world.

It’s a poor model of the real world, in an objectivist sense. Very often, it’s a simple
demonstration that real world problems just can’t be mapped onto this simple branching
tree. There may be issues of topology (the tree is not simply branching) or there may be
issues of membership (a member is simultaneously in multiple locations, with varying
degrees of membership).

More subtly, there are also perceptual effects. How is the relationship between exhibits
understood by the audience ? Commonly the taxonomy is used as a navigational tool,
rather than as a primary datum. The cognitive perception of these relations can be more
important than any objective truth (if any is possible).

Cladistic and Phenetic Taxonomy
One of my targets with ARKive was to produce something that could display both
Linnaean and cladistic taxonomies in parallel. The concepts of evolutionary biology are
not widely understood and this would be a good dataset to demonstrate them with.

At a technical level though, they are both similar presentations. Once categorised,
membership of a taxon is clear and the taxonomies are both simply branching. A more
technically interesting taxonomy is that of phenetic taxonomy. In this view, membership
is less clear-cut and relationships are usually judged by some distance-measuring metric.
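
As a rough illustration of a distance-measuring metric, the sketch below compares entries
by the Jaccard distance between sets of observable traits. This is an assumed, generic
technique and not a description of any particular search engine; the trait sets are
deliberately crude and hypothetical.

```python
def jaccard_distance(traits_a, traits_b):
    """Distance in [0, 1]: 0 for identical trait sets, 1 for disjoint ones."""
    a, b = set(traits_a), set(traits_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical trait sets, for illustration only.
TRAITS = {
    "whale": {"aquatic", "large", "fins", "breathes air", "warm-blooded"},
    "basking shark": {"aquatic", "large", "fins", "gills", "cold-blooded"},
    "bat": {"flying", "small", "breathes air", "warm-blooded"},
}


def nearest(name):
    """Return the other entry with the smallest distance to `name`."""
    others = (n for n in TRAITS if n != name)
    return min(others, key=lambda n: jaccard_distance(TRAITS[name], TRAITS[n]))


closest_to_whale = nearest("whale")
```

On these traits the whale sits closer to the basking shark than to the bat, even though
the whale and the bat share a class. Relationships judged by appearance need not follow
descent.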

The current ARKive site uses a (broadly) phenotypical search engine, from Adiuri [9]. A
search for “whale” returns several results, including “Cetorhinus maximus”.
Unfortunately the basking shark is clearly not a whale !

In the past on ARKive, there have been long and heated debates as to whether a search
for “big fish” should return whales too. The outcome was that it should not (for most
searches) but that the interface for 6 year olds might possibly do so in the future. This
was still a heated topic though, and it was not part of the immediate development plans.

It’s thus a little surprising (although understandable) to find that whales are not fish, but
that some fish are now apparently whales. To most people, this seems somehow “worse”.

And that brings us to our final topic, the role of cognitive theory in indexing.

Cognitive Science
We’ve already seen issues around defining a simple classical concept like a “whale”.
Looking over the parapet from how we code classical taxonomies, we find that these
problems are fundamental, not merely aspects of implementation.

In order to permit description by human-imposed categorisations, we must implement
structure by these same means, not just by those that are simplest to implement
computationally. The Ketengban people of New Guinea have no concept for the genus of
“pigeons” [10], but are perfectly capable of describing its members and the relations
between them. Members of the genus are described in relation to an ur-pigeon, a
“prototype” [13] for the abstract genus. In contrast, the Tzeltal of Mexico name only the
genus, not species [12].

A textbook case of prototype theory is that of colour perception. Viewers (especially from
the fashion industry) can never agree on a name for something that is beige. Some
languages do not even distinguish between blue and green. Yet asked to choose the “most
blue” colour chip from a selection, all cultures congregate on the same selection. The
descriptions are fluid, but the prototype is rigid.

Is a dove a pigeon ? Is it like a pigeon ? When the classification depends on relation to a
single member, rather than to membership of a large group, it becomes necessary to
introduce a notion like fuzzy set theory [14]. This fuzziness is fundamental to the
membership, and the definition of membership, not just a choice of implementation or
programming algorithm.
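
A minimal sketch of graded membership relative to a prototype follows. The union and
intersection operators (maximum and minimum of the grades) are those of fuzzy set theory
[14]; the linear fall-off in `membership`, the `cutoff` parameter and the distances from an
“ur-pigeon” are assumptions made purely for illustration.

```python
def membership(distance_to_prototype, cutoff=1.0):
    """Map a distance from the prototype to a grade of membership in [0, 1].

    Linear fall-off is an arbitrary choice made for this sketch.
    """
    if cutoff <= 0:
        raise ValueError("cutoff must be positive")
    grade = 1.0 - distance_to_prototype / cutoff
    return min(1.0, max(0.0, grade))


def fuzzy_and(grade_a, grade_b):
    """Intersection of two fuzzy sets: the smaller grade."""
    return min(grade_a, grade_b)


def fuzzy_or(grade_a, grade_b):
    """Union of two fuzzy sets: the larger grade."""
    return max(grade_a, grade_b)


# Hypothetical distances from an "ur-pigeon" prototype.
pigeon_grade = {
    "feral pigeon": membership(0.0),
    "dove": membership(0.25),
    "basking shark": membership(3.0),
}
```

“Is a dove a pigeon ?” then has the answer 0.75 rather than yes or no, and the grade
belongs to the description itself, not to the code that happens to compute it.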

If membership of a group no longer needs to be rigid, then the definitions of a group can
become intriguing. Jorge Luis Borges’ “Chinese Encyclopedia” lists categories of animals
according to the Celestial Emporium of Benevolent Knowledge:
     those that belong to the Emperor
     embalmed ones
     those that are trained
     suckling pigs
     mermaids
     fabulous ones
     stray dogs
     those included in the present classification
     those that tremble as if they were mad
     innumerable ones
     those drawn with a very fine camelhair brush
     others
     those that have just broken a flower vase
     those that from a long way off look like flies

Although these appear unreal, they have as much validity as research studies of “Things
to take on a picnic” and similar sets [16].

Women, Fire and Dangerous Things
The Dyirbal aboriginal language of Australia uses these terms as noun classifiers [15].

    Bayi      Men, kangaroos, possums, bats, most snakes, most fish, some birds, insects,
              the moon, some spears

    Balan     Women, bandicoots, platypus, echidna, some snakes, some fish, most birds,
              water, fire or the sun, the hairy mary grub, some spears

    Balam     Edible fruit and plants, cigarettes, wine, cake

    Bala      Parts of the body, meat, bees, wind, most trees, grass, mud, stones, some
              spears, all others

The sets appear strange, but have an underlying structure and coherence. This is a
subjective judgement, not an objective one. It may even vary between speakers, as it is
the underlying concept that is shared, not rigid set inclusion. The hairy mary grub gives
rise to a skin irritation that feels like sunburn, so it is classified with the sun and fire.

In the zoological field, the term “mole rat” is applied to many unrelated species, which
share a rodent-like appearance and a habitat of communal underground tunnels.

These notions have been the currency of cognitive science since the ’70s, and have been of
interest to the AI community since. They are still rare in the metadata publishing and
semantic web worlds.

In some fields, particularly for information sharing and for publishing, metadata will
continue to be drawn from the objective taxonomies (with all their faults). For the
internal use of metadata by sites, whether for authoring and content selection, or for user
browsing, it’s my opinion that we should begin to investigate these cognitive models.

   Consider metadata as a dialogue between producers and consumers,
    not as a monologue to an empty theatre.

   Authoring descriptions that meet the developing future standards of metadata will
    require shared 3rd party tools. All groups in the publishing field will need to be
    involved in developing these.

   Tree-structured classical taxonomies are inadequate for descriptions that are both
    detailed and accurate. We must look to fields outside IT for guidance on how to
    develop alternative approaches.

        “The mistaken notions of Naturalists are hereby stricken out of the
        record so as to prevent damage to learning.”
                    T’ien-Kung K’ai-Wu
                    (The Creations of Nature and Man)
                    Sung Ying-Hsing, 1637

[1]    Use of RDF in the ARKive project, In IEEE International Conference on
       Advanced Learning Technologies (ICALT 2001) IEEE Computer Society
       Press, Madison, USA

[2]    Today’s Authoring Tools for Tomorrow’s Semantic Web, Dingley & Shabajee,
       Museums And The Web, Boston, 2002

[3]    DAML+OIL: a Reason-able Web Ontology Language, Ian Horrocks

[4]    The Capture and Tracking of 'Pieces of Information': Necessary Requirements
       for 'Educational' and Rich Repurposing Architectures
       Paul Shabajee and Dave Reynolds

[5]    ARKive launched

[6]    ARKive species page
       Barbastelle bat, Barbastella barbastellus

[7]    Faceted Metadata for Image Search and Browsing, Yee, Swearingen, Li &
       Hearst, 2003

[8]    OAI Open Archives Initiative

[9]    Adiuri Systems

[10]   Ethno-ornithology of the Ketengban People, Indonesian New Guinea,
       Jared Diamond, Folkbiology, 1999, MIT Press, ISBN 0-262-13349-0

[11]   How shall a thing be called?, Brown R., Psychological Review 65:14-21 1958

[12]   Principles of Tzeltal Plant Classification, Berlin, Breedlove & Raven, 1974,
       New York: Academic.

[13]   Prototype Classification and Logical Classification, Rosch, E., New Trends in
       Cognitive Representation, Lawrence Erlbaum Associates, 1983

[14]   Fuzzy Sets, Zadeh, L., Information and Control, 1965 8:338-53

[15]   Women, Fire and Dangerous Things, George Lakoff, 1987, University of
       Chicago Press, ISBN 0-226-46804-6

[16]   Barsalou, L., Ad hoc Categories, Memory and Cognition, 11:211-27, 1983
