Document-level Semantic Metadata using Folksonomies and Domain

              Semantic Web Annotation based on Folksonomies and Domain Ontologies
                                       Hend S. Al-Khalifa
                                    Second Year PhD. Student
                              Learning Technologies Research Group
                                ECS- Southampton University, UK
                                     Hsak04r@ecs.soton.ac.uk


1. Introduction
The World Wide Web is the largest repository of learning resources in the world, covering a
diverse range of topics. Finding a suitable resource for use with a particular class is a
difficult and cumbersome task. In addition, search engines like Google do not always
yield satisfactory results, because they depend on keyword/full-text searching
supported by a variety of algorithms (e.g. PageRank). Their results do not take into
consideration the semantics of the learning resource.
    Recently, new services have appeared on the web that follow a trend of annotating
resources using people’s own words. These systems are called collaborative-tagging,
self-tagging or folksonomies[1]. In these systems people act as metadata generators.
However, the problem of semantics is still not solved by these systems, since
metadata generated by people is not processable by machines unless it conforms to some
schema.
    The core technology of the Semantic Web is machine-processable metadata, and we
believe that the problem of semantics for learning resources can be solved using
Semantic Web technologies (for example, document content can be annotated with
semantic information using domain ontologies).
    The question is how to apply the Semantic Web vision to these new services, where
people tag (i.e. annotate) resources such as web pages, audio files and documents
with keywords for easy discovery and later retrieval.

2. Motivation
Self-tagging techniques are increasing in popularity and many systems apply them;
among these are the popular social bookmarking service del.icio.us[2] and the
photo-sharing service Flickr[3]. Self-tagging is also spreading to platforms such as
operating systems (e.g. “Vista”, the new operating system from Microsoft), to other
services like bookstores (e.g. Amazon asks its customers to tag books) and to search
engines (e.g. Google has embraced self-tagging in its search history and email
services). All these pointers indicate the wide acceptance of tags.
    In particular, our focus will be on social bookmarking services. From our point of
view, social bookmarking services are a gold mine for many web users. They hold
scrutinized, manually selected web resources that people consider important or
interesting enough to come back to later. Thus, bookmarking services can be regarded as
recommendation or voting systems, where people nominate a web resource as being good.
In principle, this differs from search engines, whose results are filled with noise.

[1] Note that the words folksonomy and tag will be used interchangeably throughout this document.
[2] "Del.icio.us is a social bookmarks manager. It allows you to easily add sites you like to your personal
collection of links, to categorize those sites with keywords, and to share your collection not only between
your own browsers and machines, but also with others." From: http://del.icio.us/doc/about
[3] http://www.flickr.com/

     So, how can we convert people’s tags (i.e. folksonomies) into machine-processable
metadata that can be exploited for semantic search? Is it possible to semantically
annotate bookmarked web resources in these systems, using people’s folksonomies as a
guide to finding the corresponding concepts in domain ontologies?
     In fact, we are interested in annotating bookmarked web resources from an
educational perspective; we believe that using ontologies with extra information from
the educational domain (e.g. difficulty level of a resource, prerequisites, etc.) will add
more semantics to a web resource.
     We also believe that these questions and more can be answered by developing a tool
that facilitates the semantic annotation of web resources using people’s metadata
(folksonomies). Therefore, the problem can be formulated as follows: how can we
benefit from folksonomies in an attempt to make parts of the web amenable to
machine processing?

3. Related Work
Semantic annotation is an active research area with a large body of ongoing work.
Systems for semantic annotation are either manual or (semi-)automatic
(Uren et al., 2005). Manual tools such as Annotea[4] require human intervention
to annotate a web resource. Individual annotations may have limited value and are
tedious to produce. On the other hand, most (semi-)automatic semantic annotation
approaches rely on extraction technologies such as wrappers, NLP, etc. (see Figure 1) to
extract the important parts of a document so that it can be annotated. Some extraction
tools are mature and have been applied successfully, e.g. GATE[5].


[Figure 1: The three main components of semantic annotation process. Diagram labels:
Web Document; Load content; Extraction Engine (Wrapper, IE Supervised ML,
IE Unsupervised ML, NLP); Map extracted content to ontology; Semantic Annotation.]

There are also proxy-based and browser-based approaches to semantic annotation
(Koivunen et al., 2000). In the proxy-based approach the annotation is stored on a proxy
server; when an annotated web page is visited, the annotation is merged with the web
page on the proxy side and then displayed to the user. The browser-based approach is
slightly different: the browser itself merges the web page with its annotation before
displaying it to the user. It is also possible to save the annotation on the browser side,
but this is less interesting because of its limitations (e.g. the annotation might be
updated and changed elsewhere, but the client-side copy will not be updated).
There are many browser-based annotation tools, such as ComMentor and Yawas[6], which
are mentioned in (Koivunen et al., 2000).

[4] http://www.w3.org/2001/Annotea/
[5] GATE stands for General Architecture for Text Engineering; the tool is used for all sorts of language
processing tasks, including Information Extraction in many languages. From http://gate.ac.uk/

3.1 Semantic Annotation Tools for e-Learning
Few semantic annotation tools exist for annotating learning resources. In a survey of
tools for the semantic annotation of learning materials, Azouaou et al. (2004) evaluated
two semantic annotation tools dedicated to educational purposes: MemoNote and
AnnForum. They reported that the tools meeting most of the requirements are those that
are computational, cognitive and semantic; these requirements were realized in
MemoNote. They also pointed out that general-purpose annotation tools usually provide
domain-independent annotation and thus do not take into consideration the requirements
of special domains.

3.2 Outlook
From the previous overview of the different aspects of semantic annotation (general and
specific purpose), the researcher plans to devise a slightly different, systematic
approach to developing a semantic annotation tool. The system in mind will not adopt the
approach used in the previously reviewed systems (i.e. content-level semantic annotation).
Instead, document-level semantic metadata annotation will be pursued.
    The problem with most automatic semantic annotation tools is that they require a
“man in the middle” process (i.e. extraction technologies), and an extensive amount of
processing time is spent in that phase. Moreover, none of these tools uses
folksonomies as a guide to annotating web resources. The question, then, is whether
people’s metadata (i.e. folksonomies) can help annotate web resources semantically, and
how rich the generated semantic metadata will be. Answering it will help in using the
output of contemporary web services that have tagging as their main asset to create
semantic metadata.
    An important difference between this work and automatic annotation techniques is
that folksonomy annotations are produced by communities, which annotate the meaning
of the document as a whole, whereas automatic annotations are based on content alone.
We hope to show the added value of such community-based annotations.

4. Contribution
The contribution will be the development of an annotation tool that uses folksonomies to
produce machine-processable metadata. The novelty resides in showing how the two named
domains (folksonomies and the Semantic Web) can be bridged. Doing so will replace, or
at least alleviate to some extent, the burden of using any of the extraction techniques
to help in annotating web resources. Again, note that our system will not annotate the
content of a web resource; instead it will assign document-level semantic metadata to
the web resource as a whole.

[6] http://www.fxpal.com/people/denoue/yawas/

4. Research Plan
The research plan is twofold. Firstly, an experiment was carried out to measure the
semantics of folksonomies compared to automatic keyword extraction. This step is
important since folksonomies are thought of as keywords that represent what a document
is about, and we want to differentiate between automatic keyword extraction and
folksonomies by showing that folksonomies hold more semantic value than automatically
extracted keywords. The system we used to generate automatic keywords was the Yahoo
Term Extraction[7] service.
    The experiment was done in three phases. The first phase measured the overlap
between the folksonomy set and the automatically generated keyword set. In the second
phase, a human indexer was asked to generate a set of keywords for a sample of web
resources from our data set, and the generated set was compared to the folksonomy and
keyword sets to measure the degree of overlap. In the final phase, a sample of the two
sets (folksonomy and keywords) was shown to the indexer, who evaluated which set
carries more semantic value.
    After completing the three phases, the results clearly showed that folksonomies hold
more semantic value than keywords produced by machine extraction techniques. The
experiment also showed the percentage of overlap between folksonomies and
automatically extracted keywords for a given document.
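
The overlap measurement in the first phase can be illustrated with a minimal sketch. The source does not state which overlap measure was used, so the Jaccard coefficient and the per-set coverage ratios below are assumptions, as are the function name overlap_stats and the sample terms.

```python
def overlap_stats(folksonomy_tags, extracted_keywords):
    """Compare two term sets after trimming whitespace and lower-casing.

    Assumption: overlap is measured on exact string matches; the source
    does not specify the matching rule or the overlap measure.
    """
    tags = {t.strip().lower() for t in folksonomy_tags if t.strip()}
    keywords = {k.strip().lower() for k in extracted_keywords if k.strip()}
    shared = tags & keywords
    union = tags | keywords
    return {
        "shared": sorted(shared),
        # Jaccard coefficient: shared terms over all distinct terms.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Share of each set that also appears in the other set.
        "tags_covered": len(shared) / len(tags) if tags else 0.0,
        "keywords_covered": len(shared) / len(keywords) if keywords else 0.0,
    }


# Hypothetical tags and keywords for one bookmarked page.
stats = overlap_stats(
    ["CSS", "webdesign", "tutorial", "layout"],
    ["css", "style sheets", "layout", "web page"],
)
print(stats)
```

Reporting the coverage of each set separately, in addition to a symmetric measure, makes it visible when one set is much larger than the other.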
    Secondly, another experiment will be carried out to generate semantic metadata using
folksonomies (Figure 2). First, all tags assigned to a web resource in the del.icio.us
service (our data source) need to be collected and processed using several techniques,
such as converting plurals to singular, making a best guess to eliminate substrings, and
relating abbreviations and acronyms to full words. This processing helps clean up the
noise in people’s tags.
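
The tag-cleaning step can be sketched as follows. This is only an illustration under stated assumptions: the plural rule is deliberately naive, "eliminating substrings" is interpreted as dropping a tag that is contained in a longer tag of the same resource, and the abbreviation table and all names (ABBREVIATIONS, singularize, clean_tags) are hypothetical.

```python
# Hypothetical abbreviation table; a real one would be built from the data.
ABBREVIATIONS = {"js": "javascript", "sw": "semantic web"}


def singularize(tag):
    """Naive English plural stripping (assumption: no stemmer is available)."""
    if tag.endswith("ies") and len(tag) > 4:
        return tag[:-3] + "y"
    if tag.endswith("s") and not tag.endswith("ss") and len(tag) > 3:
        return tag[:-1]
    return tag


def clean_tags(raw_tags, abbreviations=ABBREVIATIONS):
    """Lower-case, expand abbreviations, singularize, de-duplicate, and
    drop tags that are substrings of a longer tag in the same set."""
    normalized = []
    for raw in raw_tags:
        tag = raw.strip().lower()
        if not tag:
            continue
        tag = abbreviations.get(tag, tag)
        tag = singularize(tag)
        if tag not in normalized:
            normalized.append(tag)
    # One possible reading of "eliminate substrings": keep the longer tag.
    return [
        t for t in normalized
        if not any(t != other and t in other for other in normalized)
    ]


cleaned = clean_tags(["Tutorials", "tutorial", "JS", "web", "webdesign", "SW"])
print(cleaned)
```

The naive plural rule will mis-handle words such as "news"; a production system would more likely rely on a lemmatizer or a dictionary lookup.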




                        Figure 2: The folksonomy-based annotation tool framework



[7] http://developer.yahoo.net/search/content/V1/termExtraction.html
     After the processing phase, the tags are matched against the different ontologies (in
the first instance three ontologies will be used: a subject ontology, a resource type
ontology and a domain ontology) to associate each tag with the corresponding concept in
the ontology. This process can be done in several batches: each time we find a concept in
an ontology that matches a tag, the tag is dropped from the tag set. In this phase we
plan to use some heuristics to improve the chances of getting good annotation results.
When the annotation process finishes, each generated item of semantic metadata will be
saved in a database (i.e. a triple store) so that we can run semantic searches over it.
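
The batch matching described above can be sketched as follows. This is a simplified illustration, not the planned tool: each ontology is modelled as a plain label-to-concept dictionary, matching is by exact label, the triple store is a Python list, and every identifier (the ex: names, ONTOLOGIES, annotate) is hypothetical.

```python
# Each entry: (ontology name, predicate to use, {label: concept identifier}).
# All labels and ex: identifiers are made up for illustration.
ONTOLOGIES = [
    ("subject", "ex:hasSubject", {"javascript": "ex:JavaScript"}),
    ("resource type", "ex:hasResourceType", {"tutorial": "ex:Tutorial"}),
    ("domain", "ex:hasDomainConcept", {"layout": "ex:PageLayout"}),
]


def annotate(resource_url, tags, ontologies=ONTOLOGIES):
    """Match tags against each ontology in turn (one batch per ontology).

    A tag that matches a concept yields a (subject, predicate, object)
    triple and is dropped from the remaining tag set.
    """
    remaining = list(tags)
    triples = []
    for _name, predicate, concepts in ontologies:
        unmatched = []
        for tag in remaining:
            concept = concepts.get(tag)
            if concept is not None:
                triples.append((resource_url, predicate, concept))
            else:
                unmatched.append(tag)
        remaining = unmatched
    return triples, remaining


triples, leftover = annotate(
    "http://example.org/css-guide",
    ["tutorial", "javascript", "layout", "cool"],
)
for triple in triples:
    print(triple)
print("unmatched tags:", leftover)
```

Returning the unmatched tags alongside the triples keeps the information needed later to report how many folksonomy tags were actually used in the annotation.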

5. Evaluation
The evaluation of folksonomy-based semantic annotation guided by an ontology is not
well researched. To the best of the researcher’s knowledge, no established measures
exist that the researcher could use. Thus, the researcher is going to adopt the well-known
measures of precision and recall from the information retrieval community. Whereas
these measures denote perfect agreement, the researcher will additionally define new
measures for sliding agreement, such as the quality of the produced semantic metadata;
this will be based on subjective evaluation.
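
For reference, the standard set-based definitions of precision and recall can be written as a short sketch. The function name precision_recall and the document identifiers are illustrative only; how relevance judgements will be obtained is not specified in the source.

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set of a semantic search versus a judged relevant set.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d5"],
)
print(f"precision={p:.2f} recall={r:.2f}")
```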
In summary, the possible evaluation techniques can be listed as follows:
     • Semantic metadata evaluation
             o Quality.
             o Accuracy.
     • Retrieval measurements (precision and recall)
             o Compared to other semantic annotation systems.
             o Compared to a full-text search engine: Google (globally) or Lucene
                (locally).
     • Others
             o Percentage of used folksonomy tags in the annotation process.
\[
\text{Percentage} = \frac{\text{In\_Folksonomy} - \text{Out\_Folksonomy}}{\text{In\_Folksonomy}} \times 100
\]
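
A small worked example of the percentage measure follows. The source does not define its two terms, so the reading used here is an assumption: In_Folksonomy is taken as the number of tags fed into the annotation process and Out_Folksonomy as the number of tags left unmatched afterwards. The function name used_tag_percentage is hypothetical.

```python
def used_tag_percentage(in_folksonomy, out_folksonomy):
    """Percentage of folksonomy tags used in the annotation process.

    Assumed meaning: in_folksonomy = tags entering the process,
    out_folksonomy = tags left unused after matching.
    """
    if in_folksonomy <= 0:
        raise ValueError("in_folksonomy must be positive")
    if not 0 <= out_folksonomy <= in_folksonomy:
        raise ValueError("out_folksonomy must be between 0 and in_folksonomy")
    return (in_folksonomy - out_folksonomy) / in_folksonomy * 100


# Example: 8 tags entered the process and 2 found no matching concept.
print(used_tag_percentage(8, 2))  # 75.0
```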
6. Conclusion
One of the main technologies of the Semantic Web is the use of metadata to describe
online resources. Attaching meaningful metadata to a resource requires a process
called information/knowledge extraction. The process is time consuming,
complicated and often inaccurate. This suggests using human-generated
metadata (i.e. folksonomies) instead to add meaning to a resource.
    The researcher is trying to show how to merge Semantic Web techniques with
folksonomies produced by social bookmarking services to annotate learning resources
with extra value. In this way, the researcher hopes to overcome the inflexibility of
standard approaches (e.g. LOM) that require humans to explicitly attach metadata to
learning resources. Similarly, the production of metadata will move the burden from the
human producers to an automated process.
References
Azouaou, F. and C. Desmoulins (2005). Semantic Annotation for the Teacher: Models for
       a Computerized Memory Tool. Workshop on Applications of Semantic Web
       Technologies for e-Learning (SW-EL@AIED'05), Amsterdam, The
       Netherlands.
Koivunen, M.-R., D. Brickley, J. Kahan, E. P. Hommeaux and R. R. Swick. (2000). The
       W3C Collaborative Web Annotation Project. or how to have fun while building
       an RDF infrastructure. W3C.
Uren, V., P. Cimiano, J. Iria, S. Handschuh, M. Vargas-Vera, E. Motta and F. Ciravegna
       (2005). "Semantic Annotation for Knowledge Management: Requirements and a
       Survey of the State of the Art." Journal of Web Semantics 4(1): 34.