Organization and Structure of Information using Semantic Web Technologies
Jennifer Golbeck, Amy Alford, James Hendler
Semantic Web and Agents Project
Maryland Information and Network Dynamics Laboratory
University of Maryland, College Park

Introduction

Today's web has millions of pages that are dynamically generated from content stored in databases. This not only makes managing a large site easier, but is necessary for fully functioning e-commerce and other large, interactive websites. These local databases, in one sense, are not full participants in the web. Though they present normal-looking HTML pages, the databases themselves are not interconnected in any way. Organization X has essentially no way of using or understanding Organization Y's data. If these two want to share or merge information, the database integration would be a fairly significant undertaking. It would also be a one-time solution: if Organization Z entered the picture, a new merging effort would have to be undertaken.

As the web stands, this has not been a significant problem. By design, the web has been a vehicle for conveying information in a human-readable form; computers had no need to understand the content. As dynamic sources of information have become omnipresent on the web, the World Wide Web Consortium has undertaken efforts to make information machine readable. This technology, collectively called the Semantic Web, allows computers to understand and communicate with one another. For site designers, this means data from other sites can be accessed and presented on their own websites, and their own public data can be made easily accessible to anyone. It follows that just as web pages are currently hyperlinked, data can also be linked to form a second web behind the scenes, allowing full across-the-web integration of data. Dynamically generated pages driven by databases are becoming commonplace for most large websites, and even for medium and small ones.
This trend, combined with the proliferation of non-text media files as one of the primary forms of content, poses several problems for the current web architecture. Search engines have difficulty indexing database-driven pages, and cannot directly index the raw data used in the back end. Media searches, such as image or MP3 searches, are notoriously bad, because there is no text from which to extract keywords that could be used to index the media. The interconnected nature of the web does not extend to backend databases or media either. It is usually not possible for a web designer to use information from an external database to drive their own site: the databases are not publicly accessible for queries, nor is the underlying organization of the database apparent.

The Semantic Web is a vision for the future of the World Wide Web that will give meaning to all of this data, as well as making it publicly accessible to anyone who is interested. While some web sites and designers will want to keep their backend data proprietary, many will find it in their interest, for both design and public-interest reasons, to use semantic encodings. This chapter will introduce the Semantic Web, explain how to organize content for use on it, and show several examples of how it can be used. Throughout the discussion, we will describe how these technologies affect the human factors in web design and use.

What is the Semantic Web?

The World Wide Web today can be thought of as a collection of distributed, interlinked documents, encoded (primarily) using HTML. Any person can create their own HTML document, put it online, and point to other pages on the Web. Since the content of these pages is written in natural language, computers do not "know" what is in a page, just how it looks. The Semantic Web makes it possible for machine-readable annotations to be added, linked to each other, and used for organizing and accessing Web content.
Thus, the Semantic Web offers new capabilities, made possible by the addition of documents that encode the "knowledge" about a web page, photo, or database in a publicly accessible, machine-readable form. Driving the Semantic Web is the organization of content into specialized vocabularies, called ontologies, which can be used by Web tools to provide new capabilities. In this section, we present the basic ideas underlying ontologies, what they can be used for, and how they are encoded on the Web.

Vocabularies and Ontologies

An ontology is a collection of terms used to describe a particular domain. Some ontologies are broad, covering a wide range of topics, while others are limited, with very precise specifications about a given area. The general elements that make up an ontology are the following:

• Classes – general categories of things in the domain of interest
• Properties – attributes that instances of those classes may have
• Relationships – the connections that can exist between classes and between properties

Ontologies are usually expressed in logic-based languages, which allows people to use reasoners to analyze the ontologies and the relationships within them. XML (eXtensible Markup Language) allows metadata to be added to documents, but provides no way to connect statements or infer new information from them. For example, if we say "Sister" is a type of "Sibling" and then say "Jen is the sister of Tom," there is no way in XML to automatically infer that Jen is also the sibling of Tom. By encoding these relationships in a logic-based language, these and other more interesting inferences can be made. Reasoners are the tools that use the logical statements to make inferences about classes, properties, instances, and their relationships to each other. These reasoners can be used in advanced applications, such as semantic portals or intelligent web agents.
Making data on the web accessible to these kinds of services and applications, through ontologies and the languages for developing them, is a central focus of the emerging Semantic Web.

Motivations

The "formal" models of a domain enabled by ontologies provide a number of new capabilities, but they also require extra work: entering the metadata appropriately, developing the vocabularies, and so on. To justify the significant added effort required for a good encoding of the semantics behind a given application, users should understand some of the benefits that will become available and the doors that are opened. There are many places where the Semantic Web can improve the way things are done on the web now, as well as add capabilities beyond what is currently available. The following sections enumerate some of the visions for the Semantic Web as put forth by the World Wide Web Consortium's Web Ontology Working Group in a document outlining use cases and requirements for ontologies on the Web (Heflin, 2003).

Web portals

A Web portal is a web site that provides information content on a topic. While the term has become common for full-web search engines such as Google, portals in the traditional sense are also domain-specific pages that do not necessarily have a search feature. The goal is to provide users with a centralized place to find links, newsgroups, and resources on a topic. For portals to work well, they need to be good sources of information to encourage the community to participate in maintaining and updating their content. The same is true for a semantic web portal, where information is well annotated and maintained in a Semantic Web format: users need some motivation to do the markup that makes the site work. The vision for semantic web portals is to not only make them available as online web pages, but also to integrate them into tools. On web pages, users can find resources based on their semantic markup.
To encourage users to create their own metadata, tool integration of portal features is key. For example, if a scientist authoring a paper or web page uses a particular term from an online ontology, the semantic web portal feature should return other sources with similar markup. Results would certainly include related web pages, but they would also provide links to images, video, audio files, or datasets whose content is described by the same term. By providing these sorts of useful information and resources, which could not be found with a standard text-based keyword search, users will be encouraged to mark up their documents so that they may take advantage of the portal. What allows this system to work more fully is the integration of the markup process with the portal. The portal provides the most advantage to users while they are creating their own semantic web documents. Thus, after providing information to the user, the portal itself is extended when the new markup is published. This interactive cycle means that semantic web portals will reach out to incorporate external resources, as well as creating a dense web of semantically interlinked documents.

Multimedia collections

Ontologies can be used to provide semantic annotations for collections of images, audio, or other non-textual objects. Though one may argue that keywords and natural language processing let computers understand text documents, extracting information about media is much more difficult. Though some file formats do contain information about the file and the media, there is no way for a machine to understand what is happening in a picture, or the significance of who is pictured. Ontologies that describe media and its content address this problem. Multimedia ontologies can be of two types: media-specific and content-specific. Media-specific ontologies describe the format of files and related information.
For an image, ontological markup may include file format and file size, plus information about how the image was produced, such as the camera that took the photo or the focal length. Content-specific ontologies allow an author to describe what the media is about. For a photo, this could include the date and time it was taken, where it was taken, who and what is in the picture, and what is happening. For other media, like sound, attributes such as lyrics, chord progressions, or historical information may also be relevant. Data about the contents can also be related to detailed instances declared in other files.

Web site management

Websites for even small organizations can have large collections of documents which fall into many categories. These can include news releases and announcements, papers, forms, contact pages, and downloads. As the number of documents increases, finding them without structure becomes all the more difficult. Even a taxonomy with a strong hierarchical structure can be insufficient. This is clearly seen in web directories, such as Yahoo!, where finding a particular page is difficult, even within a subset of the hierarchy. An ontology-based web site allows users to search and navigate using specific, ontologically defined terms. This makes documents easier to find, and cross-references easier to track down. Later on, this chapter will discuss one website using semantic markup as its foundation.

Design documentation

Documentation of systems is often very complex. Large sets of documents with overlapping scopes present several challenges. Since documents are generally grouped thematically, it is not unusual for several sets of documentation to address different aspects of the same subproblem. For a client who is trying to find data on the subproblem alone, there is sometimes no choice but to navigate through several sets of complex documents.
Even when the desired information is contained in one set, the level of detail can often be overwhelming. Troubleshooting problems on a website, for example, usually demands a less detailed analysis from the user than from the system administrator. Ontologies can be used to build an information model which allows the exploration of documents in a different way. Users can choose to look for specific topics, even if they are small, and see information on that topic, as well as how it connects to the documentation of the encompassing categories. Different levels of abstraction can also be specified, so that varied levels of detail are made available depending on user preferences.

Agents and services

The development of intelligent agents for the web is an area of intense interest. With the evolution of the Semantic Web, the groundwork is being laid that will allow agents to understand web-based information and act upon it. Agent tasks include scheduling and planning (Payne et al., 2002), trust analysis (Golbeck et al., 2003), ontological mapping, and interaction with web services. Web services are sets of functions that can be executed over the web. When services are semantically marked up, they become available for agents to find, compose, and execute in conjunction with data also found on the Semantic Web. Already, there are hundreds of web services, and a fast-growing number of agents and tools (Sirin et al., 2002) that can work with them.

Ubiquitous computing

Ubiquitous computing describes a movement from hard-wired personal computing devices to embedding devices in the environment and making them available to any other wireless device. For these systems to work effectively, each device needs to make itself known to the environment and advertise what types of inputs it requires and what it is able to output.
When agents are introduced to the system, needing to configure a collection of services and devices to accomplish a goal, it is important that they be able to reason over the descriptions of the devices and their capabilities.

Semantic Web Example

Consider the example of making a page about a recent trip to Paris. The page would include some text describing the trip and when it happened, undoubtedly a picture of the travelers in front of the Eiffel Tower, and perhaps links to the hotel where the user stayed, to the City of Paris homepage, and to some helpful travel books listed on Amazon. As the web stands now, search engines would index the page by keywords found in the text of the document, and perhaps by the links included there. Beyond that vague classification, there is no way for a computer to understand anything about the page. If the date of the trip, for example, were typed as "June 25-30", there would be no way for a computer to know that the trip was occurring on June 26th, since it cannot understand dates. For the non-textual elements, such as the photo, computers have no way of knowing who is in the picture, what is happening, or where it occurs. On the Semantic Web, all of this information and more would be available for computers to understand.

A number of research efforts have explored the representation of ontological information on the Web (see references 15, 20, 17, 16). A language called DAML+OIL was released in March 2001 as the result of a joint committee of US and European researchers working together to develop a de facto standard. In November of 2001, the W3C created the Web Ontology Working Group to develop a recommendation based on DAML+OIL. The resulting language, OWL, is emerging as the standard language to use for these applications, and a set of tools for OWL is being produced as part of the W3C process and under both US and EU funding. OWL is based on the Resource Description Framework (RDF) and its extension, RDF Schema.
Using OWL, users can encode the knowledge from the webpage and point to knowledge stored on other sites. To understand how this is done, it is necessary to have a general understanding of how Semantic Web markup works. With OWL, users define classes, much like classes in a programming language. These can be subclassed and instantiated. Properties allow users to define attributes of classes. In the example above, a "Photo" class would be useful. Properties of the Photo class may include the URL of the photo on the web, the date it was taken, the location, references to the people and objects in the picture, as well as what event is taking place. To describe a particular photo, users would create instances of the Photo class, and then fill in values for the Photo's properties. In a simple table format, the data may look like this:

Photo Name: ParisPhoto1
URL: http://www.example.com/photo1.jpg
Date Taken: June 26, 2001
Location: Parc Du Champ De Mars, Paris, France
Person in Photo: John Doe
Person in Photo: Joe Blog
Object in Photo: Eiffel Tower

On the Semantic Web, resources (collectively, Classes, Properties, and Instances of classes) are all given unique names and referred to by their URI (Uniform Resource Identifier). That URI will be the web address of the document containing the code, with a '#' and the name of the object appended to the end. For example, if the document describing the trip is at http://www.example.com/parisTrip.owl, the URI of the photo would be http://www.example.com/parisTrip.owl#ParisPhoto1. Since each resource has a unique name, authors can make reference to definitions elsewhere. In our ontology above, the author can make definitions of the two travelers, John Doe and Joe Blog:

Person Name: JohnDoe
First Name: John
Last Name: Doe
Age: 23

Person Name: JoeBlog
First Name: Joe
Last Name: Blog
Age: 24

Then, in the properties of the photo, these definitions can be referenced.
Instead of having just the string "John Doe", the computer will know that the people in the photo are the same ones defined in the ontology, with all of their properties.

Person in Photo: http://www.example.com/parisTrip.owl#JohnDoe
Person in Photo: http://www.example.com/parisTrip.owl#JoeBlog

It is also possible to make reference to instances defined in other ontologies. In the simple table above, the property "Object in Photo" is listed as the simple string "Eiffel Tower." If someone has created a Paris History Ontology with a formal definition of the Eiffel Tower in another document, the URI of that resource can be used in place of the string. Thus, our property would become something like the following:

Object in Photo: http://www.example.com/parisHistoryOntology.owl#EiffelTower

The benefit of this linking is similar to the reason links are used in HTML documents. If a web page mentions a book, a link to its listing on Amazon offers many benefits. Thorough information about the book may not belong on a page that just mentions it in passing, or an author may not want to retype all of the text that is nicely presented elsewhere. A link passes users off to another site, and in the process provides them with more data. References on the Semantic Web are even better at this. Though the travelers in this example may not know much about the Eiffel Tower, the authors of the Paris History Ontology may have included all sorts of interesting data about the history, location, construction, and architecture of the Eiffel Tower in their definition. By making reference to that definition in the description of the trip, the computer understands that the Eiffel Tower in the photo has all of the properties described in the History Ontology. This means, among other things, that agents and reasoners can connect the properties defined in the external file to our file. Once this data is encoded in a machine-understandable form, it can be used in many ways.
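As a small illustration of the URI scheme described above (using only Python's standard library, not any Semantic Web tooling), a fragment URI of this form splits cleanly into the document address and the resource name:

```python
from urllib.parse import urldefrag

# A resource URI is the document's web address plus '#' and the resource name.
uri = "http://www.example.com/parisTrip.owl#JohnDoe"

document, fragment = urldefrag(uri)
print(document)  # http://www.example.com/parisTrip.owl
print(fragment)  # JohnDoe
```

This is why a tool that encounters a reference like the one above can fetch the defining document (the part before '#') and then look up the named resource within it.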
Of course, the definition of ontologies and resources is not done in simple tables as above. The next section will give a general overview of the capabilities of RDF and OWL, the languages used to formally express semantic data. Once that is established, we will present several examples of how this semantic data can be used to produce and augment traditional web content.

Encoding Information on the Semantic Web

The basic unit of information on the Semantic Web, independent of the language, is the triple. A triple is made up of a subject, a predicate, and an object or value. In the example from the previous section, one triple would have subject "JohnDoe", predicate "age", and value "23". Another would have the same subject, predicate "First Name", and value "John". One detail that has been skipped over so far, however, is the issue of URIs. On the Semantic Web, everything other than strings is represented by its unique URI. Thus, if our ontology is located at http://www.example.com/parisTrip.owl, the triples will be [1]:

Subject: http://www.example.com/parisTrip.owl#JohnDoe
Predicate: http://www.example.com/parisTrip.owl#age
Value: 23

Subject: http://www.example.com/parisTrip.owl#JohnDoe
Predicate: http://www.example.com/parisTrip.owl#firstName
Value: John

In the two examples above, the predicates relate the subject to a string value. It is also possible to relate two resources through a predicate. For example:

Subject: http://www.example.com/parisTrip.owl#ParisPhoto1
Predicate: http://www.example.com/parisTrip.owl#objectInPhoto
Object: http://www.example.com/parisHistoryOntology.owl#EiffelTower

[1] Actually, we slightly simplify the treatment of datatypes, as the details are not relevant to this chapter. Readers interested in the full details of the RDF encoding are directed to Resource Description Framework (RDF), http://www.w3.org/RDF/.
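Triples of this kind are easy to represent in ordinary code. The following sketch (plain Python, with no Semantic Web libraries; the function name is our own) stores the triples above as tuples and looks up everything asserted about a subject:

```python
# A minimal triple store: each statement is a (subject, predicate, object) tuple.
BASE = "http://www.example.com/parisTrip.owl#"

triples = [
    (BASE + "JohnDoe", BASE + "age", "23"),
    (BASE + "JohnDoe", BASE + "firstName", "John"),
    (BASE + "ParisPhoto1", BASE + "objectInPhoto",
     "http://www.example.com/parisHistoryOntology.owl#EiffelTower"),
]

def describe(subject):
    """Return every (predicate, object) pair asserted about a subject."""
    return [(p, o) for (s, p, o) in triples if s == subject]

for predicate, obj in describe(BASE + "JohnDoe"):
    print(predicate, "->", obj)
```

Real RDF stores work on the same principle, although they index the triples for fast lookup by any combination of subject, predicate, and object.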
Each of these triples forms a small graph with two nodes, representing the subject and object, connected by an edge representing the predicate. The information for John Doe is represented in a graph as shown below:

Figure 1: Three triples, rooted with http://www.example.com/parisTrip.owl#JohnDoe as the subject of each

Taking all of the descriptors from the previous section and encoding them as triples produces a much more complex graph:

Figure 2: The graph of triples from the Paris example

As documents are linked together, joining terms, these graphs grow into large, complex, interconnected webs. To create them, we need languages that support the creation of these relationships. Though there are many languages that can be used on the Semantic Web, the most popular is RDF, with OWL emerging as a new, more powerful extension.

RDF and RDFS

The Resource Description Framework (RDF), developed by the World Wide Web Consortium (W3C), is the foundational language for the Semantic Web. It, along with RDF Schema, provides the basis for creating vocabularies and instance data. There are full books written on RDF, and this chapter is far too brief to give thorough coverage of the language. The rest of this section gives a general overview of the syntax of RDF and OWL and their respective features, but is not intended as a comprehensive guide. Links to thorough descriptions of both languages, including an RDF Primer and an OWL Guide, each with numerous examples, are available on the W3C's Semantic Web Activity website at http://w3.org/2001/sw.

Document Skeleton

There are several flavors of RDF, but the version this chapter will focus on is RDF/XML, which is RDF based on XML syntax. The skeleton of an RDF document is as follows:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
</rdf:RDF>

The tag structure is inherited from XML.
The rdf:RDF tag begins and ends the document, indicating that the rest of the document will be encoded as RDF. Inside the rdf:RDF tag is an XML namespace declaration, represented as an xmlns attribute of the rdf:RDF start-tag. Namespaces are convenient abbreviations for full URIs. This declaration specifies that tags prefixed with rdf: are part of the namespace described in the document at http://www.w3.org/1999/02/22-rdf-syntax-ns#.

In the example detailed in previous sections, we made some assumptions. For creating triples about people, we assumed there was a class of things called People with defined properties, like "age" and "first name". Similarly, we assumed a class called "Photo" with its own properties. First, we will look at how to create RDF vocabularies, and then proceed to creating instances.

Defining Vocabularies

RDF provides a way to create instances and associate descriptive properties with each. RDF does not, however, provide syntax for defining Classes and Properties or describing how they relate to one another. To do that, authors use RDF Schema (RDFS). RDF Schema uses RDF as a base to specify a set of predefined RDF resources and properties that allow users to define Classes and restrict Properties. The RDFS vocabulary is defined in a namespace identified by the URI reference http://www.w3.org/2000/01/rdf-schema#, and commonly uses the prefix "rdfs:". This namespace declaration is added to the rdf:RDF tag.

Describing Classes

Classes are the main way to describe the types of things we are interested in. Classes are general categories that can later be instantiated. In the previous example, we want to create a class that can be used to describe photographs. The syntax to create a class is written:

<rdfs:Class rdf:ID="Photo"/>

The beginning of the tag, "rdfs:Class", says that we are creating a Class of things. The second part, "rdf:ID", is used to assign a unique name to the resource in the document.
Names always need to be enclosed in quotes, and class names are usually written with the first letter capitalized, though this is not required. Like all XML tags, the rdfs:Class tag must be closed, which is accomplished with the "/" at the end. Classes can also be subclassed. For example, if an ontology exists that defines a class called "Image", we could indicate that our Photo class is a subclass of it:

<rdfs:Class rdf:ID="Photo">
  <rdfs:subClassOf rdf:resource="http://example.com/mediaOntology.rdf#Image"/>
</rdfs:Class>

The rdfs:subClassOf tag indicates that the class we are defining will be a subclass of the resource indicated by the rdf:resource attribute. The value of rdf:resource should be the URI of another class. The subclass relation is transitive: if X is a subclass of Y, and Y is a subclass of Z, then X is also a subclass of Z. Classes may also be subclasses of multiple classes, which is accomplished by simply adding more rdfs:subClassOf statements.

Describing Properties

Properties are used to describe attributes. By default, Properties are not attached to any particular Class; that is, if a Property is declared, it can be used with instances of any class. Using elements of RDFS, Properties can be restricted in several ways. All properties in RDF are described as instances of the class rdf:Property. Just as classes are usually named with an initial capital letter, properties are usually named with an initial lower-case letter. To declare the "object in photo" property that we used to describe instances of our Photo class, we use the following statement:

<rdf:Property rdf:ID="objectInPhoto"/>

This creates a Property called "objectInPhoto" which can be attached to any class.
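The transitivity of the subclass relation is simple to compute over a set of subClassOf statements. The sketch below (plain Python; the class names beyond Photo and Image are our own, added for illustration) follows the chain of declared superclasses:

```python
# Declared subClassOf statements as (subclass, superclass) pairs.
# "MediaObject" is a hypothetical further ancestor, added for the example.
sub_class_of = [
    ("Photo", "Image"),
    ("Image", "MediaObject"),
]

def superclasses(cls):
    """All classes that cls is a subclass of, following transitivity."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for sub, sup in sub_class_of:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found

print(superclasses("Photo"))  # Photo is an Image and, transitively, a MediaObject
```

An RDFS reasoner performs essentially this computation (among others) when it answers questions about class membership.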
To limit the domain of the property, so it can only be used to describe instances of the Photo class, we can add a domain restriction:

<rdf:Property rdf:ID="objectInPhoto">
  <rdfs:domain rdf:resource="#Photo"/>
</rdf:Property>

The rdfs:domain tag limits which class the Property can be used to describe. The rdf:resource attribute is used the same way as in the subclass restriction above, but here it points to a local resource. Since the Photo class is declared in the same namespace (the same file, in this case) as the "objectInPhoto" property, we can abbreviate the resource reference to just the name.

Similar to the rdfs:subClassOf feature for classes, there is an rdfs:subPropertyOf feature for Properties. In our photo example, the property "person in photo" is a subset of the "object in photo" property. To define this relation, the following syntax is used:

<rdf:Property rdf:ID="personInPhoto">
  <rdfs:subPropertyOf rdf:resource="#objectInPhoto"/>
</rdf:Property>

Sub-properties inherit any restrictions of their parent Properties. In this case, since the objectInPhoto property has a domain restriction to Photo, personInPhoto has the same restriction. We can also add restrictions. In addition to the domain restriction, which limits which classes the property can be used to describe, we can add a range restriction, which limits what types of values the property can accept. For the personInPhoto Property, we should restrict the value to be an instance of the Person class. Ranges are restricted in the same way as domains:

<rdf:Property rdf:ID="personInPhoto">
  <rdfs:range rdf:resource="#Person"/>
</rdf:Property>

Creating Instances

Once this structure is set up, our instances can be defined. Consider the previous instance description that we wrote as plain text:

Person Name: JoeBlog
First Name: Joe
Last Name: Blog
Age: 24

Here, JoeBlog is the subject, and is an instance of the class Person. There are also Properties for age, first name, and last name.
Assuming we have defined the Person class and its corresponding properties, we can create the Joe Blog instance:

<Person rdf:ID="JoeBlog">
  <firstName>Joe</firstName>
  <lastName>Blog</lastName>
  <age>24</age>
</Person>

In the simplest case, the classes and properties we are using are declared in the same namespace as the one where our instances are being defined. If that is not the case, we use namespace prefixes, just as we used rdf: and rdfs:. For example, if there is a property defined in an external file, we can add a prefix of our choosing to the rdf:RDF tag:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:edu="http://example.com/education.rdf#">

Once this new namespace is introduced, it can be used to reference classes or properties in that file:

<Person rdf:ID="JoeBlog">
  <firstName>Joe</firstName>
  <lastName>Blog</lastName>
  <age>24</age>
  <edu:degreeEarned>PhD</edu:degreeEarned>
</Person>

OWL

OWL (the Web Ontology Language) is a vocabulary extension of RDFS that adds the expressivity needed to define classes and their relationships more fully. Since OWL is built on RDF, any RDF graph forms a valid OWL Full ontology. However, OWL adds semantics and vocabulary to RDF and RDFS, giving it more power to express complex relationships. OWL introduces many new features over what is available in RDF and RDFS. They include, among others, relations between classes (e.g. disjointness), cardinality of properties (e.g. "exactly one"), equality, characteristics of properties (e.g. symmetry), and enumerated classes. Since OWL grew out of the knowledge engineering tradition, expressive power and computational tractability were major concerns in the drafting of the language. The features of OWL are well documented online (McGuinness, van Harmelen, 2003), and an overview is given here. Since OWL is based on RDF, the syntax is basically the same, and OWL uses the Class and Property definitions and restrictions from RDF Schema.
It also adds the following syntactic elements.

Equality and Inequality:

• equivalentClass – This attribute is used to indicate equivalence between classes. In particular, it can be used to indicate that a locally defined Class is the same as one defined in another namespace. Among other things, this allows properties restricted to a class to be used with the equivalent class.
• equivalentProperty – Just like equivalentClass, this indicates equivalence between Properties.
• sameIndividualAs – This is the third equivalence relation, used to state that two instances are the same. Though instances in RDF and OWL must have unique names, there is no assumption that those names refer to unique entities or the same entities. This syntax allows authors to create instances with several names that refer to the same thing.
• differentFrom – This is used just like sameIndividualAs, but to indicate that two individuals are distinct.
• allDifferent – The allDifferent construct is used to indicate difference among a collection of individuals. Instead of requiring long lists of pairwise "differentFrom" statements, allDifferent has the same effect in a compact form. AllDifferent is also unique in its use: while the other four attributes described under this heading are used in the definition of classes, properties, or instances, allDifferent is a special class for which the property owl:distinctMembers is defined, linking an instance of owl:AllDifferent to a list of individuals. The following example, taken from the OWL Reference (van Harmelen et al., 2003), illustrates the syntax:

<owl:AllDifferent>
  <owl:distinctMembers rdf:parseType="Collection">
    <Opera rdf:about="#Don_Giovanni"/>
    <Opera rdf:about="#Nozze_di_Figaro"/>
    <Opera rdf:about="#Cosi_fan_tutte"/>
    <Opera rdf:about="#Tosca"/>
    <Opera rdf:about="#Turandot"/>
    <Opera rdf:about="#Salome"/>
  </owl:distinctMembers>
</owl:AllDifferent>

Property Characteristics:

• inverseOf – This indicates inverse properties.
For example, "picturedInPhoto" for a Person would be the inverseOf the "personInPhoto" property for Photos.
• TransitiveProperty – Transitive properties state that if A relates to B with a transitive property, and B relates to C with the same property, then A relates to C through that property.
• SymmetricProperty – Symmetric properties state that if A has a symmetric relationship with B, then B has that relationship with A. For example, a "knows" property could be considered symmetric, since if A knows B, then B should also know A.
• FunctionalProperty – If a property is a FunctionalProperty, then it has no more than one value for each individual. "Age" could be considered a functional property, since no individual has more than one age.
• InverseFunctionalProperty – Inverse functional properties are, formally, properties whose inverse is a functional property. Put more plainly, inverse functional properties are unique identifiers.

Property Type Restrictions:
• allValuesFrom – This restriction, along with someValuesFrom, is used in a class as a local restriction on the range of a property. While an rdfs:range restriction on a property globally restricts the values the property can take, allValuesFrom states that for instances of the restricting class, every value of the restricted property must be an instance of a specified class.
• someValuesFrom – Just like allValuesFrom, this is a local restriction on the range of a property, but it states that at least one value of the restricted property must come from the specified class.

Class Intersection:
• intersectionOf – Classes can be subclasses of multiple other classes. The intersectionOf statement asserts that a class lies directly in the intersection of two or more classes.

Restricted Cardinality: Cardinality restrictions are made for a class, and specify how many values of a property can be attached to an instance of that class.
• minCardinality – This limits the minimum number of values of a property that are attached to an instance. A minimum cardinality of 0 says that the property is optional, while a minimum cardinality of 1 states that there must be at least one value of that property attached to each instance.
• maxCardinality – Maximum cardinality restrictions limit the number of values of a property attached to an instance. A maximum cardinality of 0 means that there may be no values of a given property, while a maximum cardinality of 1 means that there is at most one. For example, an UnmarriedPerson should have a maximum cardinality of 0 on the hasSpouse property, while a MarriedPerson should have a maximum cardinality of 1.
• cardinality – Cardinality is a convenient shorthand for when maximum cardinality and minimum cardinality are the same.

OWL also offers a usability benefit, in addition to the expressiveness described above. Some syntax was renamed from DAML+OIL, the predecessor to OWL, replacing confusing bits of syntax with more descriptive and understandable names. Other features, such as qualified cardinality constraints, which many people considered both confusing and unnecessary, were removed from OWL altogether. Both OWL and DAML+OIL were based on RDF and RDFS, but DAML+OIL duplicated some terms from these base languages, putting identical syntax in two namespaces. This could lead to questions of whether, say, a daml:Property and an rdf:Property were different, when they were, in fact, identical. OWL removed any shadowing of the underlying languages, leaving users just one option. Furthermore, OWL divides the language into three subsets: OWL Lite, which is a subset of OWL DL, which is, in turn, a subset of OWL Full. The benefit of these three levels is that the more complex features are preserved in OWL Full, while OWL Lite and OWL DL offer smaller subsets to the user, each with various features removed.
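To make several of the constructs above concrete, the following fragment combines an equivalentClass assertion, a symmetric property, and a cardinality restriction. It is a sketch of our own: the class and property names (Person, knows, MarriedPerson, hasSpouse) and the example.com namespace are illustrative assumptions, not taken from any published ontology.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">

  <!-- A locally defined class declared equivalent to one in another namespace -->
  <owl:Class rdf:ID="Person">
    <owl:equivalentClass rdf:resource="http://example.com/people.owl#Human"/>
  </owl:Class>

  <!-- "knows" as a symmetric property: if A knows B, then B knows A -->
  <owl:SymmetricProperty rdf:ID="knows">
    <rdfs:domain rdf:resource="#Person"/>
    <rdfs:range rdf:resource="#Person"/>
  </owl:SymmetricProperty>

  <!-- A MarriedPerson has exactly one spouse: cardinality 1 is shorthand
       for minCardinality 1 and maxCardinality 1 together -->
  <owl:Class rdf:ID="MarriedPerson">
    <rdfs:subClassOf rdf:resource="#Person"/>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#hasSpouse"/>
        <owl:cardinality>1</owl:cardinality>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>

</rdf:RDF>
```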
Tools for Creating Semantic Web Markup

To drive any application, it is necessary to create large amounts of RDF. Though authoring RDF and OWL by hand is an option, there are many tools available to make the process more transparent. This section will present a few of the general-purpose tools used to create content. (Both of the tools described here were developed in our lab and are available for download at http://www.mindswap.org/.)

Instance Creation

Users will often want to create Semantic Web markup for individual web pages, photos, or concepts, rather than making a mass conversion of existing data. One of several tools available to assist the user in creating instances is the RDF Instance Creator (RIC) (Golbeck et al., 2002). The tool lets users import existing ontologies, choose a class from those available, and then create an instance by simply filling in a form.

Figure 3: The RDF Instance Creator (RIC) in action

When a class is selected, the user is presented with a workspace that lists all of the known properties of that class. In the screen shot shown in Figure 3, the user is creating an instance of the class "Athlete". The known properties of Athlete, such as "weight", "eyeColor", and "height", are shown in the workspace, and the user can enter the values. Though these first properties take simple strings as values, RIC also allows the user to link objects. The "plays" property, for example, requires an instance of the "Sport" class as its value. The user can either create a new instance of a sport to act as the object in the triple, or link in an existing instance. RIC also facilitates the extension of existing ontologies. Users may add a property to any existing class, and the RDF for the new property is stored in the local output file. Users can add new classes as well; these may be independent classes, or subclass any classes that have been imported from other ontologies.
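For a sense of what RIC produces, an instance filled in through such a form might be serialized roughly as follows. The sports namespace, property names, and values here are our own illustration, not the actual ontology or data from the screen shot.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:sport="http://example.com/sports.rdf#">

  <!-- An Athlete instance with simple string-valued properties -->
  <sport:Athlete rdf:ID="JaneDoe">
    <sport:weight>135</sport:weight>
    <sport:eyeColor>brown</sport:eyeColor>
    <sport:height>5.6</sport:height>
    <!-- "plays" links to an instance of Sport rather than taking a string -->
    <sport:plays rdf:resource="#Soccer"/>
  </sport:Athlete>

  <sport:Sport rdf:ID="Soccer"/>

</rdf:RDF>
```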
For users who are new to the semantic web, with limited understanding of the underlying languages, a lightweight tool like RIC can hide most of the ugly details and jumpstart the instance authoring process.

Ontology Manipulation and Instance Creation

Most people are not ontological engineers, domain experts, logicians, or even programmers, so it is unlikely that they will be able to read, sort through, and grasp how to apply large ontologies, much less construct their own. Aside from the difficulty of learning how to model content in a reasonably correct and formal way, current Web-focused knowledge engineering tends to involve either an interruption of normal workflow and techniques (e.g., switching to an RDF editor to create RDF content which is then linked to an HTML page (McGuinness, van Harmelen, 2003), (Bechhofer, Ng), (Staab et al., 2002)) or a wholesale abandonment of prior practice. While there are many tools for easing ontology creation and knowledge acquisition, few focus on how normal Web authors work; most are geared only toward ontology development (Musen et al., 2002). This forces the author into a two-step situation: either create the content first and then annotate it, or create all of the content in a knowledge-creation context and then render it to HTML in some fashion. SMORE (Semantic Markup, Ontology and RDF Editor) is a tool whose design is driven by the idea that much Semantic Web based knowledge acquisition will look more like Web page authoring than traditional knowledge engineering. Not only does SMORE blur the line between normal content creation and semantic annotation, it also supports ad hoc ontology use, modification, combination, and extension. In keeping with this main design principle of seamlessly integrating content creation and annotation, SMORE provides built-in support for performing routine web-oriented tasks in the context of semantic markup.
For instance, SMORE contains a fully featured WYSIWYG text/HTML editor that allows users to create and deploy web pages. Besides providing standard features for web page design, the editor facilitates the generation of semantic markup by acting as a medium through which users can compose semantic triples from their data. Users can select portions of text from the web page and insert them into triple placeholders (following the standard subject-predicate-object model). When trying to expose information encoded in natural language to a software agent, it seems natural to produce a translation of the information: some of the information is extracted from the text and encoded as RDF. Creating accurate metadata from text is not terribly difficult, but the problem becomes more acute with non-textual sources, such as photographs, in part because the information "in" a photograph is not already encoded in a linguistic form. SMORE lets users add triples to a document that describe a particular photo as a whole. One of its more interesting features also allows sub-image annotation: using standard drawing tools (squares, circles, polygons, etc.), the user delineates a region of a photo and can then state facts about that region. One crucial fact the user can assert about such a region is what it depicts; subsequent annotations can then be about the depicted object. For example, in the screen shot below, the photo depicts Bonnie, an orangutan housed at the National Zoo. One of Bonnie's identifying features is a bulbous forehead, and in this markup, the feature is mapped from the overall photo, semantically described, and noted in connection with other information about Bonnie.

Figure 4: The SMORE interface, showing the sub-image markup feature

Case Study: http://owl.mindswap.org

In addition to the development of languages and applications to support the vision of the Semantic Web, there is a large community working to implement it.
In this section we will look at the implementation details of http://owl.mindswap.org, a website produced entirely using Semantic Web technology, which serves as an example of how semantic markup can be used as the fundamental structure for information on websites.

Figure 5: The http://owl.mindswap.org homepage

On a website generated using Semantic Web technology, as with any good dynamically generated website using traditional database methods, the average user does not see anything different from a hard-coded HTML site. This means that users viewing the pages need not even be aware of the underlying technology, and the usability of the website is not affected. The real human-factors change arises for the website managers. Instead of potentially complicated software with a centralized and engineered database, information for a Semantic Web based site can be distributed across the web and automatically incorporated as dynamic content. For example, with a current database-backed dynamic website, a site that presents the day's headlines would potentially have to collect stories from a variety of wire services, convert each source of data into the database format, and then load it into the database before it could appear on the page. In a world using Semantic Web technology, each wire service would maintain its news headlines as RDF or OWL documents available on the web. To display this information, the centralized news service would only need to produce a one-time description of how the ontology each wire service uses to mark up its news maps to the ontology or formatting used for the website. Because the wire services automatically update their news, the centralized site would merely have to retrieve the latest RDF or OWL document from each service and use the pre-defined mappings to present that data on a page.
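The kind of document a wire service might publish in this scenario could look something like the following sketch. The news ontology, its namespace, and all property names and values here are entirely hypothetical.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:news="http://example.com/news.owl#">

  <!-- One headline, published by the wire service as machine-readable data -->
  <news:Article rdf:ID="story-001">
    <news:headline>Semantic Web Standards Advance</news:headline>
    <news:date>2004-01-12</news:date>
    <!-- The byline points to an author resource maintained elsewhere -->
    <news:byline rdf:resource="http://example.com/authors.rdf#JaneReporter"/>
  </news:Article>

</rdf:RDF>
```

A central site that has mapped news:headline and news:byline onto its own presentation ontology once can then render every future item from this service without per-service conversion code.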
By allowing each source to maintain its own data, the central site that presents the data is freed from maintaining a central database, updating that database, and worrying about consistency between the central news service and the wire services. One may argue that a system of automated retrieval and conversion of data in the traditional database model is quite similar to the scenario described above. An even clearer benefit, however, can be seen in the case where a wire service changes the format of its news. In a system of automated conversion, a human would have to manually update the code that does the conversion; a change as simple as swapping the positions of article date and article author, or renaming a field, say from "Author Name" to "Byline", could break a converter. Conversely, in the Semantic Web model, the ontology dictates the structure of the data. As long as a revised ontology is based on the original version, the system will continue to function. This gives the maintainer the freedom to update mappings at leisure, since breakage is far less likely to occur. One final benefit, before explaining how to implement such a site, is the standard format of the data. Because all Semantic Web data is based on standards, it is easy to connect information maintained by separate sources. If a wire service connects an author to each article, and a separate service maintains information about authors, Semantic Web technology makes it possible to connect the two automatically because of the shared data format. Thus, not only is a hyperlinked web of connected pages built, but a secondary web of connected data emerges. Users could read an article from a wire service, click on the author's name to get biographical information from a second service, and then perhaps click on the author's hometown to find out more about the place, with information provided by yet another service.
In some cases, where the ontologies of each service interlink themselves, this is a trivial task for the website manager. Even in more complex cases, the burden on the manager is much lighter, because the common format makes it easy to link data from distributed sources.

Implementation Details

Mindswap is the Maryland Information and Network Dynamics Lab Semantic Web Agents Project, based at the University of Maryland, College Park. The website http://owl.mindswap.org was created to showcase the tools and technologies developed in the lab, and to become the first website generated exclusively using Semantic Web technology. RDF and OWL are used to store all of the local information for the site. A collection of ontologies describes the domain-relevant concepts, such as people, news items, downloads, and paper references. The site is divided into categories, and instance data is presented on each page.

Data Storage

Data exists in two places simultaneously. The RDF and OWL files that contain the data live on the web server and are available for download, allowing interchange with other sites. The RDF is also stored in a backend database, and is manipulated and accessed via the Redland application framework. Redland is an object-oriented library written in C that has three major features:
• Classes to represent the core concepts of RDF, including URIs, literals, nodes, and statements.
• A fast, standards-compliant parser for RDF/XML and NTriples.
• A triple store that provides facilities for querying and modifying data.
The triple store is an abstract interface for an underlying data store, whose actual implementation can be chosen at run time. Data can be stored using Berkeley DB, a low-level database system; using Parka (Evett et al., 1994), an inferencing database designed with triples in mind; in memory; or on disk.
Using a database of RDF as the backend for the website raises the question, "why not just use a standard database?" The answer is that doing so would require building an extensive hierarchy anyway, and would not be portable to other sites. With an RDF base, stored in a more traditional database, all of the site's data can be accessible to anyone on the web. No capabilities are lost with this approach, since the RDF can easily be edited or changed, and the backend database can be updated in real time.

Adding, Editing, and Removing Data

Any authorized user can add, remove, or edit data on the site in real time by using one of several interfaces for creating new RDF instance data. These include a web-based form of the RDF Instance Creator tool (see section 4.2), as well as an interface for editing raw RDF. In the event that a user just wants to see the backend data without changing it, each page offers a set of links that show all of the RDF either in raw text form or through the web-based version of RIC. The Redland framework not only mirrors RDF found on the owl.mindswap.org site – it can also import data from other web servers. This allows querying based on ontologies created by other organizations. Any user can submit the URIs of their RDF and OWL data through a form on the site. That data is immediately added to the database, and will appear on any pages that use the same semantics. If any external pages are changed, users can request an update via the website, so even the external data in the database is kept consistent with the files.

Generating HTML

HTML web pages are generated from the database. For example, one of the Mindswap ontologies defines a class called "Swapper", which is used to refer to any members or affiliates of the lab. Subclasses of "Swapper" include "Graduate Students", "Faculty", and "Alumni", to name a few.
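A minimal sketch of how such a hierarchy might be declared in RDF Schema follows. The namespace and exact class names are our own approximation, not the published Mindswap ontology.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">

  <!-- The superclass covering all lab members and affiliates -->
  <rdfs:Class rdf:ID="Swapper"/>

  <!-- Each kind of member is a subclass of Swapper, so a query for all
       subclasses of Swapper can drive the People page automatically -->
  <rdfs:Class rdf:ID="GraduateStudent">
    <rdfs:subClassOf rdf:resource="#Swapper"/>
  </rdfs:Class>
  <rdfs:Class rdf:ID="Faculty">
    <rdfs:subClassOf rdf:resource="#Swapper"/>
  </rdfs:Class>
  <rdfs:Class rdf:ID="Alumnus">
    <rdfs:subClassOf rdf:resource="#Swapper"/>
  </rdfs:Class>

</rdf:RDF>
```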
The "People" page on the website queries through Redland for all subclasses of "Swapper", and retrieves all instances of each of those subclasses. A nested HTML list is then used to represent the hierarchy of types of "Swappers" and information about each individual.

Figure 6: The People page from http://owl.mindswap.org

Because instances are interlinked, users of the site are not restricted to viewing information from a specific category. A common example of this is finding items created by a particular person. Any RDF instance generated for the Mindswap site can have a "creator" property, whose value will be an instance of a "Swapper". This makes it easy to find and list all RDF entries created by a particular person, including news items, papers, and software. Another example of interlinking is that each software project has a property to express the language used. Because these languages are part of an RDF hierarchy, with the ultimate superclass "Programming Language", it is possible, with a simple user interface, to show not only projects using, say, Java, but all projects using an "Object Oriented Programming Language", since that is part of the "Programming Language" hierarchy. One of the first issues raised in this chapter was the inability of web developers to use another party's database to drive their own site. With websites such as this, that is no longer the case. Since every page is generated in real time from RDF, the site is not limited to using its own content alone. On the homepage, in-house news items are presented first, followed by links to and descriptions of news from the W3C and other Semantic Web sources. These third-party news sources are seamlessly integrated into the site, since processing them is just a matter of reading the RDF. Similarly, any other website could use data from the Mindswap RDF files, which are also available on the web.
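The interlinking described above can be pictured with a hypothetical project instance. All names and the namespace are illustrative, not taken from the actual site data.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:sw="http://example.com/mindswap-demo.rdf#">

  <!-- A software project linked both to its creator and to its language -->
  <sw:SoftwareProject rdf:ID="ExampleTool">
    <sw:creator rdf:resource="#JaneSwapper"/>
    <sw:language rdf:resource="#Java"/>
  </sw:SoftwareProject>

  <!-- Java is typed within a language hierarchy, so a query for projects
       using an ObjectOrientedProgrammingLanguage will also find this one -->
  <sw:ObjectOrientedProgrammingLanguage rdf:ID="Java"/>

  <sw:GraduateStudent rdf:ID="JaneSwapper"/>

</rdf:RDF>
```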
Web Scraping and Formatted Content

It is not uncommon to find HTML web pages that present data in a well-formatted, structured layout. Tables of data certainly fall into this category, as do many auto-generated pages, such as eBay listings or Amazon.com products. If data is available in this format, it can be "scraped" into a tool that can output corresponding RDF. One of these screen scrapers (Golbeck et al., 2002) is available as part of the SMORE package discussed earlier. To scrape a page or set of pages, the user creates a "wrapper" that describes how the HTML tags in the document relate to the contents. In the simple HTML below, the person's name is immediately preceded by the "<b>" tag and followed by the "</b>" tag; the email address follows similar rules using the "<i>" tag. An interesting feature of the scraper is its ability to take information from between tags as well as from within them; this allows users to scrape the URLs of images or links and mark them up. In this example, the URL of a photo is contained within the "img" tag and needs to be extracted.

<b>John Doe</b>
<br><img src="http://www.example.com/images/johndoe.jpg">
<br> Mr. Doe can be emailed at <i>email@example.com</i>.

By specifying these three points, the software can extract a simple table of data from a page. The screen scraper also has the capability to crawl over a number of pages; even if a server generates a different page for each person, each page can be scraped and the data aggregated. Once a table of data has been collected, the user can specify how each column should be translated into RDF. Columns may be turned into class names, instances of existing classes, or attached to an instance as values for pre-defined properties. The situation is similar for spreadsheets and simple databases.
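The output of scraping the fragment above might, under a hypothetical contact ontology of our own devising, look like this:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:c="http://example.com/contact.rdf#">

  <!-- The name came from between the <b> tags, the photo URL from within
       the img tag, and the email address from between the <i> tags -->
  <c:Person rdf:ID="JohnDoe">
    <c:name>John Doe</c:name>
    <c:photo rdf:resource="http://www.example.com/images/johndoe.jpg"/>
    <c:email>email@example.com</c:email>
  </c:Person>

</rdf:RDF>
```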
Since these types of files are often highly structured in a straightforward way, the step of "scraping" is unnecessary, and direct conversion to RDF is fairly simple. Mindswap has two tools, ConvertToRDF (Golbeck et al., 2002) and ExcelToRDF, which allow users to turn comma-delimited data files and Microsoft Excel spreadsheets into RDF data. In both cases, users specify what class to instantiate for each row of data, and which properties of that class correspond to each column. Consider the following simple database:

Name,Height,Weight,Eye Color,Hair Color
Adam,6,150,blue,brown
Chris,5.5,102,brown,brown
Joe,5.7,183,green,blond
Marvin,6.3,212,blue,black

ConvertToRDF uses a simple text file stating that each row corresponds to a Person (as defined in an existing ontology), and that each column corresponds to a particular, pre-defined property of Person. With these converter tools, it is trivial to produce thousands of RDF triples in minutes. Depending on the detail within a given database, this may result in fairly rich data models with minimal effort from the user.

References

• Golbeck, J., Grove, M., Parsia, B., Kalyanpur, A., Hendler, J. (2002). New Tools for the Semantic Web. Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW02). Siguenza, Spain.
• NCI Center for Bioinformatics caCORE. http://ncicb.nci.nih.gov/NCICB/core
• McGuinness, D., van Harmelen, F. (2003). Web Ontology Language (OWL): Overview. http://www.w3.org/TR/owl-features/
• Heflin, J. (2003). Web Ontology Language (OWL) Use Cases and Requirements. http://www.w3.org/TR/2003/WD-webont-req-20030203/
• Payne, T., Singh, R., & Sycara, K. (2002). Calendar Agents on the Semantic Web. IEEE Intelligent Systems, Vol. 17 (3), 84-86.
• Sirin, E., Hendler, J., Parsia, B. (2002). Semi-automatic Composition of Web Services using Semantic Descriptions. Web Services: Modeling, Architecture and Infrastructure Workshop at ICEIS2003.
• Golbeck, J., Parsia, B., Hendler, J.
(2003). Trust Networks on the Semantic Web. Proceedings of Cooperative Intelligent Agents 2003. Helsinki, Finland.
• van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D., Patel-Schneider, P., Stein, L. (2003). Web Ontology Language (OWL) Reference Version 1.0. W3C Working Draft 21 February 2003. http://www.w3.org/TR/owl-ref/
• Evett, M., Hendler, J., & Spector, L. (1994). Parallel Knowledge Representation on the Connection Machine. Journal of Parallel and Distributed Computing, 22, 168-184.
• Musen, M., Fergerson, R., Grosso, W., Noy, N., Crubezy, M., & Gennari, J. (2002). Component-Based Support for Building Knowledge-Acquisition Systems. Conference on Intelligent Information Processing (IIP 2000) of the International Federation for Information Processing World Computer Congress (WCC 2000). Beijing.
• Bechhofer, S. & Ng, G. OilEd. http://img.cs.man.ac.uk/oil/
• Staab, S., Sure, Y., Erdmann, M., Wenke, D., Angele, J., Studer, R. (2002). OntoEdit: Collaborative Ontology Development for the Semantic Web. Proceedings of the First International Semantic Web Conference (ISWC 2002). Sardinia, Italy.
• Winkler, J. RDFedt. http://www.jan-winkler.de/dev/e_rdfe.htm
• The DAML+OIL Language. http://www.daml.org/2001/03/daml+oil-index.html
• SHOE (Simple HTML Ontology Extensions). http://www.cs.umd.edu/projects/plus/SHOE/
• The OIL Language. http://www.ontoknowledge.org/oil/
• Extensible Markup Language (XML). http://www.w3.org/XML/
• Resource Description Framework (RDF). http://www.w3.org/RDF/
• RDF Schema. http://www.w3.org/TR/rdf-schema/
• Ontobroker. http://ontobroker.aifb.uni-karlsruhe.de/index_ob.html
• Mutton, P. & Golbeck, J. (2003). Visualizing Ontologies and Metadata on the Semantic Web. Proceedings of Information Visualization 2003. London.