Intelligent Information Retrieval and Web Search by nikeborome


         XML, RDF and Advanced Search
         (Semantic Web)

   Thanks to Jim Hendler, Carl Lagoze, Jayavel
Shanmugasundaram, Sara Cohen, Jonathan Mamou,
 Yaron Kanza, Mark Sapossnek, Yehoshua Sagiv,
               Frank van Harmelen
                    What we have covered
•   What is IR
•   Evaluation
•   Tokenization and properties of text
•   Web crawling
•   Query models
•   Vector methods
•   Measures of similarity
•   Indexing
•   Inverted files
•   Basics of internet and web
•   Spam and SEO
•   Search engine design
•   Google and Link Analysis
•   Social network analysis
     – This lecture: metadata, XML, RDF; issues in advanced search and the Semantic Web
   The importance of data and their rules

• Tim Berners-Lee
  – inventor of the world wide web
  – Founder of the W3C

• Presentation at TED
Metadata and Markup languages

“Metadata is data about data”

Metadata often is written in XML
Metadata is semi-structured data conforming to commonly
agreed upon models, providing operational interoperability
             in a heterogeneous environment
           What is metadata?
         Some simple definitions
• "Structured data about data".
            • Dublin Core Metadata Initiative FAQ, 2005

• Machine-understandable information about
  Web resources or other things.
                         • Tim Berners-Lee, W3C, 1997
"Web resources or other things"
• Metadata might be "about"… anything!
 –   HTML documents
 –   digital images
 –   databases
 –   books
 –   museum objects
 –   archival records
 –   metadata records
 –   Web sites
 –   collections
 –   services
 –   physical places
 –   people
 –   organizations
 –   “works”
 –   formats
 –   concepts
 –   events
               What is metadata?
           Towards a "functional" view

• Data associated with objects which relieves their potential users of
  having to have full advance knowledge of their existence or characteristics.
         • Lorcan Dempsey & Rachel Heery, "Metadata: a current view of practice and
           issues", 1998

• Structured data about resources that can be used to help support a wide
  range of operations.
         • Michael Day, "Metadata in a Nutshell", 2001
What might metadata "say"?
      What is this called?
      What is this about?
      Who made this?
      When was this made?
      Where do I get (a copy of) this?
      When does this expire?
      What format does this use?
      Who is this intended for?
      What does this cost?
      Can I copy this? Can I modify this?
      What are the component parts of this?
      What else refers to this?
      What did "users" think of this?
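A minimal sketch of how a few of these questions might be answered in a machine-readable record. It assumes the Dublin Core element set (the slides cite the Dublin Core Metadata Initiative); the `record` wrapper element and every value below are hypothetical placeholders, not taken from the lecture.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC)

# Hypothetical description of a resource; every value is a placeholder.
answers = {
    "title": "Example Report",      # What is this called?
    "subject": "metadata",          # What is this about?
    "creator": "A. Author",         # Who made this?
    "date": "2005-01-01",           # When was this made?
    "format": "application/pdf",    # What format does this use?
    "rights": "Copying permitted",  # Can I copy this?
}

# Build one XML element per answered question.
record = ET.Element("record")
for name, value in answers.items():
    ET.SubElement(record, f"{{{DC}}}{name}").text = value

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Each question maps to one named element, which is what lets a program (rather than a person) find the creator or the format.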
           What operations/functions?
•   resource disclosure & discovery
•   resource retrieval, use
•   resource management, including preservation
•   verification of authenticity
•   intellectual property rights management
•   commerce
•   content-rating
•   authentication and authorization
•   personalization and localization of services
•   (etc!)
          What operations/functions?

• Different functions : different metadata
• Metadata (and metadata standards) sometimes
  classified according to function
   – Descriptive: primarily for discovery, retrieval
   – Administrative: primarily for management
   – Structural: relationships between component parts of resources
   – Contextual: relationships between resources
• No “one size fits all solution”!
               Metadata importance
• “data about data” is about as good as the definition gets
• As a data resource grows, metadata becomes more important
• Lack of metadata has different consequences
   – documentation: metadata can be regenerated automatically,
     or by hand
   – datasets, pictures: once lost, can be impossible to reconstruct
                  Types of Metadata
• Descriptive
   – Discovery / description of objects
      • Title, author, abstract, etc.
• Structural
   – Storage & presentation of objects
      • 1 pdf file, 1 ppt file, 1 LaTeX file, etc.
• Administrative
   – Managing and preservation of objects
      • Access control lists, terms and conditions, format descriptions,
Which View is Correct?

 figure 1 from:
                     Approaches to Metadata

• from Ng, Park and Burnett, 1997 (also JASIS, 50(13))

   – library science: bibliographic control
           • “organizing the physical containers of information, by means
             of bibliographical description, subject analysis, and
             classification notation construction, so that the container can be
             efficiently described, identified, located and retrieved”
   – computer and information science: data management
           • “not only to store, access and utilize data effectively, but also
             to provide data security, data sharing, and data integrity”
Metadata Formats and Implementation

• Use markup languages
  – Interoperable
  – Extensible
  – Robust
• Permits advanced search features

When online, the beginning of a semantic web
       What is a markup language?
• Textual (i.e. person readable) language
  where significant elements are indicated by markup tags
• Examples are RTF, HTML, XML, TeX, etc.
• Easy to process and can be manipulated by
  a variety of application programs
 Standard Generalized Markup Language (SGML)

• Based on GML (generalized markup language), developed
  by IBM in the 1960s
• An international standard (ISO 8879:1986) defines how
  descriptive markup should be embedded in a document
• Can define any document format of any complexity
• Enables extensibility, structure and validation
• Too many optional features for the Web
• Gave birth to the extensible markup language (XML),
  W3C recommendation in 1998
              The Purpose of SGML

    SGML is designed to make your information last longer
    than the systems that created it. Such longevity also
    implies immunity to short-term changes -- such as a
    change from one application program to another -- so
    SGML is also inherently designed for re-purposing and
                  What is SGML?
• SGML (and its derivatives, HTML and XML) are ASCII
  character based representations of electronic data
• Remember, it's all bits--meaning is derived from how they
  are organized…
• Think of SGML docs as strings that must be parsed--A
  web browser parses an HTML doc and uses the markup
  codes to display the data contained
• Since it's all ASCII, these docs can also be handled by non
  parsing tools (such as vi, emacs, perl, etc.)
•   SGML is the “mother tongue” – but is overkill for most
    common desktop applications.

•   XML is an abbreviated version of SGML
•   easier to define own document types
•   easier for programmers to write programs to handle
    documents (and data)
•   omits all the options (and most of the more complex and less-
    used parts) of SGML
•   HTML is just one of many SGML or XML
    “applications” – most frequently used on the Web
               SGML Components

• SGML documents have three parts:
• Declaration: specifies which characters and delimiters may
  appear in the application
• DTD (document type definition) / style sheet: defines the
  syntax of markup constructs
• Document instance: actual text (with the tags) of the document
• More information can be found at the World Wide Web Consortium (W3C)
                 What is XML?
• XML – eXtensible Markup Language
• designed to improve the functionality of the Web
  by providing more flexible and adaptable
  information and identification
• “extensible” because it is not a fixed format like HTML
• a language for describing other languages (a meta-language)
• lets you design your own customised markup language
                       The HTML World
  <h1> XML and Information Retrieval: A SIGIR 2000 Workshop </h1>

     <p> The workshop was held on 28 July 2000. The editors of the workshop
         were David Carmel, Yoelle Maarek, and Aya Soffer </p>

     <h2> XQL and Proximal Nodes </h2>

          <p> The paper was authored by Ricardo Baeza-Yates and
              Gonzalo Navarro. The abstract of this paper is given below. </p>

          <p> We consider the recently proposed language … </p>

          <p> The paper references the following papers:
               <a href=""> … </a> </p>
                          The XML World
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <paper id="1">
    <title> XQL and Proximal Nodes </title>
    <author> Ricardo Baeza-Yates </author>
    <author> Gonzalo Navarro </author>
    <abstract> We consider the recently proposed language … </abstract>
    <section name="Introduction">
      Searching on structured text is becoming more important with XML …
      <subsection name="Related Work">
        The XQL language …
      </subsection>
    </section>
    <cite xmlns:xlink="…"> … </cite>
  </paper>
</workshop>
              Why use XML?

• XML is written in SGML – the
  Standard Generalized Markup Language,
  an international standard (ISO 8879)
• XML = very simple dialect of SGML
• goal = enable generic SGML to be served,
  received and processed on the Web in ways
  not possible with HTML
               Why use XML?

• XML is not just for Web pages
• use to store any kind of structured information
• to enclose/encapsulate information in order
  to pass it between different computing
  systems that are otherwise unable to communicate
             Key feature of XML
• An application is free to use XML tagged data in
  many different ways, e.g.
• produce an image
• generate a formatted text listing
• display the XML document's markup in pretty
• restructure the data into a format for storing in a
  database, transmission over a network, input to
  another program.
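To illustrate the point that one XML document can feed several different uses, here is a small sketch using Python's standard `xml.etree.ElementTree`. The sample data is cut down from the workshop example in these slides; the function names `formatted_listing` and `as_records` are made up for this illustration.

```python
import json
import xml.etree.ElementTree as ET

# Sample data cut down from the workshop example in these slides.
XML_DOC = """<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <paper id="1">
    <title>XQL and Proximal Nodes</title>
    <author>Ricardo Baeza-Yates</author>
    <author>Gonzalo Navarro</author>
  </paper>
</workshop>"""


def formatted_listing(root):
    """Use 1: generate a formatted text listing of papers and authors."""
    lines = []
    for paper in root.findall("paper"):
        authors = ", ".join(a.text.strip() for a in paper.findall("author"))
        title = paper.findtext("title").strip()
        lines.append(f"{paper.get('id')}. {title} ({authors})")
    return "\n".join(lines)


def as_records(root):
    """Use 2: restructure the same data into flat records for a database."""
    return [
        {
            "workshop_date": root.get("date"),
            "paper_id": paper.get("id"),
            "title": paper.findtext("title").strip(),
            "authors": [a.text.strip() for a in paper.findall("author")],
        }
        for paper in root.findall("paper")
    ]


root = ET.fromstring(XML_DOC)
listing = formatted_listing(root)
records = as_records(root)
print(listing)
print(json.dumps(records, indent=2))
```

The XML itself says nothing about either use; each program decides what to do with the tagged data.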
      XML is important because...

•   Removes 2 constraints that held back Web developments:
•   dependence on a single, inflexible
    document type (HTML) [much abused]
•   the complexity of full SGML
    [many options but hard to program]
•   XML… allows the flexible development of
    user-defined document types.
•   provides a robust, non-proprietary,
    persistent, and verifiable file format for
    the storage and transmission of text and
    data both on and off the Web
            XML Software?

• many programs are “XML ready”
  already today.
• covers news of new
  additions to XML
     Is XML a Computer Language?

• XML is not C or C++ or like any other
  programming language
• By itself, it cannot specify calculations,
  actions, decisions to be carried out in any
• XML is a markup specification language
        XML - a Markup Language

• with XML, you can design ways of describing information
  (text or data), usually for storage, transmission or
  processing by a program
• XML conveys no information about what should be done
  with the data or text – it merely describes it.
• By itself, XML does not do anything – it is a data description language
How do I run or execute an XML file?

• You can't and you don't!
• XML is not a programming language
• XML is a markup specification language
• XML files are just data (waiting for a
  program to do something with them)
• XML files can be viewed with an XML
  editor or XML-compatible browser
              Things to Remember
• XML does not replace HTML – it provides an
  alternative which allows you to define your own
  set of markup elements to a published standard:
   – <?xml version="1.0" standalone="yes"?>
   – <conversation>
   –  <greeting>Hello, world!</greeting>
   –  <response>Stop the planet, I want to get off!</response>
   – </conversation>
          Things to Remember
• All parts of an XML document are case sensitive
• Element type names are case sensitive, so
  <BODY> … </body> is out.
• Attribute names are case sensitive …
•      <PIC width="7cm"/> and
•      <PIC WIDTH="6cm"/>
• describe different attributes, not just
  different values for the same attribute of “PIC”
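The case-sensitivity rules can be checked with any conforming XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Start and end tags must match exactly, including case.
matching = is_well_formed("<BODY>text</BODY>")
mismatched = is_well_formed("<BODY>text</body>")

# 'width' and 'WIDTH' are two distinct attributes, so both may appear
# on the same element without being treated as a duplicate.
pic = ET.fromstring('<PIC width="7cm" WIDTH="6cm"/>')
attributes = dict(pic.attrib)

print(matching, mismatched, attributes)
```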
                 What is XQuery?

– XQuery is the language for querying XML data
    • The best way to explain XQuery is to say that XQuery is to XML
      what SQL is to database tables.
– XQuery uses XPath expressions to extract XML data.
    • XPath is a language for finding information in an XML document.
    • XPath is used to navigate through elements and attributes in an XML document.
– XQuery is defined by the W3C.
– XQuery is supported by all the major database engines (IBM,
  Oracle, Microsoft, etc.)
    • XQuery 1.0 W3C Recommendation
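A rough sketch of XPath-style navigation. Python's standard library does not implement XQuery, and `xml.etree.ElementTree` supports only a limited subset of XPath, so this shows the path-expression idea rather than XQuery itself. The document is cut down from the workshop example in these slides.

```python
import xml.etree.ElementTree as ET

DOC = """<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <paper id="1">
    <title>XQL and Proximal Nodes</title>
    <author>Ricardo Baeza-Yates</author>
    <author>Gonzalo Navarro</author>
  </paper>
</workshop>"""

root = ET.fromstring(DOC)

# Path with an attribute predicate: authors of the paper whose id is "1".
# A comparable XQuery would be:
#   for $a in /workshop/paper[@id="1"]/author return $a
authors = [a.text for a in root.findall("./paper[@id='1']/author")]

# Descendant search: every <title> element anywhere below the root.
titles = [t.text for t in root.findall(".//title")]

print(authors)
print(titles)
```

Note that the query author has to know the document uses `paper`, `author` and `id`.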
          Motivation for XML Search
• It is becoming increasingly popular to publish data on the
  Web in the form of XML documents.

• Current search engines, which are an indispensable tool for
  finding HTML documents, have two main drawbacks
  when it comes to searching for XML documents.
   – It is not possible to pose queries that explicitly refer to XML tags.
   – Search engines return references (i.e. links) to documents and not
     specific fragments thereof. This is problematic, since large XML
     documents may contain thousands of elements storing many pieces
     of information that are not necessarily related to each other.
               Problems with XQuery

• A query language for XML, such as XQuery, can be used
  to extract data from XML documents.

• However, such a query language is not an alternative to an
  XML search engine for several reasons.
   – The syntax of XQuery is more complicated than the syntax of a
     standard search query. Hence, it is not appropriate for a naive user.
   – Extensive knowledge of the document structure is required in order
     to correctly formulate a query. Thus, queries must be formulated
     on a per document basis.
   – XQuery lacks any mechanism for ranking answers.

• Solution - XML Search engine
          XML Search Tool Design Features?

• A simple syntax that can be used by naive users
• Search results should include XML fragments and not
  necessarily full documents
• The XML fragments in an answer should be semantically related
   – For example, a paper and an author should be in an answer only
     if the paper was written by this author
• Search results should be ranked
• Search results should be returned in “reasonable” time
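A toy sketch of these design features, not a real XML search engine. It takes a plain keyword query, returns `paper` fragments rather than whole documents, requires all terms to fall within the same fragment (a crude stand-in for "semantically related"), and ranks by term count. The second paper and its author "Jane Example" are invented for the illustration, as are the function names.

```python
import re
import xml.etree.ElementTree as ET

DOC = """<workshop>
  <paper id="1">
    <title>XQL and Proximal Nodes</title>
    <author>Ricardo Baeza-Yates</author>
    <author>Gonzalo Navarro</author>
  </paper>
  <paper id="2">
    <title>Searching XML Documents</title>
    <author>Jane Example</author>
  </paper>
</workshop>"""


def tokens(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(root, query, fragment_tag="paper"):
    """Return (score, fragment) pairs, best first.

    A fragment matches only if it contains every query term; its score
    is the total number of query-term occurrences inside it.
    """
    terms = tokens(query)
    results = []
    for frag in root.iter(fragment_tag):
        words = tokens(" ".join(frag.itertext()))
        if all(t in words for t in terms):
            score = sum(words.count(t) for t in terms)
            results.append((score, frag))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


root = ET.fromstring(DOC)
hits = search(root, "navarro xql")
hit_ids = [frag.get("id") for _, frag in hits]
unrelated = search(root, "navarro searching")
print(hit_ids, unrelated)
```

The query "navarro searching" returns nothing because the author and the title word belong to different papers, which matches the paper/author requirement stated above.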
                      XML Search Engines
• Summary of XML engines
   – Open source ones starting to emerge

• Or just use web search engine with filetype:xml
   – Usually doesn't work!

• Many for commercial use and some in design
   – Active research area

• Web XML is a step in the direction of the semantic web!
                  What is Web 2.0 ?

• Term coined by Tim O'Reilly and MediaLive
  International as part of a brainstorming session about the
  future of the web in 2004
• Also may be called the Live Web or Living Web
• Refers to more interactive technologies that engage,
  facilitate and empower users
• Companies utilizing interactive technologies are the hot
• Companies are just starting to embrace these technologies
  for business value
• Tim's Def (Video); Schmidt's (Video)
• The Machine (Video)
                         Web 1.0 vs 2.0 (Some Examples)

          Web 1.0                         -->   Web 2.0
          DoubleClick                     -->   Google AdSense
          Ofoto                           -->   Flickr
          Akamai                          -->   BitTorrent
          mp3.com                         -->   Napster
          Britannica Online               -->   Wikipedia
          personal websites               -->   blogging
          domain name speculation         -->   search engine optimization
          page views                      -->   cost per click
          screen scraping                 -->   web services
          publishing                      -->   participation
          content management systems      -->   wikis
          directories (taxonomy)          -->   tagging ("folksonomy")
          stickiness                      -->   syndication
Source: “What is Web 2.0: Design Patterns and Business Models for the Next Generation of Software”, 9/30/2005
         Web 3.0
This will be the INTELLIGENT Web

      The Semantic Web!
How will we get the semantic web?

   Now... that should clear up a few things around here
• The Web and Web 2.0 were designed with humans in mind.
  (Human Understanding)
• The Web 3.0 will anticipate our needs, whether it is State
  Department information when traveling, foreign embassy
  contacts, airline schedules, hotel reservations, area taxis,
  or famous restaurants. The new Web
  will be designed for computers.
  (Machine Understanding)
• The Web 3.0 will be designed to anticipate the meaning
  of the search.
            Web 2.0 vs Web 3.0

 Web 2.0 : On the Web, you can see your e-
  mails, photographs, and restaurant
 Web 3.0: On the Web, you can see your photographs arranged so that
  you know which restaurants you visited on a particular
  date, based on related e-mails sent that day.
• The next stage for the Web will be making data
  accessible to artificial intelligence agents.
• The Web 3.0 will need new languages beyond
  HTML or XML. One such language is RDF, the
  Resource Description Framework.
• The Web 3.0 will need data delivered in
  computer-readable form (RDF).
                General idea of Semantic Web
Make current web more machine accessible and intelligent!
(currently all the intelligence is in the user)
Motivating use-cases
• Search engines
    • concepts, not keywords
    • semantic narrowing/widening of queries
• Shopbots
    • semantic interchange, not screenscraping
• E-commerce
    – Negotiation, catalogue mapping, personalisation
• Web Services
    – Need semantic characterisations to find them
• Navigation
    • by semantic proximity, not hardwired links
• .....
• Try these queries with Google:
   – “Distance between Paris and Madrid”: Google returns the distance
       (in miles and kilometers)
   – “(The) largest city of France”: Google returns “France –
     Largest City: Paris”
   – “(The) largest city of Spain”: Google returns “Spain – Largest
     City: Madrid”
• Now, try these with Google:
   – Distance between largest city of France and largest city of Spain
   – Distance between “largest city of France” and “largest city of Spain”
   – And worst, Distance between “the largest city of France” and
     “the largest city of Spain” – No result returned by Google!
• So, what's wrong with Google?
   – Nothing. The problem is with the World Wide Web:
      • The Web contains unstructured information
   – and Google is a keyword- and phrase-based search engine
• Initiative to make the contents on the Web
  structured information/represented knowledge
   – the Semantic Web
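A minimal sketch of why structured facts make the composed query answerable. It assumes a tiny hand-made fact base. The two "largest city" facts come from the queries above; the coordinates are approximate, and the predicate name `largestCity` and all function names are invented for this illustration.

```python
import math

# Facts as (subject, predicate, object) triples.
FACTS = {
    ("France", "largestCity", "Paris"),
    ("Spain", "largestCity", "Madrid"),
}

# Approximate city-centre coordinates (latitude, longitude) in degrees.
COORDS = {"Paris": (48.86, 2.35), "Madrid": (40.42, -3.70)}


def lookup(subject, predicate):
    """Return the object of the first matching fact."""
    for s, p, o in FACTS:
        if s == subject and p == predicate:
            return o
    raise KeyError((subject, predicate))


def great_circle_km(a, b, radius_km=6371.0):
    """Haversine distance between two (lat, lon) points on a sphere."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def distance_between_largest_cities(country_a, country_b):
    """Resolve each sub-question first, then compute the distance."""
    city_a = lookup(country_a, "largestCity")
    city_b = lookup(country_b, "largestCity")
    return city_a, city_b, great_circle_km(COORDS[city_a], COORDS[city_b])


city_a, city_b, km = distance_between_largest_cities("France", "Spain")
print(f"{city_a} to {city_b}: about {km:.0f} km (great-circle)")
```

The keyword engine fails because it matches strings; here "largest city of France" is resolved to an entity before the distance is computed.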
   General idea of Semantic Web(2)

Do this by:
• Somehow making data and metadata available
  on the Web in machine-understandable form
• Structure the data and meta-data in ontologies

                These are non-trivial
                design decisions.
                Alternative would be:
          Semantic Search Engines
Do they exist?
• Some claim that they do
• Try these out:
  –   Hakia
  –   Exalead
  –   Kosmix
  –   Swoogle
  –   WolframAlpha
  –   Bing
Expressed using the W3C stack
What it’s like to be a machine
       on the Web
                  Required are:
• Explicit meta-data

• Shared domain descriptions

Machine-processable content
Machine-support for interoperability
     machine accessible meaning
            (What it’s like to be a machine)




                  XML
         machine accessible meaning
(Figure: a CV whose parts are marked up with the tags <name>, <education>, <work> and <private>, inside <CV>)
                  So why not just use XML?
• No agreement on:
   – structure
        • is country a:
             – object?
             – class?
             – attribute?
             – relation?
             – something else?
        • what does nesting mean?
   – vocabulary
        • is country the same as nation?

     <country name="Netherlands">
       <capital name="Amsterdam">
         <areacode>020</areacode>
       </capital>
     </country>

     <nation>
       <name>Netherlands</name>
       <capital>Amsterdam</capital>
       <capital_areacode>020</capital_areacode>
     </nation>

● Are the above XML documents the same?
● Do they convey the same information?
● Is that information machine-accessible?
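One way to make these questions concrete is to parse both documents. In the sketch below the two documents produce different element trees, and only hand-written, per-document extraction rules recover the same facts. The function names are invented for this example.

```python
import xml.etree.ElementTree as ET

DOC_A = """<country name="Netherlands">
  <capital name="Amsterdam">
    <areacode>020</areacode>
  </capital>
</country>"""

DOC_B = """<nation>
  <name>Netherlands</name>
  <capital>Amsterdam</capital>
  <capital_areacode>020</capital_areacode>
</nation>"""


def tag_paths(elem, prefix=""):
    """List every element path in document order, e.g. 'country/capital'."""
    path = f"{prefix}/{elem.tag}" if prefix else elem.tag
    paths = [path]
    for child in elem:
        paths.extend(tag_paths(child, path))
    return paths


# Per-document rules written by a human who understands both layouts.
# Nothing in XML itself states that the two layouts are equivalent.
def facts_a(root):
    cap = root.find("capital")
    return {"country": root.get("name"),
            "capital": cap.get("name"),
            "areacode": cap.findtext("areacode")}


def facts_b(root):
    return {"country": root.findtext("name"),
            "capital": root.findtext("capital"),
            "areacode": root.findtext("capital_areacode")}


root_a, root_b = ET.fromstring(DOC_A), ET.fromstring(DOC_B)
paths_a, paths_b = tag_paths(root_a), tag_paths(root_b)
same_structure = paths_a == paths_b
same_information = facts_a(root_a) == facts_b(root_b)
print(paths_a, paths_b, same_structure, same_information)
```

The information is the same only to someone who already knows how to read each layout, which is the gap RDF and shared vocabularies aim to close.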
           “2nd aim of Semantic Web”:
                 Data integration
  –   Unstructured and semi-structured sources
      (document collections, message
      traffic, web pages, ...)
  –   Structured data without an explicit data schema
      (non-local databases, data tables, charts and reports, ...)
  –   Non-Text collections (image, video, sound, ...)
  –   Streams of data from sensors, programs, services

Must specify the structure of data resources..
            2nd aim of Semantic Web:
                 Data integration
... so a processor can tell how the "attributes" and
     "values" are related
  –   What is required vs. optional?
  –   How many values for a particular attribute?
  –   What attributes are keys for other attributes?
  –   Which attributes are necessarily related to other
      attributes and in what way??
  –   How do the attributes (and values) in one data
      source map to attributes and values describing
      another source?
               Stack of languages
• XML:
  – Surface syntax, no semantics
• XML Schema:
  – Describes structure of XML documents
• RDF:
  – Datamodel for “relations” between “things”
• RDF Schema (RDFS):
  – RDF Vocabulary Definition Language
• OWL:
  – A more expressive
    Vocabulary Definition Language
       Semantic web languages today
• Today there are three semantic web languages
   – RDF – Resource Description Framework
   – DAML+OIL – DARPA Agent Markup Language
   – OWL – Web Ontology Language
      • OWL Lite
      • OWL DL
      • OWL Full
  RDF is the first Semantic Web language
• RDF is a simple language for building graph-based representations
• The same RDF data model can be viewed in several ways:
   – XML encoding (<rdf:RDF ……..> <….> <….>): good for machine processing
   – Graph: good for human viewing
   – Triples:
        stmt(docInst, rdf_type, Document)
        stmt(personInst, rdf_type, Person)
        stmt(inroomInst, rdf_type, InRoom)
        stmt(personInst, holding, docInst)
        stmt(inroomInst, person, personInst)
                     The RDF Data Model
• An RDF document is an unordered collection of statements, each with a
  subject, predicate and object (aka triples)
• A triple can be thought of as a labelled arc in a graph
• Statements describe properties of web resources
• A resource is any object that can be pointed to by a URI:
   –   a document, a picture, a paragraph on the Web, …
   –   a book in the library, a real person (?)
   –   e.g., isbn://5031-4444-3333
   –   …
• Properties themselves are also resources (URIs)
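The data model above can be sketched in a few lines: a set of (subject, predicate, object) triples that doubles as a labelled graph. Only the `isbn://` identifier comes from the slides; the `http://example.org/...` URIs and the helper names are hypothetical.

```python
from collections import defaultdict

EX = "http://example.org/"  # hypothetical namespace for this sketch
BOOK = "isbn://5031-4444-3333"

# An RDF document is an unordered collection of statements, hence a set.
triples = {
    (EX + "pers05", EX + "authorOf", BOOK),
    (BOOK, EX + "publishedBy", EX + "MIT"),
}


def as_graph(statements):
    """View the triples as a graph: subject -> list of (predicate, object) arcs."""
    graph = defaultdict(list)
    for s, p, o in statements:
        graph[s].append((p, o))
    return graph


def objects(statements, subject, predicate):
    """All objects reachable from subject over arcs labelled predicate."""
    return {o for s, p, o in statements if s == subject and p == predicate}


graph = as_graph(triples)

# Follow two linked arcs: person -> book -> publisher.
books = objects(triples, EX + "pers05", EX + "authorOf")
publishers = {pub for b in books for pub in objects(triples, b, EX + "publishedBy")}
print(publishers)
```

Because the object of one triple can be the subject of another, triples chain into a graph without any schema.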
            RDF without a Schema
• Object -> Attribute -> Value triples
     pers05 --Author-of--> ISBN...
• objects are web-resources
• Value is again an Object:
   • triples can be linked
   • data-model = graph
     pers05 --Author-of--> ISBN... --Publ-by--> MIT

   What does RDF Schema add?

• Defines vocabulary for RDF
• Organizes this vocabulary in a
  typed hierarchy
  • Class, subClassOf, type
  • Property, subPropertyOf
  • domain, range

     Author <--domain-- communicatesTo --range--> Reader
     Frank --type--> Author
     Lynda --type--> Reader
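A small sketch of what the schema vocabulary buys: given `domain` and `range` for `communicatesTo`, the types of Frank and Lynda can be inferred rather than stated. It hand-codes only the RDFS domain, range and subClassOf rules and is not a full RDFS reasoner. The `Author subClassOf Person` statement and the fact "Frank communicatesTo Lynda" are assumptions added for illustration.

```python
RDF_TYPE = "rdf:type"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
SUBCLASS = "rdfs:subClassOf"

statements = {
    ("communicatesTo", DOMAIN, "Author"),
    ("communicatesTo", RANGE, "Reader"),
    ("Author", SUBCLASS, "Person"),       # hypothetical extra schema statement
    ("Frank", "communicatesTo", "Lynda"),  # assumed data triple
}


def rdfs_closure(triples):
    """Apply the domain, range and subClassOf rules until nothing new appears."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for p2, kind, cls in inferred:
                if p2 == p and kind == DOMAIN:
                    new.add((s, RDF_TYPE, cls))   # subject gets the domain class
                if p2 == p and kind == RANGE:
                    new.add((o, RDF_TYPE, cls))   # object gets the range class
            if p == RDF_TYPE:
                for c, kind, parent in inferred:
                    if kind == SUBCLASS and c == o:
                        new.add((s, RDF_TYPE, parent))  # inherit the superclass
        if new <= inferred:
            return inferred
        inferred |= new


closure = rdfs_closure(statements)
derived = sorted(closure - statements)
print(derived)
```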
             Which Semantic Web?
• Version 1:
  "Semantic Web as Web of Data" (TBL)

• recipe:
  expose databases on the web,
  use XML, RDF, integrate
• metadata from:
  – expressing DB schema semantics
    in machine interpretable ways
• enable integration and unexpected re-use
                    Which Semantic Web?
• Version 2:
  “Enrichment of the current Web”

• recipe:
  Annotate, classify, index
• metadata from:
   – automatically producing markup:
     named-entity recognition,
     concept extraction, tagging, etc.
• enable personalization, search, browse,..
             Which Semantic Web?
• Version 1:
  “Semantic Web as Web of Data”

• Version 2:
  “Enrichment of the current Web”

 Different use-cases
 Different techniques
 Different users
    Semantic Web research

 Four popular fallacies
about the Semantic Web
  First: clear up some popular misunderstandings

False statement No. 1:
“Semantic Web people try to
  enforce meaning from the top”

They only “enforce” a language.
They don’t enforce what is said in that language

Compare: HTML “enforced” from the top,
But content is entirely free.
   First: clear up some popular misunderstandings
False statement No. 2:
“The Semantic Web people will require everybody to subscribe to a
  single predefined "meaning" for the terms we use.”

Of course, meaning is fluid, contextual, etc.

Lots of work on (semi)-automatically
bridging between different vocabularies.
  First: clear up some popular misunderstandings

False statement No. 3:
“The Semantic Web will require users to understand
  the complicated details of formalised knowledge representation.”

All of this is “under the hood”.
   First: clear up some popular misunderstandings

False statement No. 4:
“The Semantic Web people will require us to manually markup all the
  existing web-pages.”

Lots of work on automatically producing
semantic markup:

named-entity recognition,
concept extraction, etc.
         Semantic Web research

The current state of Semantic Web
                  4 hard questions about the
                       Semantic Web:
Q1: "where does the meta-data come from?”
 NL technology is delivering on concept-extraction
 Socially emerging (learning from tagging).
Q2: “where do the meta-data-schema
  come from?”
 many handcrafted schema
 hierarchy learning remains hard
 relation extraction remains hard.
Q3: “what to do with many meta-data schema?”
 ontology mapping/aligning remains VERY hard.
Q4: “where's the ‘Web’ in the Semantic Web?”
 more attention to social aspects (P2P, FOAF)
 non-textual media remains hard
 deal with typical Web requirements.
                    Advanced Search

Metadata and semantic web will make advanced search
 much easier
Growth of web metadata.
 Tools that automatically generate metadata

TREC 2008
          Search for Web 3.0

• Natural language queries
• Search agent (avatar) understands and
  anticipates your needs
• Personal life search with avatar
                              The Evolving Web
(Figure: layers of the evolving Web, bottom to top. Berners-Lee, Hendler; Nature, 2001)
• HyperText Markup Language, HyperText Transfer Protocol: DOCUMENTS; foundation of the current Web
• Resource Description Framework, eXtensible Markup Language: self-describing documents
• Proof, Logic and Ontology Languages: DATA/PROGRAMS; shared terms/terminology, machine-machine communication
                  Web Semantics

Semantic Web LayerCake (Berners-Lee, 99;Swartz-Hendler, 2001)
Semantic Web 2008 - ?
  (Jim Hendler - internal talk, Microsoft Labs, July 2008)
Web 4.0 :-?)
      The next 5000 days of the web
• Kevin Kelly
  – Founder of WIRED magazine
  – Video
                  Web 4.0
• Machines talk back!
             Search for Web 4.0

• We get real help
  when we search!

Terminator: The Sarah Connor Chronicles

Cameron’s on our
              What we covered

• The web of data
  – xml, rdf, others
• Web 2.0
  – The social web
• Web 3.0
  – The semantic web
• Future of the web
