


Creating and Exploiting a Web of Semantic Data

•Semantic Web 101
•Recent Semantic Web trends
•Examples: DBpedia, Wikitology
           The Age of Big Data
•Massive amounts of data are available today
•Advances in many fields driven by availability of
 unstructured data, e.g., text, audio, images
•Increasingly, large amounts of structured and
 semi-structured data are also online
•Much of this is available in the Semantic Web
 language RDF, fostering integration and
•Such structured data is especially important for
 the sciences
         Twenty years ago…
Tim Berners-Lee’s 1989 WWW
proposal described a web of
relationships among named objects
unifying many information
management tasks
Capsule history
• Guha’s MCF (~94)
• XML+MCF=>RDF (~96)
• RDF+OO=>RDFS (~99)
• W3C’s SW activity (01)
• W3C’s OWL (03)
• SPARQL, RDFa (08)
• Rules (09)
            Ten years ago ….
•The W3C started
 developing standards
 for the Semantic Web
•The vision, technology
 and use cases are still
•Moving from a web of
 documents to a web of

4.5 billion integrated facts
 published on the Web as
  RDF Linked Open Data

   Large collections of
integrated facts published
   on the Web for many
 disciplines and domains
   W3C’s Semantic Web Goal

“The Semantic Web is an extension of
the current web in which information
is given well-defined meaning, better
enabling computers and people to
work in cooperation.”
-- Berners-Lee, Hendler and Lassila, The
Semantic Web, Scientific American, 2001
Contrast with a non-Web approach

The W3C Semantic Web approach is
•Standards based
How can we share data on the Web?

•POX, Plain Old XML, is one approach, but it has
•The Semantic Web languages RDF and OWL
 offer a simpler and more abstract data model
 (a graph) that is better for integration
•Its well defined semantics supports knowledge
 modeling and inference
•Supported by a stable, funded standards
 organization, the World Wide Web Consortium
                Simple RDF Example
[Figure: RDF graph. The resource ~finin/talks/idm02/ has dc:Title “Intelligent Information Systems on the Web and in the Aether” and is linked to a blank node (note: “blank node”) whose name is “Tim Finin”.]
           The RDF Data Model
•An RDF document is an unordered collection of
 statements, each with a subject, predicate and
 object
•Each such triple can be thought of as a labelled
 arc in a graph
•Statements describe properties of resources
•A resource is any object that can be referenced
 or denoted by a URI
•Properties themselves are also resources (URIs)
•Dereferencing a URI produces useful additional
 information, e.g., a definition or additional facts
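The data model above can be sketched in a few lines of Python. This is a minimal illustration, not an RDF library: the short identifiers are borrowed from the stmt(...) examples used elsewhere in these slides, and the helper names `add_statement` and `outgoing_arcs` are hypothetical. A real RDF graph would use full URIs.

```python
from collections import defaultdict

# An RDF document is an unordered collection of statements, so a Python set
# of (subject, predicate, object) tuples is a natural stand-in.
# Assumption: short names are used here in place of full URIs.
triples = {
    ("docInst", "rdf:type", "Document"),
    ("personInst", "rdf:type", "Person"),
    ("personInst", "holding", "docInst"),
}


def add_statement(graph, subject, predicate, obj):
    """Add one statement; duplicates are ignored because the graph is a set."""
    graph.add((subject, predicate, obj))


def outgoing_arcs(graph, subject):
    """Return the labelled arcs leaving a node as {predicate: {objects}}."""
    arcs = defaultdict(set)
    for s, p, o in graph:
        if s == subject:
            arcs[p].add(o)
    return dict(arcs)


# Re-asserting an existing statement does not change the graph.
add_statement(triples, "docInst", "rdf:type", "Document")
person_arcs = outgoing_arcs(triples, "personInst")
```

Because the graph is just a set of triples, merging two RDF graphs amounts to a set union, which is one reason the model suits data integration.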
      RDF is the first SW language
[Figure: three views of the same RDF content]
•XML encoding (<rdf:RDF ……..>): good for machine processing
•Data model (graph)
•Triples: good for storage and reasoning
   stmt(docInst, rdf_type, Document)
   stmt(personInst, rdf_type, Person)
   stmt(inroomInst, rdf_type, InRoom)
   stmt(personInst, holding, docInst)
   stmt(inroomInst, person, personInst)
•RDF is a simple language for graph based
            XML encoding for RDF
<rdf:RDF xmlns:rdf=""
<description about="">
 <dc:title>Intelligent Information … and in the Aether</dc:title>
    <bib:Name>Tim Finin</bib:Name>
    <bib:Aff resource="" />
 </dc:Creator>
</description>

         N3 is a friendlier encoding

@prefix rdf: .
@prefix dc: .
@prefix bib: .
  dc:title "Intelligent ... and in the Aether" ;
   [ bib:Name "Tim Finin";
     bib:Email "" ;
     bib:Aff "" ] .

    RDFS supports simple inferences
• RDF Schema adds vocabulary for classes, properties & constraints
• An RDF ontology plus some RDF statements may imply additional
  RDF statements (not possible in XML)
• Note that this is part of the data model and not of the accessing or
  processing code.
Ontology and asserted fact:
@prefix rdfs: <http://www.....>.
@prefix : <genesis.n3>.
parent a rdf: property;
       rdfs:domain person;
       rdfs:range person.
mother rdfs:subProperty parent;
       rdfs:domain woman;
       rdfs:range person.
eve mother cain.

Implied statements:
person a class.
woman subClass person.
mother a property.
eve a person;
    a woman;
    parent cain.
cain a person.
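The kind of inference described on this slide can be sketched as a small fixed-point loop. This is an illustrative sketch, not a full RDFS reasoner: it applies only the domain, range and sub-property rules, the function name `rdfs_closure` is hypothetical, and it uses the standard spelling rdfs:subPropertyOf where the slide writes rdfs:subProperty.

```python
TYPE = "rdf:type"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
SUBPROP = "rdfs:subPropertyOf"


def rdfs_closure(triples):
    """Apply three RDFS rules repeatedly until no new triple appears."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            # Look up schema statements whose subject is the predicate p.
            for schema_s, schema_p, schema_o in inferred:
                if schema_s != p:
                    continue
                if schema_p == SUBPROP:
                    new.add((s, schema_o, o))      # s p o  =>  s superProp o
                elif schema_p == DOMAIN:
                    new.add((s, TYPE, schema_o))   # subject is in the domain class
                elif schema_p == RANGE:
                    new.add((o, TYPE, schema_o))   # object is in the range class
        if new <= inferred:
            return inferred
        inferred |= new


# The genesis example from the slide, written as plain tuples.
facts = {
    ("parent", DOMAIN, "person"),
    ("parent", RANGE, "person"),
    ("mother", SUBPROP, "parent"),
    ("mother", DOMAIN, "woman"),
    ("mother", RANGE, "person"),
    ("eve", "mother", "cain"),
}
derived = rdfs_closure(facts) - facts
```

From the single asserted fact "eve mother cain", the loop derives that eve is a woman and a person, that eve is a parent of cain, and that cain is a person. The inference lives in the data model; the calling code knows nothing about mothers or parents.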
     OWL adds further richness
OWL adds richer representational vocabulary, e.g.
 – parentOf is the inverse of childOf
 – Every person has exactly one mother
 – Every person is a man or a woman but not both
 – A man is the equivalent of a person with a sex
   property with value “male”
OWL is based on ‘description logic’ – a logic subset
 with efficient reasoners that are complete
 – Good algorithms for reasoning about descriptions
    That was then, this is now

• 1996-2000: focus on RDF and data
• 2000-2007: focus on OWL,
  developing ontologies, sophisticated
• 2008-…: Integrating and exploiting
  large RDF data collections backed by
  lightweight ontologies
       A Linked Data story
•Wikipedia as a source of knowledge
 –Wikis are a great way to collaborate
  on building up knowledge resources
•Wikipedia as an ontology
 –Every Wikipedia page is a concept or object
•Wikipedia as RDF data
 –Map this ontology into RDF
•DBpedia as the lynchpin for Linked Data
 –Exploit its breadth of coverage to integrate things
Populating Freebase KB
Underlying Powerset’s KB
Mined by TrueKnowledge
Wikipedia as an ontology
 • Using Wikipedia as an ontology
  –each article (~3M) is an ontology concept or instance
  –terms linked via category system (~200k), infobox template
   use, inter-article links, infobox links
  –Article history contains metadata for trust, provenance, etc.
 • It’s a consensus ontology with broad coverage
 • Created and maintained by a diverse community for
 • Multilingual
 • Very current
 • Overall content quality is high
Wikipedia as an ontology
•Uncategorized and miscategorized
•Many ‘administrative’ categories:
 articles needing revision; useless ones:
 1949 births
•Multiple infobox templates for the same
•Multiple infobox attribute names for
 same property
•No datatypes or domains for infobox
     DBpedia: Wikipedia in RDF

•A community effort to extract
 structured information from
 Wikipedia and publish as RDF
 on the Web
•Effort started in 2006 with EU funding
•Data and software open sourced
•DBpedia doesn’t extract information from
 Wikipedia’s text, but from its structured
 information, e.g., links, categories, infoboxes
DBpedia: Linked Data lynchpin
  DBpedia uses WP structured data

DBpedia extracts structured data from
Wikipedia, especially from Infoboxes
          DBpedia ontology
• DBpedia 3.2 (Nov 2008) added a manually
  constructed ontology with
 –170 classes in a subsumption hierarchy
 –880K instances
 –940 properties with domain and range

  Class      Instances
  Place      248,000
  Person     214,000
  Work       193,000
  Species     90,000
  Org.        76,000
  Building    23,000

• A partial, manual mapping was constructed
  from infobox attributes to these terms
• Current domain and range constraints are
• Namespace:
56 properties
50 properties
110 properties
 PREFIX dbp: <>
 PREFIX dbpo: <>
 SELECT distinct ?Property ?Place
 WHERE {dbp:Barack_Obama ?Property ?Place .
        ?Place rdf:type dbpo:Place .}
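What the query above asks for can be mimicked over an in-memory set of triples. This is only a sketch of the matching logic, not a SPARQL engine: the toy triples, the property names dbpo:birthPlace and dbpo:spouse, and the helper `properties_linking_to` are illustrative assumptions, not data taken from DBpedia.

```python
RDF_TYPE = "rdf:type"

# Toy graph (assumed for illustration); prefixes are kept as plain strings.
triples = {
    ("dbp:Barack_Obama", "dbpo:birthPlace", "dbp:Honolulu"),
    ("dbp:Barack_Obama", "dbpo:spouse", "dbp:Michelle_Obama"),
    ("dbp:Honolulu", RDF_TYPE, "dbpo:Place"),
    ("dbp:Michelle_Obama", RDF_TYPE, "dbpo:Person"),
}


def properties_linking_to(graph, subject, wanted_class):
    """Return distinct (property, object) pairs where the object has the given type.

    Mirrors the two triple patterns of the query:
        subject ?Property ?Place .
        ?Place rdf:type wanted_class .
    """
    typed = {s for s, p, o in graph if p == RDF_TYPE and o == wanted_class}
    return sorted({(p, o) for s, p, o in graph if s == subject and o in typed})


rows = properties_linking_to(triples, "dbp:Barack_Obama", "dbpo:Place")
```

Only the birthplace arc survives, because the spouse arc points at a resource that is not typed as a Place. Joining the two patterns on the shared variable is the core of the query.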
DBpedia: Linked Data lynchpin
Consider Baltimore, MD
   Looking at the RDF description

We find assertions equating DBpedia's object for
Baltimore with those in other LOD datasets:
 owl:sameAs census:us/md/counties/baltimore/baltimore;
 owl:sameAs cyc:concept/Mx4rvVin-5wpEbGdrcN5Y29ycA;
 owl:sameAs freebase:guid.9202a8c04000641f800000000004921a;
 owl:sameAs geonames:4347778/ .

Since owl:sameAs is defined as an equivalence
relation, the mapping works both ways
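Because owl:sameAs is an equivalence relation, a set of sameAs links partitions identifiers into groups that all denote the same thing. A minimal sketch using union-find follows. The identifier "dbpedia:Baltimore" and the function name `same_as_classes` are assumptions made for illustration; the other four identifiers are the ones listed on the slide.

```python
def same_as_classes(pairs):
    """Group identifiers connected by sameAs links (symmetric and transitive)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = {}
    for node in list(parent):
        classes.setdefault(find(node), set()).add(node)
    return list(classes.values())


baltimore = "dbpedia:Baltimore"  # hypothetical short name for DBpedia's object
links = [
    (baltimore, "census:us/md/counties/baltimore/baltimore"),
    (baltimore, "cyc:concept/Mx4rvVin-5wpEbGdrcN5Y29ycA"),
    (baltimore, "freebase:guid.9202a8c04000641f800000000004921a"),
    (baltimore, "geonames:4347778/"),
]
groups = same_as_classes(links)
```

All five identifiers end up in one group, so a fact stated about the GeoNames identifier can be reached from the Freebase one even though no direct link between them was asserted.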
Linked Data Cloud, March 2009
   Four principles for linked data
• Use URIs to identify things that you expose to
 the Web as resources
• Use HTTP URIs so that people can locate and
 look up (dereference) these things.
• When someone looks up a URI, provide useful
 information
• Include links to other, related URIs in the
 exposed data as a means of improving
 information discovery on the Web
                            -- Tim Berners-Lee, 2006
       4.5 billion triples for free

•The full public LOD dataset has about 4.5
 billion triples as of March 2009
•Linking assertions are spotty, but probably
 include on the order of 10M equivalences
 –Download the data in RDF
 –Query it via public SPARQL servers
 –Load it as an Amazon EC2 public dataset
 –Launch it and required software as an Amazon
  public AMI image
We’ve been exploring a different approach to
derive an ontology from Wikipedia through a
series of use cases:
– Identifying user context in a collaboration system from
  documents viewed (2006)
– Improve IR accuracy by adding Wikitology tags to
  documents (2007)
– ACE: cross document co-reference resolution for named
  entities in text (2008)
– TAC KBP: Knowledge Base population from text (2009)
– Improve Web search engine by tagging documents and
  queries (2009)
            Wikitology 2.0 (2008)
[Figure: architecture diagram with the labels RDF, graphs, text, Freebase KB, Yago, WordNet, Databases, and Human input & editing]
           Wikitology tagging
•Using Serif’s output, we produced an entity
 document for each entity.
  Included the entity’s name, nominal and pronominal
  mentions, APF type and subtype, and words in a
  window around the mentions
•We tagged entity documents using Wiki-
 tology producing vectors of (1) terms and (2)
 categories for the entity
•We used the vectors to compute features
 measuring entity pair similarity/dissimilarity
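One way to turn two tag vectors into a similarity feature is cosine similarity, which the feature table later in the deck names. The sketch below assumes sparse vectors stored as dictionaries. The weights in `entity_a` are taken from the article tag vector shown elsewhere in these slides, while `entity_b` and `entity_c` are hypothetical vectors invented for the comparison.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse {tag: weight} vectors."""
    dot = sum(w * vec_b[tag] for tag, w in vec_a.items() if tag in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Weights from the Webb Hubbell article tag vector in the slides.
entity_a = {
    "Webster_Hubbell": 1.000,
    "United_States_v._Hubbell": 0.377,
    "Whitewater_controversy": 0.222,
}
# Hypothetical vectors for two other entity documents.
entity_b = {"Webster_Hubbell": 0.9, "Whitewater_controversy": 0.3}
entity_c = {"Hubbell_Center": 1.0}

sim_ab = cosine_similarity(entity_a, entity_b)
sim_ac = cosine_similarity(entity_a, entity_c)
```

Entities that share highly weighted tags score close to 1, and entities with no tags in common score 0, so the value can be fed directly to a classifier as a pair feature.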
              Wikitology Entity Document & Tags
Wikitology entity document

<DOCNO>ABC19980430.1830.0091.LDC2000T44-E2 </DOCNO>
Name: Webb Hubbell
Type & subtype: Individual
Mention heads:
  NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell"
  PRO: "he" "him" "his"
Words surrounding mentions:
  abc's accountant after again ago all alleges alone also and arranged
  attorney avoid been before being betray but came can cat charges cheating
  circle clearly close concluded conspiracy cooperate counsel counsel's
  department did disgrace do dog dollars earned eightynine enough evasion
  feel financial firm first four friend friends going got grand happening has he
  help him his hope house hubbell hubbells hundred hush income increase
  independent indict indicted indictment inner investigating jackie
  jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie
  little make many mickey mid money mr my nineteen nineties ninetyfour
  not nothing now office other others paying peter_jennings president's
  pressure pressured probe prosecutors questions reported reveal rock
  saddened said schemed seen seven since starr statement such tax taxes tell
  them they thousand time today ultimately vernon washington webb
  webb_hubbell were what's whether which white whitewater why wife
  years

Wikitology article tag vector
  Webster_Hubbell                              1.000
  Hubbell_Trading_Post National Historic Site  0.379
  United_States_v._Hubbell                     0.377
  Hubbell_Center                               0.226
  Whitewater_controversy                       0.222

Wikitology category tag vector
  Clinton_administration_controversies         0.204
  American_political_scandals                  0.204
  Living_people                                0.201
  1949_births                                  0.167
  People_from_Arkansas                         0.167
  Arkansas_politicians                         0.167
  American_tax_evaders                         0.167
  Arkansas_lawyers                             0.167
                 Top Ten Features (by F1)
Prec.   Recall    F1       Feature Description
90.8%    76.6%     83.1%   some NAM mention has an exact match
92.9%    71.6%     80.9%   Dice score of NAM strings (based on the intersection of NAM strings, not words or n-grams of NAM strings)
95.1%    65.0%     77.2%   the/a longest NAM mention is an exact match
86.9%    66.2%     75.1%   Similarity based on cosine similarity of Wikitology Article Medium article tag vector
86.1%    65.4%     74.3%   Similarity based on cosine similarity of Wikitology Article Long article tag vector
64.8%    82.9%     72.8%   Dice score of character bigrams from the 'longest' NAM
95.9%    56.2%     70.9%   all NAM mentions have an exact match in the other pair
85.3%    52.5%     65.0%   Similarity based on a match of entities' top Wikitology article
85.3%    52.3%     64.8%   Similarity based on a match of entities' top Wikitology article
85.7%    32.9%     47.5%   Pair has a known alias
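The F1 column is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it for three rows of the table; the function name `f1_score` is ours, and the inputs are the percentages as printed.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, F1 % as reported) for the first, second and last rows.
rows = [
    (90.8, 76.6, 83.1),
    (92.9, 71.6, 80.9),
    (85.7, 32.9, 47.5),
]
recomputed = [round(f1_score(p, r), 1) for p, r, _ in rows]
```

The recomputed values agree with the reported column, and the last row shows how a feature with high precision but low recall is pulled down by the harmonic mean.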
    Knowledge Base Population
•The 2009 NIST Text Analysis Conference (TAC)
 will include a new Knowledge Base Population
 (KBP) track
•Goal: discover information about named
 entities (people, organizations, places) and
 incorporate it into a KB
•TAC KBP has two related tasks:
 –Entity linking: doc. entity mention -> KB entity
 –Slot filling: given a document entity mention, find
  missing slot values in large corpus
  KBs and IE are Symbiotic
•KB info helps interpret text
•IE helps populate KBs
[Figure label: “from Text”]
Wikitology 3.0
[Figure: architecture diagram with the labels Application Algorithms, Application Specific Algorithms, Wikitology, IR collection, Articles, Category Graph, Page Link Graph, RDF reasoner, Relational Database, Triple Store, and Linked Semantic Web data]
     Wikipedia’s social network

•Wikipedia has an implicit ‘social network’ that
 can help disambiguate PER mentions
•Resolving PER mentions in a short document to
 KB people who are linked in the KB is good
•The same can be done for the network of ORG
 and GPE entities
                WSN Data

•We extracted 213K people from DBpedia’s
 Infobox dataset, ~30K of which participate in
 an infobox link to another person
•We extracted 875K people from Freebase,
 616K of which were linked to Wikipedia pages,
 431K of which are in one of 4.8M person-person
 article links
•Consider a document that mentions two
 people: George Bush and Mr. Quayle
   Which Bush & which Quayle?

Six George Bushes   Nine Male Quayles
          A simple closeness metric

Let Si = {two-hop neighbors of entity i}
Cij = |intersection(Si,Sj)| / |union(Si,Sj) |

Cij>0 for six of the 56 possible pairs
0.43 George_H._W._Bush -- Dan_Quayle
0.24 George_W._Bush -- Dan_Quayle
0.18 George_Bush_(biblical_scholar) -- Dan_Quayle
0.02 George_Bush_(biblical_scholar) -- James_C._Quayle
0.02 George_H._W._Bush -- Anthony_Quayle
0.01 George_H._W._Bush -- James_C._Quayle
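The closeness metric is the Jaccard overlap of two-hop neighborhoods, which can be sketched as follows. The slides do not say whether the two-hop set includes direct neighbors or the entity itself; this sketch assumes it includes nodes at distance one or two and excludes the entity. The toy graph and its node names are hypothetical, not the extracted person network.

```python
def two_hop_neighbors(graph, node):
    """Nodes reachable in one or two hops, excluding the node itself."""
    one_hop = set(graph.get(node, ()))
    reachable = set(one_hop)
    for neighbor in one_hop:
        reachable |= set(graph.get(neighbor, ()))
    reachable.discard(node)
    return reachable


def closeness(graph, a, b):
    """Cij = |Si ∩ Sj| / |Si ∪ Sj| over two-hop neighbor sets."""
    s_a = two_hop_neighbors(graph, a)
    s_b = two_hop_neighbors(graph, b)
    union = s_a | s_b
    return len(s_a & s_b) / len(union) if union else 0.0


# Hypothetical undirected link graph stored as symmetric adjacency sets.
graph = {
    "person_a": {"x", "y"},
    "person_b": {"y", "z"},
    "person_c": {"w"},
    "x": {"person_a"},
    "y": {"person_a", "person_b"},
    "z": {"person_b"},
    "w": {"person_c"},
}
c_ab = closeness(graph, "person_a", "person_b")
c_ac = closeness(graph, "person_a", "person_c")
```

Scoring every candidate pair this way and picking the highest value is how a ranking like the one on the slide would pair the mentions. Pairs with no shared neighborhood score zero and drop out.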
      Application to TAC KBP

•Using entity network data extracted from
 DBpedia and Wikipedia provides evidence to
 support KBP tasks:
  –Mapping document mentions into infobox
  –Mapping potential slot fillers into infobox
  –Evaluating the coherence of entities as
   potential slot fillers
•The Semantic Web is a powerful approach
 for data interoperability and integration
•The research focus is shifting to a “Web of Data”
•Many research issues remain: uncertainty,
 provenance, trust, parallel graph algorithms,
 reasoning over billions of triples, user-friendly
 tools, etc.
•Just as the Web enhances human intelligence, the
 Semantic Web will enhance machine intelligence
•The ideas and technology are still evolving
