Querying Wikipedia like a Database
and
An Interlinking-Hub in the Web of Data
Chris Bizer, Sören Auer,
Georgi Kobilarov, Jens Lehmann,
Christian Becker, Sebastian Hellmann
Freie Universität Berlin, Universität Leipzig
Berlin, April 4, 2009
Chris Bizer: Querying Wikipedia Like a Database (4/4/2009)
DBpedia
DBpedia is a community effort to
extract structured information from Wikipedia
make this information available on the Web under an open license
interlink the DBpedia dataset with other open datasets on the Web
Contributors
Freie Universität Berlin (Germany)
Universität Leipzig (Germany)
OpenLink Software (UK)
Linking Open Data Community
(W3C SWEO)
Outline
1. Extracting Structured Information from Wikipedia
2. The DBpedia Dataset
3. Use Cases
1. Improving Wikipedia Search
2. Royalty-Free Data Source for other Applications
3. Nucleus for the Emerging Web of Data
Extracting Structured Information from Wikipedia
[Figure: Wikipedia article with labeled parts: Title, Description, Images, Languages, Infoboxes, Web Links, Categorization, Domain-specific Data]
Extracting Structured Information from Wikipedia
http://en.wikipedia.org/wiki/Calgary
<http://dbpedia.org/resource/Calgary>
    dbpedia:native_name "Calgary" ;
    dbpedia:elevation "1048" ;
    dbpedia:population_city "988193" ;
    dbpedia:population_metro "1079310" ;
    dbpedia:mayor_name dbpedia:Dave_Bronconnier ;
    dbpedia:governing_body dbpedia:Calgary_City_Council ;
    ...
using a PHP extraction framework
GPL license
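As a rough illustration of the extraction step, the following Python sketch turns a small, hand-written infobox snippet into triples like the Calgary example. It is not the PHP framework itself: the function name `extract_infobox`, the sample wikitext, and the very simplified parsing rules are assumptions made for this example.

```python
import re

RESOURCE = "http://dbpedia.org/resource/"

def extract_infobox(page_title, wikitext):
    """Return (subject, property, value) triples from '| key = value' infobox lines."""
    subject = "<" + RESOURCE + page_title.replace(" ", "_") + ">"
    triples = []
    for line in wikitext.splitlines():
        match = re.match(r"^\s*\|\s*(\w+)\s*=\s*(.+?)\s*$", line)
        if not match:
            continue  # skip the template header, the closing braces, and empty lines
        key, raw = match.groups()
        link = re.fullmatch(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", raw)
        if link:
            # A wiki link becomes a reference to another resource.
            value = "dbpedia:" + link.group(1).strip().replace(" ", "_")
        else:
            # Anything else is kept as a plain literal.
            value = '"' + raw + '"'
        triples.append((subject, "dbpedia:" + key, value))
    return triples

# Hand-written sample; real infoboxes are far messier than this.
SAMPLE = """{{Infobox settlement
| native_name = Calgary
| elevation = 1048
| mayor_name = [[Dave Bronconnier]]
}}"""

for s, p, o in extract_infobox("Calgary", SAMPLE):
    print(s, p, o, ".")
```

Treating wiki links as resource references and everything else as literals is the simplest possible rule; a real extractor also has to handle units, lists, and nested templates.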
The DBpedia Dataset
Data about 2.6 million “things”
including at least
213,000 persons
328,000 places
57,000 music albums
36,000 films
20,000 companies.
Altogether 274 million pieces of information (RDF triples)
609,000 links to images
3,150,000 links to external web pages
4,878,100 data links into external RDF datasets
Multi-Lingual Abstracts
The dataset contains a short and a long abstract for each
concept.
Short abstracts
English: 2,613,000
German: 391,000
French: 383,000
Dutch: 284,000
Polish: 256,000
Italian: 286,000
Spanish: 226,000
Japanese: 199,000
Portuguese: 246,000
Swedish: 144,000
Chinese: 101,000
DBpedia Use Cases
1. Improving Wikipedia Search
2. Royalty-Free Data Source for other Applications
3. Nucleus for the Emerging Web of Data
1. Improving Wikipedia Search
The DBpedia SPARQL Endpoint:
http://dbpedia.org/sparql
can answer SPARQL queries like
Give me all sitcoms that are set in NYC?
All German musicians that were born in Berlin in the 19th century?
All tennis players from Moscow?
All soccer players with shirt number 11 who play for a club whose
stadium has over 40,000 seats and who were born in a country with over 10
million inhabitants?
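To make the first question concrete, here is a minimal Python sketch that phrases it as a SPARQL query and encodes it as a GET request URL for the endpoint above. The `ex:` vocabulary and the resource names `dbpedia:Sitcom` and `dbpedia:New_York_City` are placeholders assumed for illustration; the actual property and resource names would have to be looked up in the dataset. The sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named on the slide

# Hypothetical vocabulary: ex:genre and ex:location are stand-ins, not real DBpedia properties.
SITCOMS_IN_NYC = """
PREFIX ex: <http://example.org/vocab/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
SELECT ?show WHERE {
  ?show ex:genre dbpedia:Sitcom ;
        ex:location dbpedia:New_York_City .
}
""".strip()

def build_query_url(endpoint, query):
    """Encode a query in the 'query' parameter of a GET URL, as the SPARQL protocol allows."""
    return endpoint + "?" + urlencode({"query": query})

url = build_query_url(ENDPOINT, SITCOMS_IN_NYC)
print(url)
```

The more complex questions follow the same pattern, with more triple patterns and `FILTER` conditions in the `WHERE` clause.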
Improving Wikipedia Search
2. Royalty-Free Data Source for other Applications
DBpedia is published under the GNU Free Documentation License
Example use case: SPARQL generated tables within webpages
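One way such a table could be generated is sketched below in Python: a result set in the standard SPARQL JSON results layout (`head.vars` plus `results.bindings`) is rendered as an HTML table. The function name `results_to_html_table` and the sample response are assumptions for this example; the sample reuses the Calgary population figure from the extraction slide.

```python
import json
from html import escape

def results_to_html_table(results):
    """Render a SPARQL JSON result set (head.vars + results.bindings) as an HTML table."""
    columns = results["head"]["vars"]
    rows = ["<tr>" + "".join("<th>" + escape(c) + "</th>" for c in columns) + "</tr>"]
    for binding in results["results"]["bindings"]:
        cells = []
        for c in columns:
            # Unbound variables are simply absent from a binding.
            value = binding.get(c, {}).get("value", "")
            cells.append("<td>" + escape(value) + "</td>")
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"

# Made-up response for illustration, shaped like a SPARQL JSON result.
SAMPLE_JSON = """
{"head": {"vars": ["city", "population"]},
 "results": {"bindings": [
   {"city": {"type": "uri", "value": "http://dbpedia.org/resource/Calgary"},
    "population": {"type": "literal", "value": "988193"}}
 ]}}
"""

print(results_to_html_table(json.loads(SAMPLE_JSON)))
```

Escaping every cell matters here because the values come from a public, community-edited source.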
DBpedia Mobile
Displays Wikipedia data
on a map
Smushes the data with
data from other sources
3. Nucleus for the Emerging Web of Data
The Web of Documents
The Web is a single information space
built on open standards and hyperlinks.
[Diagram: Web browsers and search engines access HTML documents A, B, C, and D over HTTP; the documents are connected by hyperlinks.]
Linked Data
Use RDF and HTTP to
1. publish structured data on the Web,
2. set data links from data in one data
source to data in other data sources.
[Diagram: things described in data sources A, B, C, D, and E, connected by data links across the sources.]
Example Data Links
Out-Bound Link
<http://dbpedia.org/resource/Berlin> owl:sameAs
<http://sws.geonames.org/2950159> .
In-Bound Links
<http://richard.cyganiak.de/foaf.rdf#cygri> foaf:topic_interest
<http://dbpedia.org/resource/Semantic_Web> .
<http://blog.bizer.de/item1143> dc:subject
<http://dbpedia.org/resource/Belarus> .
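The distinction between the two link directions can be sketched in a few lines of Python. The helper `classify_link` and its simplified single-line triple pattern are assumptions for this example; it handles only triples whose subject and object are both URIs in angle brackets, written on one line.

```python
import re

TRIPLE = re.compile(r"<([^>]+)>\s+(\S+)\s+<([^>]+)>\s*\.")
DBPEDIA = "http://dbpedia.org/resource/"

def classify_link(line):
    """Return 'out-bound', 'in-bound', or 'other' for one triple, seen from DBpedia."""
    match = TRIPLE.fullmatch(line.strip())
    if not match:
        raise ValueError("not a simple URI-to-URI triple: " + line)
    subject, _, obj = match.groups()
    if subject.startswith(DBPEDIA) and not obj.startswith(DBPEDIA):
        return "out-bound"  # DBpedia points at an external dataset
    if obj.startswith(DBPEDIA) and not subject.startswith(DBPEDIA):
        return "in-bound"   # an external source points at DBpedia
    return "other"

LINKS = [
    "<http://dbpedia.org/resource/Berlin> owl:sameAs <http://sws.geonames.org/2950159> .",
    "<http://richard.cyganiak.de/foaf.rdf#cygri> foaf:topic_interest <http://dbpedia.org/resource/Semantic_Web> .",
]
for line in LINKS:
    print(classify_link(line), "|", line)
```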
W3C Linking Open Data Project
Community effort to
publish existing open license datasets as Linked Data on the Web
interlink things between different data sources
LOD Datasets on the Web: May 2007
Over 500 million RDF triples.
LOD Datasets on the Web: April 2008
Over 2 billion RDF triples.
LOD Datasets on the Web: March 2009
4.5 billion triples
180 million data links
LOD Datasets on the Web: March 2009
Dataset domains: Music, Online Activities, Geographic, Publications, Cross-Domain, Life Sciences
4.5 billion triples
180 million data links
What can I do with this?
[Diagram: Linked Data browsers, search engines, and Linked Data mashups access, over HTTP, the things in data sources A, B, C, D, and E, which are connected by data links.]
Falcons
DERI Semantic Web Pipes
3. What is next for DBpedia?
Improve the Quality of Extracted Data
Problem
chaotic usage of infoboxes within Wikipedia
Solution
smarter version of the infobox extractor
smushes multiple properties with the same meaning
smushes different infoboxes for the same class
uses knowledge about property ranges
generates a cleaner class hierarchy
Status
First release of the DBpedia “Ontology” in November 2008
Mappings and extraction code are still being improved
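The property merging and range checks listed under Solution can be illustrated with a short Python sketch. The mapping table, the canonical property names, and the integer ranges below are all hypothetical; they are not taken from the DBpedia ontology or its extraction code.

```python
# Hypothetical mapping from raw infobox keys to one canonical property each.
PROPERTY_MAP = {
    "population_city": "populationTotal",
    "population": "populationTotal",
    "pop": "populationTotal",
    "elevation": "elevation",
    "elevation_m": "elevation",
}
INTEGER_PROPERTIES = {"populationTotal", "elevation"}  # assumed property ranges

def normalize(raw_facts):
    """Merge synonymous infobox keys and coerce values to the expected range."""
    clean = {}
    for key, value in raw_facts.items():
        prop = PROPERTY_MAP.get(key)
        if prop is None:
            continue  # unmapped keys are dropped in this sketch
        if prop in INTEGER_PROPERTIES:
            digits = value.replace(",", "").strip()
            if not digits.isdigit():
                continue  # value does not fit the property's range
            value = int(digits)
        clean.setdefault(prop, value)  # the first mapped value wins
    return clean

print(normalize({"pop": "988,193", "elevation_m": "1048", "motto": "Onward"}))
```

Dropping values that fail the range check trades coverage for cleanliness, which matches the goal of a cleaner dataset stated on the slide.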
Better Interfaces for Common Wikipedia Users
Cooperation with neofonie (Berlin search engine company)
Direction: free-text search + facet-browsing
Cross-Language Data Fusion
Opportunity
there are 264 Wikipedia Editions in different languages.
there are cross-language links.
the Italian Wikipedia knows more about Italian villages than
the English one.
the German Wikipedia contains more person infoboxes than
the English one.
Idea
Augment the infobox dataset with facts from other Wikipedia editions.
Result
A much richer DBpedia dataset.
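A minimal Python sketch of the fusion idea: facts about the same thing, taken from several language editions, are merged under a simple priority rule. The function `fuse`, the priority rule, and the sample facts are assumptions for illustration; the slides do not specify a fusion strategy.

```python
def fuse(editions, priority):
    """Merge per-edition fact dictionaries; earlier editions in `priority` win conflicts."""
    fused = {}
    for lang in priority:
        for prop, value in editions.get(lang, {}).items():
            if prop not in fused:
                fused[prop] = (value, lang)  # keep the source edition as provenance
    return fused

# Made-up facts about one Italian village, keyed by Wikipedia edition.
EDITIONS = {
    "en": {"name": "Example Village"},
    "it": {"name": "Example Village", "population": 1250, "patron_saint": "San Rocco"},
}

for prop, (value, lang) in sorted(fuse(EDITIONS, ["en", "it"]).items()):
    print(prop, "=", value, "(from", lang + ")")
```

Recording which edition each value came from keeps the fused dataset auditable when editions disagree.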
Augment DBpedia with Data from External Sources
Opportunity
the Linking Open Data cloud provides lots of useful data
which is not contained in Wikipedia yet.
For instance:
- Eurostat provides additional statistical information about countries.
- MusicBrainz contains additional information about bands.
- Geonames provides additional information about locations.
Idea
Augment DBpedia with additional data from external sources.
Result
A much richer DBpedia dataset.
Live Update
Current Situation
DBpedia update cycle: 3 months
Wikipedia provides us with access to the live update stream
Opportunity
Increase the currency of the DBpedia dataset using this update stream
Result
DBpedia in synchronization with Wikipedia.
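One conceivable way to apply an article change from the update stream is to re-extract the article and replace only the triples that differ. The Python sketch below shows that idea; the function names and the sample data are assumptions, and the new population value is invented for the example.

```python
def diff_triples(old, new):
    """Return (to_delete, to_insert) needed to turn the `old` triples into `new`."""
    old_set, new_set = set(old), set(new)
    return old_set - new_set, new_set - old_set

def apply_update(store, old, new):
    """Apply the difference between two extractions of one article to a triple store."""
    to_delete, to_insert = diff_triples(old, new)
    return (set(store) - to_delete) | to_insert

S = "dbpedia:Calgary"
before = [(S, "dbpedia:elevation", "1048"), (S, "dbpedia:population_city", "988193")]
# The new population value is invented for this example.
after = [(S, "dbpedia:elevation", "1048"), (S, "dbpedia:population_city", "990000")]

to_delete, to_insert = diff_triples(before, after)
print("delete:", sorted(to_delete))
print("insert:", sorted(to_insert))
```

Updating only the changed triples keeps each edit cheap, which is what makes following a live stream feasible compared with a full re-extraction every few months.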
Contribute back to the Wikipedia Community
Opportunity
augmentation with data from the LOD cloud makes the DBpedia dataset
richer than Wikipedia itself.
infobox data is extracted from Wikipedia editions in various languages.
Idea
Extend the Wikipedia authoring environment with
- Suggestions for infobox values
- Cross-language consistency checking for infoboxes
Initiate Wikipedia Clean-Up Cycles
Data-driven search interfaces expose the weaknesses of the Wikipedia
template system.
Preferred items not showing up in end-user interfaces may motivate
Wikipedia editors to use templates more stringently.
Lots of Opportunities for nice Mashups
(mockup)
Thanks!
References
DBpedia
http://dbpedia.org/About
W3C Linking Open Data Project
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
Tutorial: How to Publish Linked Data on the Web
http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/