Querying Wikipedia like a Database
and
An Interlinking-Hub in the Web of Data

Chris Bizer, Sören Auer,
Georgi Kobilarov, Jens Lehmann,
Christian Becker, Sebastian Hellmann

Freie Universität Berlin, Universität Leipzig

Berlin. April 4, 2009                                       Chris Bizer: Querying Wikipedia Like a Database (4/4/2009)
DBpedia

  DBpedia is a community effort to
     extract structured information from Wikipedia
     make this information available on the Web under an open license
     interlink the DBpedia dataset with other open datasets on the Web


  Contributors
     Freie Universität Berlin (Germany)
     Universität Leipzig (Germany)
     OpenLink Software (UK)
     Linking Open Data Community
      (W3C SWEO)




Outline



     1. Extracting Structured Information from Wikipedia
     2. The DBpedia Dataset
     3. Use Cases
          1. Improving Wikipedia Search
          2. Royalty-Free Data Source for other Applications
          3. Nucleus for the Emerging Web of Data




Extracting Structured Information from Wikipedia

  [Figure: elements of a Wikipedia article that can be extracted: title, description, languages, web links, categorization, images, infoboxes, and domain-specific data]
Extracting Structured Information from Wikipedia

                    http://en.wikipedia.org/wiki/Calgary

                    <http://dbpedia.org/resource/Calgary>
                      dbpedia:native_name "Calgary" ;
                      dbpedia:elevation "1048" ;
                      dbpedia:population_city "988193" ;
                      dbpedia:population_metro "1079310" ;
                      mayor_name
                           dbpedia:Dave_Bronconnier ;
                      governing_body
                           dbpedia:Calgary_City_Council ;
                     ...

                   using a PHP extraction framework
                   GPL license
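
The mapping from infobox lines to triples can be illustrated with a minimal Python sketch. It is not the PHP framework itself: the function name `infobox_to_triples`, the simplified wikitext, the two namespace constants, and the rule that a `[[...]]` link becomes a resource URI are all assumptions made for illustration.

```python
import re

# Assumed namespaces for resources and for raw infobox properties.
RESOURCE = "http://dbpedia.org/resource/"
PROPERTY = "http://dbpedia.org/property/"

def infobox_to_triples(page_title, wikitext):
    """Turn '| key = value' infobox lines into (subject, predicate, object) triples."""
    subject = RESOURCE + page_title.replace(" ", "_")
    triples = []
    for line in wikitext.splitlines():
        match = re.match(r"\s*\|\s*([^=|]+?)\s*=\s*(.+?)\s*$", line)
        if not match:
            continue  # not a 'key = value' line
        key, value = match.groups()
        # A value that is exactly one wiki link is treated as a link to a resource.
        link = re.fullmatch(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", value)
        if link:
            obj = RESOURCE + link.group(1).strip().replace(" ", "_")
        else:
            obj = value  # kept as a plain literal string
        triples.append((subject, PROPERTY + key, obj))
    return triples

# Simplified, hand-written infobox text (not copied from Wikipedia).
sample = """{{Infobox settlement
| native_name = Calgary
| elevation = 1048
| mayor_name = [[Dave Bronconnier]]
}}"""

for triple in infobox_to_triples("Calgary", sample):
    print(triple)
```

A real extractor also has to handle nested templates, units, and multi-line values, which this sketch ignores.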


The DBpedia Dataset

  Data about 2.6 million “things”
       including at least
       213,000 persons
       328,000 places
       57,000 music albums
       36,000 films
       20,000 companies.
  Altogether 274 million pieces of information (RDF triples)
     609,000 links to images
     3,150,000 links to external web pages
     4,878,100 data links into external RDF datasets




Multi-Lingual Abstracts

  The dataset contains a short and a long abstract for each
   concept.
  Short abstracts
     English: 2,613,000
     German: 391,000
     French: 383,000
     Dutch: 284,000
     Polish: 256,000
     Italian: 286,000
     Spanish: 226,000
     Japanese: 199,000
     Portuguese: 246,000
     Swedish: 144,000
     Chinese: 101,000

DBpedia Use Cases




     1. Improving Wikipedia Search
     2. Royalty-Free Data Source for other Applications
     3. Nucleus for the Emerging Web of Data




1. Improving Wikipedia Search

  The DBpedia SPARQL Endpoint:
   http://dbpedia.org/sparql
  can answer SPARQL queries like
     Which sitcoms are set in NYC?
     Which German musicians were born in Berlin in the 19th century?
     Which tennis players are from Moscow?
     Which soccer players have jersey number 11, play for a club with a
      stadium of over 40,000 seats, and were born in a country with over 10
      million inhabitants?
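
As a rough illustration, the tennis-player question could be sent to the endpoint as sketched below. The class and property names in the query (`dbo:TennisPlayer`, `dbo:birthPlace`) are assumptions and may not match the vocabulary of a given DBpedia release; only the endpoint URL comes from the slide. The request uses the standard SPARQL protocol `query` parameter and asks for JSON results through the `Accept` header.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"

# Class and property names are illustrative assumptions.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?player WHERE {
  ?player a dbo:TennisPlayer ;
          dbo:birthPlace dbr:Moscow .
}
LIMIT 10
"""

def build_request(endpoint, query):
    """Build an HTTP GET request that carries the query in the 'query' parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )

def run_query(endpoint, query):
    """Send the query and return the result bindings (needs network access)."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        return json.load(response)["results"]["bindings"]

# Building the request does not touch the network; run_query() does.
request = build_request(ENDPOINT, QUERY)
print(request.full_url)
```

Calling `run_query(ENDPOINT, QUERY)` would return one binding per matching player, provided the endpoint is reachable and the assumed vocabulary is in use.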




Improving Wikipedia Search




2. Royalty-Free Data Source for other Applications


  DBpedia is published under the GNU Free Documentation License
  Example use case: SPARQL-generated tables within webpages
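
A SPARQL-generated table could be produced along the following lines. This is a sketch, not the mechanism shown in the talk: the function name `bindings_to_html_table` is hypothetical, and the input is assumed to be a result set in the standard SPARQL JSON results format.

```python
from html import escape

def bindings_to_html_table(results):
    """Render a SPARQL JSON result set (as a dict) as an HTML table string."""
    variables = results["head"]["vars"]
    rows = ["<tr>" + "".join(f"<th>{escape(v)}</th>" for v in variables) + "</tr>"]
    for binding in results["results"]["bindings"]:
        cells = []
        for var in variables:
            # Unbound variables are simply absent from a binding.
            value = binding.get(var, {}).get("value", "")
            cells.append(f"<td>{escape(value)}</td>")
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"

# Hand-written sample result; the population figure is the one quoted
# for Calgary earlier in the talk.
sample = {
    "head": {"vars": ["city", "population"]},
    "results": {"bindings": [
        {"city": {"type": "uri", "value": "http://dbpedia.org/resource/Calgary"},
         "population": {"type": "literal", "value": "988193"}},
    ]},
}

print(bindings_to_html_table(sample))
```

Escaping every cell matters here because the values come from community-edited text.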




DBpedia Mobile

                       Displays Wikipedia data
                        on a map
                       Smushes the data with
                        data from other sources




3. Nucleus for the Emerging Web of Data




The Web of Documents

      The Web is a single information space
      built on open standards and hyperlinks.

      [Figure: Web browsers and search engines access HTML documents A, B, C, and D over HTTP; the documents are connected by hyperlinks.]
Linked Data

    Use RDF and HTTP to
    1. publish structured data on the Web,
    2. set data links from data in one data
       source to data within other data sources.


    [Figure: things described in data sources A, B, C, D, and E, connected across sources by data links.]
Example Data Links

  Out-Bound Link

 <http://dbpedia.org/resource/Berlin> owl:sameAs
 <http://sws.geonames.org/2950159> .



  In-Bound Links

 <http://richard.cyganiak.de/foaf.rdf#cygri> foaf:topic_interest
 <http://dbpedia.org/resource/Semantic_Web> .


 <http://blog.bizer.de/item1143> dc:subject
 <http://dbpedia.org/resource/Belaruss>   .
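
The distinction between out-bound and in-bound links depends only on which side of a triple lies in the DBpedia namespace. A minimal sketch, assuming triples are held as plain Python tuples and that `classify_link` is a hypothetical helper name:

```python
DBPEDIA = "http://dbpedia.org/resource/"

def classify_link(triple):
    """Return 'out-bound', 'in-bound', 'internal', or 'unrelated' for one triple."""
    subject, _predicate, obj = triple
    from_dbpedia = subject.startswith(DBPEDIA)
    to_dbpedia = obj.startswith(DBPEDIA)
    if from_dbpedia and to_dbpedia:
        return "internal"   # both ends are DBpedia resources
    if from_dbpedia:
        return "out-bound"  # DBpedia points to another dataset
    if to_dbpedia:
        return "in-bound"   # another dataset points to DBpedia
    return "unrelated"

links = [
    ("http://dbpedia.org/resource/Berlin", "owl:sameAs",
     "http://sws.geonames.org/2950159"),
    ("http://richard.cyganiak.de/foaf.rdf#cygri", "foaf:topic_interest",
     "http://dbpedia.org/resource/Semantic_Web"),
]
for link in links:
    print(classify_link(link), link[0])
```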




W3C Linking Open Data Project




    Community effort to
       publish existing open-license datasets as Linked Data on the Web
       interlink things between different data sources




LOD Datasets on the Web: May 2007




 Over 500 million RDF triples.
LOD Datasets on the Web: April 2008




 Over 2 billion RDF triples.
LOD Datasets on the Web: March 2009




4.5 billion triples
180 million data links
LOD Datasets on the Web: March 2009



 [Figure: LOD cloud diagram with datasets grouped by domain: Music, Online Activities, Geographic, Publications, Cross-Domain, Life Sciences]

4.5 billion triples
180 million data links
What can I do with this?




       [Figure: Linked Data browsers, search engines, and Linked Data mashups access, over HTTP, the things in data sources A, B, C, D, and E, which are connected by data links.]
Falcons




DERI Semantic Web Pipes




3. What is next for DBpedia?




Improve the Quality of Extracted Data

  Problem
     chaotic usage of infoboxes within Wikipedia

  Solution
     smarter version of the infobox extractor
     smushes multiple properties with the same meaning
     smushes different infoboxes for the same class
     uses knowledge about property ranges
     generates a cleaner class hierarchy

  Status
     First release of the DBpedia “Ontology” in November 2008
     Mappings and extraction code are still being improved
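
The idea of merging properties with the same meaning and using property ranges can be sketched as follows. Both lookup tables are hypothetical examples written for this sketch, not the actual DBpedia mappings, and the policy of dropping unmapped keys is an assumption.

```python
# Hypothetical mapping from raw infobox keys to one canonical property.
PROPERTY_MAP = {
    "birth_place": "birthPlace",
    "birthplace": "birthPlace",
    "place_of_birth": "birthPlace",
    "population_total": "populationTotal",
    "population": "populationTotal",
}

# Hypothetical property ranges, used to parse raw strings into typed values.
PROPERTY_RANGE = {"populationTotal": int}

def normalize(raw_facts):
    """Merge synonymous infobox keys and apply range-based parsing."""
    clean = {}
    for key, value in raw_facts.items():
        prop = PROPERTY_MAP.get(key.strip().lower())
        if prop is None:
            continue  # unmapped keys are dropped in this sketch
        if PROPERTY_RANGE.get(prop) is int:
            value = int(value.replace(",", ""))  # "988,193" becomes 988193
        clean.setdefault(prop, value)  # the first value seen wins
    return clean

print(normalize({"Birthplace": "Berlin", "population": "988,193"}))
```

Keeping the mapping in data rather than in code means new infobox variants can be covered without touching the extractor.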




Better Interfaces for Common Wikipedia Users

  Cooperation with Neophonie (Berlin search engine company)
  Direction: free-text search + facet-browsing




Cross-Language Data Fusion

  Opportunity
     there are 264 Wikipedia editions in different languages.
     there are cross-language links.
     the Italian Wikipedia knows more about Italian villages than
      the English one.
     the German Wikipedia contains more person infoboxes than
      the English one.

  Idea
     Augment the infobox dataset with facts from other Wikipedia editions.

  Result
     A much richer DBpedia dataset.
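
One simple fusion policy is to take each property from the first edition that provides it. The sketch below assumes that cross-language links have already aligned the articles for one entity; the function name `fuse_editions`, the priority order, and all sample values are invented for illustration.

```python
def fuse_editions(facts_by_edition, priority):
    """Merge per-edition facts for one entity; earlier editions in 'priority' win."""
    fused = {}
    provenance = {}
    for edition in priority:
        for prop, value in facts_by_edition.get(edition, {}).items():
            if prop not in fused:
                fused[prop] = value
                provenance[prop] = edition  # remember where the value came from
    return fused, provenance

# Invented values for a hypothetical village.
facts = {
    "en": {"name": "Example Village"},
    "it": {"name": "Villaggio Esempio", "population": 1200, "elevation": 450},
}
fused, provenance = fuse_editions(facts, priority=["en", "it"])
print(fused)
print(provenance)
```

Tracking provenance per property makes it possible to revisit the policy later, for example to prefer the edition that is local to the entity.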




Augment DBpedia with Data from External Sources

  Opportunity
     the Linking Open Data cloud provides lots of useful data
      that is not yet contained in Wikipedia.
     For instance:
          - EuroStat provides additional statistical information about countries.
          - MusicBrainz contains additional information about bands.
          - Geonames provides additional information about locations.

  Idea
     Augment DBpedia with additional data from external sources.

  Result
     A much richer DBpedia dataset.




Live Update

  Current Situation
     DBpedia update cycle: 3 months
     Wikipedia provides us with access to the live update stream

  Opportunity
     Increase the currency of the DBpedia dataset using this update stream

  Result
     DBpedia in synchronization with Wikipedia.
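
Keeping the dataset in sync amounts to replacing the triples of a page whenever that page changes. The sketch below assumes each update event has already been reduced to a page name plus its freshly extracted triples; the format of the actual Wikipedia update stream is not covered here, and the `TripleStore` class and the second population value are invented for illustration.

```python
class TripleStore:
    """Toy in-memory store that keeps the triples extracted from each page."""

    def __init__(self):
        self._by_page = {}

    def apply_update(self, page, new_triples):
        """Replace all triples of a changed page; return (added, removed) sets."""
        old = self._by_page.get(page, set())
        new = set(new_triples)
        if new:
            self._by_page[page] = new
        else:
            self._by_page.pop(page, None)  # page deleted or now empty
        return new - old, old - new

    def triples(self):
        """Return all triples currently held, across pages."""
        return set().union(*self._by_page.values())

store = TripleStore()
store.apply_update("Calgary", [("Calgary", "population_city", "988193")])
# Invented new value, standing in for an edit to the article.
added, removed = store.apply_update(
    "Calgary", [("Calgary", "population_city", "1000000")]
)
print(added, removed)
```

Returning the added and removed sets lets downstream consumers apply the same change without re-reading the whole dataset.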




Contribute back to the Wikipedia Community

  Opportunity
     augmentation with data from the LOD cloud makes the DBpedia dataset
      richer than Wikipedia itself.
     infobox data is extracted from Wikipedia editions in various languages.

  Idea
     Extend the Wikipedia authoring environment with
          - Suggestions for infobox values
          - Cross-language consistency checking for infoboxes


  Initiate Wikipedia Clean-Up Cycles
     Data-driven search interfaces expose the weaknesses of the Wikipedia
      template system.
     Preferred items not showing up in end-user interfaces may motivate
      Wikipedia editors to use templates more stringently.


Lots of Opportunities for nice Mashups

                                                                      (mockup)




Thanks!




 References
     DBpedia
      http://dbpedia.org/About
     W3C Linking Open Data Project
      http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
     Tutorial: How to Publish Linked Data on the Web
      http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
