Web-scale Semantic Social Mash-Ups with Provenance
1. Problem Statement:
As the Web grows ever larger and increasing amounts of data are available on the Web, everyday users want free
and unrestricted access to combine data from multiple sources, often in order to discover particular information about a
particular entity, such as a person, place, or organization. For example, a person may want to know the mobile phone
number of Dan Connolly, and whether or not they have a social connection in common on LinkedIn or some other social
networking site that would help them find a job. This information is spread across multiple sites, and with keyword-based
search engines it is up to the user not only to find the needle in the haystack, but to integrate this knowledge themselves.
An alternative approach would have the user identify the entity they want information about, and then let a program find
and integrate the relevant information across multiple websites. Websites that make this type of information available do so through
specialized APIs, but these have to be integrated on a service by service basis using editors such as Microsoft Popfly.
Companies like Google are releasing APIs like OpenSocial that are claimed to be interoperable Web standards, but in practice
work only under restricted conditions with a limited number of data sources. The problem is that each web service, such
as LinkedIn, has its own “walled garden” of data that is incompatible and not easily “mashed up” with other sources of data.
One solution to this problem is to do the mash-ups based on semantics and open Web standards, by giving overt
semantics to standardized data and then performing the “mash-up” based on these semantics. Many attempts to add semantics to
data seek to do so in an open-ended manner, by giving semantics for at least sizable fragments of natural language. Given
the unreliability of these methods and the inherent ambiguity of natural language, a far less ambitious but potentially Web-
scalable and more practical methodology would be to take advantage of common data formats that already have a clear, if
informal, meaning associated with them, such as business cards, calendars, social networks, and item reviews. In this case,
the user should be able to enter the name and desired data about the entity based on a common data format (such as “All
business card data for Dan Connolly”) and the mash-up will try to retrieve and “fill in the data” for the requested format,
relying on and storing data using open Web standards such as OpenID for identity and Friend-of-a-Friend for social networks.
This approach will be based on the Semantic Web's Resource Description Framework (RDF), a W3C standard for
metadata and data integration, and could operate at Web scale, since each component of the “mash-up” will be given a distinct URI
to serve as the “globally unique foreign key” during the mash-up. In this manner, any data that can be mapped to a common
RDF vocabulary based around already widely-deployed data formats such as vCard can be “mashed-up” or integrated.
While much of the data on the Web lacks any semantics and attempts to add them have been viewed as too complex for
users, with the spread of microformats, an estimated 500 million web-pages have had semantics for these common data
formats explicitly added to them (http://microformats.org/). Furthermore, most of the APIs and structured HTML associated
with “Web 2.0” applications have an agreed-upon semantics that can be mapped to vocabularies such as iCal and vCard.
In this manner, instead of trying to “boot-strap” semantics first, we take advantage of the semantics already available in
large quantities on the Web, and can then, if necessary, supplement them with techniques from information retrieval and natural
language processing such as named-entity recognition and even dependency parsing (Marshall, 2003).
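To make the “globally unique foreign key” idea concrete, the following sketch shows how integration reduces to a union of triples keyed on a shared subject URI. The URI and vCard-style property names are illustrative, not drawn from any real data source:

```python
# Minimal sketch: RDF-style triples as (subject, predicate, object) tuples.
# The URI and property names below are illustrative stand-ins.

def mash_up(*sources):
    """Union triples from several sources; a shared subject URI acts as
    the 'globally unique foreign key' that joins the records."""
    merged = {}
    for triples in sources:
        for s, p, o in triples:
            merged.setdefault(s, set()).add((p, o))
    return merged

linkedin = [("http://example.org/id/dconnolly", "vcard:fn", "Dan Connolly"),
            ("http://example.org/id/dconnolly", "vcard:org", "W3C")]
homepage = [("http://example.org/id/dconnolly", "vcard:tel", "+1-555-0100")]

card = mash_up(linkedin, homepage)
# card["http://example.org/id/dconnolly"] now holds all three facts
```

Because both sources use the same subject URI, no per-service integration code is needed: the join falls out of the data model itself.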
However, another question that arises is the source and quality of the data. Some data sources will be of high
quality, such as LinkedIn, but not easily exportable due to lack of APIs. Taking MySpace as an example, its invalid
HTML makes it difficult even for scrapers to convert to RDF. Other data sources, like Wikipedia, may require some
natural language processing (or spidering of the structured data Wikipedia gives in its “boxes”), a process fraught with
error. Without any sort of pre-processing it is difficult for ordinary users to create trustable mash-ups from their data.
Our solution is to track the provenance of all the data in the mash-up, which includes not only the source of the data
but whatever processing is done to the data, with each step of processing tracked in a step-by-step fashion, like the steps of a
mathematical proof. This will allow users themselves to be aware of where the data comes from, and so “follow their nose”
to the source of the data in the mash-up, as well as to information about the tools used to process the data, and so take into
account the contextual elasticity of the semantics by annotating them explicitly as temporally-dated steps in the
proof. If a user finds an error in the integration, they should be able to correct it by simply removing the source or
component from the “mash-up.” By using a functional framework with tight ties to a formal logic via the Curry-Howard
Isomorphism (Wadler, 1989), provenance-tracking can be built into the very fabric of the mash-up itself, allowing
“provenance for free” with no additional work by mash-up creators. The provenance lets other users see if they trust the
results and correct errors, allowing ordinary users, not only experts, to create data with semantics in order to share this data
with other users. This framework lets users comment on and correct other people's mash-ups, with these changes also being
tracked via provenance information attached as “proofs” of the data, and so allowing the “wisdom of crowds” to be applied
to mash-up data in a principled way. We believe that future integration of such work with graphical interfaces such
as Popfly would allow users themselves to create semantic mash-ups that combine their social networking and
personal data with the vast amount of data in their spreadsheets and documents using open Web standards, a much more
productive method for adding semantics to the Web than relying on experts or an API dominated by a single company.
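As a rough illustration of “provenance for free,” the sketch below wraps each processing component so that applying it automatically appends a temporally-dated step to a proof trace carried alongside the data. The components `tidy` and `extract_name` are hypothetical stand-ins for tools like HTML Tidy:

```python
import datetime

def traced(component):
    """Wrap a processing component so each application appends a
    temporally-dated step to the proof trace carried with the data."""
    def run(data, proof):
        result = component(data)
        stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
        return result, proof + [(component.__name__, stamp)]
    return run

# Hypothetical components standing in for e.g. HTML Tidy or a scraper.
def tidy(html):
    return html.strip()

def extract_name(html):
    return html.replace("<b>", "").replace("</b>", "")

data, proof = traced(tidy)("  <b>Dan Connolly</b>  ", [])
data, proof = traced(extract_name)(data, proof)
# proof lists each step oldest-first: first "tidy", then "extract_name"
```

Removing an untrusted source or component then amounts to deleting the steps it contributed and re-running the remainder of the trace.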
2. Expected Outcome
The expected outcome is a proof-of-concept of using the Semantic Web as a transport and integration format for
powering Web-scale “mash-ups” from heterogeneous sources of data, including Microsoft Office documents, using a
logical and functional framework that builds provenance into the very process of extraction and integration. We will also
demonstrate how these mash-ups can be inserted into HTML code by showing how this functional framework can directly
be embedded into the DOM tree. More concretely, the project will provide a theory-based deliverable, a practical
deliverable that can be used in demonstrations, and guidelines for using our work with Microsoft Office documents.
A formal semantics for N3 Logic and a stabilized N3 syntax: Currently, there is relatively little data (approximately 1
billion instances) capable of being used by the Semantic Web, especially given its Web-scale goals. This can be partly
explained by the confusing serialization of the abstract RDF model into XML as RDF/XML, which almost all users and
developers find unreadable and overly complex. The main informal alternative to RDF/XML is a JSON-like syntax for
RDF called “N3.” However, even this grammar has never been formally defined and so has fragmented into MIT N3,
Turtle, N-Triples, and a fragment of the SPARQL syntax. N3 Logic extends RDF by adding variables and quantification for
querying and reasoning capabilities in RDF. Despite being used by MIT in their work on Policy-aware Data-Mining and
even within the Cleveland Clinic to manage patient records, N3 Logic has yet to have a precisely defined formal
semantics (Berners-Lee, 2007). Once it is given a formal semantics and normalized syntax, we will then proceed to map N3
Logic to a functional framework in a principled manner to allow it to be used to run a large number of components as
functions (such as HTML Tidy, or named-entity and geo-tagging web services), where each step of the process automatically
produces a step in a proof, attached to the output produced by the component. This will be published as an academic paper.
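The flavor of N3 Logic's rules in a functional setting can be sketched as follows. The rule, prefixes, and identifiers are illustrative, and a real implementation would operate over full N3 graphs rather than Python tuples:

```python
# Sketch of one N3-style rule applied by forward chaining.
# Illustrative rule: { ?x foaf:knows ?y } => { ?y foaf:knows ?x }.

def apply_rule(triples, pred, make_conclusion):
    """Apply a single-premise rule; each derived triple carries the
    premise it was derived from as a proof step."""
    derived = []
    for s, p, o in triples:
        if p == pred:
            conclusion = make_conclusion(s, o)
            derived.append((conclusion, ("derived-from", (s, p, o))))
    return derived

kb = [(":dan", "foaf:knows", ":tim")]
results = apply_rule(kb, "foaf:knows", lambda s, o: (o, "foaf:knows", s))
# Each result pairs the new triple with the premise that justified it.
```

Because every conclusion is paired with its premise, the output of a chain of rules is itself the proof that the provenance framework needs.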
A Large-Scale Proof-Of-Concept of Semantic Mash-ups with Provenance: This framework will
then be demonstrated using a diverse set of heterogeneous sources of data. The structured data-set will be created by
selecting named-entities and locations from the Microsoft Live Search queries, and will be supplemented with unstructured
“Web in the wild” data by spidering the click-through results of the selected queries. This data will vary in the
amount of structure already inherent in it. Some of the data will already be structured in the form of RDF gathered
from the Linking Open Data project (http://linkeddata.org/), and this data will have a high amount of structure. Another
source of data will be RSS feeds and microformat-enabled data that contain structured data automatically capable of being
extracted with its semantics via a mechanism like GRDDL, which lets a vocabulary author provide their own transform to
RDF in a self-describing manner for XML and HTML documents. It will be of varying quality but have only some
structure, and so this data will have to be run through multiple components (such as HTML “tidy”) of varying reliability in
order to extract the semantics. Lastly, we will also try to merge both high-quality natural language data from
Wikipedia and less reliable natural language data, using a pipeline of natural language processing tools including part-of-
speech taggers, named-entity recognizers, dependency parsers, and geo-taggers. Any data from the “Web at large” garnered
through usage of the click-through records will likely have no structure and so require more pre-processing than the other
sources of data. The provenance of exactly what component has processed the data is as important as where precisely the data
came from. This large-scale demonstration on real data will be created using N3 Logic and its functional equivalent,
showing how this framework can tackle the problems of processing data from a wide variety of heterogeneous sources
while tracking and optimizing via provenance information attached as proofs. The results will be evaluated with a user-
study comparing it to traditional search engines, and released as open source to run via Web Services on live data.
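For the microformat-enabled portion of the data, a GRDDL-style transform begins by pulling structured values out of HTML. The following minimal sketch extracts vCard “fn” (formatted name) values from hCard-marked HTML; the markup shown is a fabricated example:

```python
from html.parser import HTMLParser

class HCardNames(HTMLParser):
    """Collect hCard 'fn' (formatted name) values from HTML, roughly
    the first step a GRDDL-style transform to RDF would take."""
    def __init__(self):
        super().__init__()
        self.in_fn = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "fn" in classes:
            self.in_fn = True

    def handle_data(self, data):
        if self.in_fn and data.strip():
            self.names.append(data.strip())
            self.in_fn = False

parser = HCardNames()
parser.feed('<div class="vcard"><span class="fn">Dan Connolly</span></div>')
# parser.names now contains the extracted formatted name(s)
```

In the full pipeline, each extracted value would be emitted as an RDF triple and annotated with a proof step naming the transform that produced it.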
Guidelines for Integrating Microsoft Office Documents into the Mash-Ups: As the majority of the world's
digital knowledge is stored in Microsoft Office documents, and Microsoft Office is capable of XML-based output, it should be
possible to create a transformation from this XML to RDF in order to integrate the data into a semantic mash-up. A guideline
will be produced and made freely available on the Web, including any ideas for possible changes in the format. Furthermore,
these transformations will be capable of being integrated into the proof-of-concept framework, so that the demonstration can
feature data from Microsoft Excel being mashed-up with data from sources such as Facebook and Wikipedia.
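The intended XML-to-RDF transformation can be sketched as follows. The spreadsheet XML and the `vcard:` property names here are simplified, hypothetical stand-ins, as real Office XML formats are considerably richer:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified spreadsheet XML standing in for Office output.
sheet = """<rows>
  <row><name>Dan Connolly</name><tel>+1-555-0100</tel></row>
</rows>"""

def rows_to_triples(xml_text, base="http://example.org/row/"):
    """Turn each row into RDF-style triples: the row gets a minted URI
    as subject, and each cell becomes a predicate/object pair."""
    triples = []
    for i, row in enumerate(ET.fromstring(xml_text)):
        subject = base + str(i)
        for cell in row:
            triples.append((subject, "vcard:" + cell.tag, cell.text))
    return triples

triples = rows_to_triples(sheet)
# Each row yields one triple per cell, all sharing the row's minted URI.
```

Once in triple form, the spreadsheet data can be mashed up with any other RDF source by the same URI-keyed merge used elsewhere in the framework.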
3. Timeline:
Jan-March: Create the formal semantics for N3 Logic needed to track provenance. If possible, show its correspondence to
a functional framework via a Curry-Howard Isomorphism. At the same time, gather the needed data-set for demonstration
by processing the Live Search data-set and spidering, based on click-through results and Wikipedia, Semantic Web, and
microformat-based data. Also, create and disseminate a Web-based survey to determine other sources to concentrate on.
April-June: Program the functional provenance-tracking mash-up framework based on the formal semantics
created in previous step. Convert the data-set collected previously to RDF. Produce first academic publication on formal
semantics of proof-based provenance tracing on the Semantic Web.
July-September: Produce proof-of-concept demonstration of how a Semantic Web-enhanced functional framework with
rules can create “mash-ups” from data-set with evaluation. Produce second academic publication detailing the practical use
of the Semantic Web as a transport layer for “mash-ups.”
Oct-November: Set up web services to allow users to dynamically integrate new data with our data-set, and run
experiments to see how adding new data to the data-set affects performance.
December: Investigate the integration of data from Microsoft Office to semantically-enabled data-set, and produce a series
of guidelines on how current Microsoft Office documents can be integrated into Semantic Web “mash-ups.” Project complete.
4. Use of Funds:
Half-time pay for Harry Halpin: $29,250. Note that Harry Halpin is a postgraduate researcher whose income consists
entirely of research grants. If he is funded through this proposal, he will devote at least half-time to this proposal.
10%-time pay for Henry Thompson: $10,000. This lets Henry Thompson participate substantially.
Development Server Costs: $1,820. This is the standard cost of a high-performance development server needed for
developing, unit-testing and model-testing the mash-up framework.
Web Services Costs: $1,340. This is the cost of the virtual servers needed to host the web services for named-entity
recognition, dependency parsing, tokenization, and GRDDL-transformations needed for live demonstrations.
Evaluation Costs: $1,100. The cost of paying each evaluator to evaluate the results of the proof of concept.
Total Requested: $43,510
5. Use of Microsoft Technologies:
This proposal will critically rely on Microsoft technologies. First, for the evaluation corpus we will crucially rely
on the Live Search results and click-through logs, as well as the improved access to Live Search given to us via the SDK, as
detailed in Section 7 on “Dissemination and Evaluation.” Second, in order to facilitate easy integration with and possibly
take advantage of Silverlight-based applications such as Popfly, we will do as much programming as possible in .NET, and
if for some reason this presents difficulties with regard to our schedule, we will try to use a programming framework that is
also compatible with Silverlight. As demonstrated by Halpin in the GRDDL Primer (http://www.w3.org/TR/grddl-primer/),
it is possible to integrate Microsoft Office based document formats into our mash-ups, so our demonstration of this will rely
critically on Microsoft-supported XML formats for Office Applications.
6. Related Research:
There has been steady commercial interest in the creation of “mash-ups,” as exemplified by Microsoft Popfly, Yahoo
Pipes, and the Google Mashup Editor, as well as academic work in this area (Halpin and Thompson, 2006). This area of work
would converge with the work of the W3C XML Processing Model Working Group (http://www.w3.org/XML/Processing/)
and work by commercial companies such as NetKernel (http://www.1060research.com/netkernel/) on distributed pipeline
processing. This would further build on combining the operational semantics of functional XML-processing with logic-
based semantics by researchers such as Simeon and Patel-Schneider (2003). Furthermore, this naturally results in a proof-
based approach to provenance, as has been recently explored by the Decentralized Information Group at MIT (Kagal, 2006)
and McGuinness's Proof Markup Language (Da Silva et al., 2006), although compared to McGuinness's approach our
approach would be based on N3 Logic, whose formal properties are currently unknown, although they should be
discoverable and will likely be a subset of First-Order Logic with desirable properties (Berners-Lee et al., 2007). This
contrasts with options for provenance-tracking using full first-order logic variants such as SWRL
(http://www.w3.org/Submission/SWRL/) and WSML (http://www.w3.org/Submission/WSML/) that already have a number
of undesirable (undecidable and perhaps intractable) properties. As regards data sources, a dependency-parsed version of
English Wikipedia has just been released, and an increasing number of Web 2.0 sites are making their data available via
APIs (Zaragoza, 2007). This research is also a natural continuation of GRDDL (http://www.w3.org/2001/sw/grddl-wg/),
whose tutorial at WWW2007 was fully attended, as well as recent workshops at the WWW conference focused on going
beyond traditional search paradigms such as the I3 Workshop (http://okkam.dit.unitn.it/i3/).
7. Dissemination and Evaluation:
For evaluation, we will conduct a survey (using the Web-based SurveyMonkey) targeted towards both
“Web 2.0” developers and users to determine the relevance of both the queries that we selected from Live Search for our
demo and the sources of data to integrate with semantics. A user-study with selected judges, who will be paid for their
evaluation, will assess the quality of our integration by displaying to them the initial query containing a person, reviewed
product, or location and then displaying the results. Each judgment will fall into one of four categories. The first, an
absolute error, occurs when the proof-of-concept fails to identify relevant information for a query. An ambiguity error occurs when it integrates
data of two distinct referents into a single object, as it might with ambiguous names such as “Patrick Hayes.” Patrick Hayes
is both a Canadian politician and a British computer scientist, so mixing their personal information in a vCard would be an
ambiguity error, although steps will be taken to reduce this possibility through statistical methods. An incompleteness error
will be found when the data integration solution correctly integrates data about the entity specified by the user but the user's
information need is not satisfied. A correct answer is one where the data integration finds information the judge feels is
useful. Furthermore, we will keep track of results where the judge feels that the data integration has revealed “surprising”
facts about the entity they want information about. The judge will also be asked to discover the same information via
traditional searching, and a comparison of searching time and results to semantic-based data integration will be done.
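Tallying judgments into these categories is then straightforward. The category names below follow the taxonomy above, while the sample verdicts are invented for illustration:

```python
from collections import Counter

# Categories from the evaluation design: one correct outcome, three error types.
CATEGORIES = {"correct", "absolute", "ambiguity", "incompleteness"}

def tally(judgments):
    """Count judge verdicts, rejecting anything outside the taxonomy."""
    assert all(j in CATEGORIES for j in judgments), "unknown category"
    return Counter(judgments)

counts = tally(["correct", "ambiguity", "correct", "incompleteness"])
# counts maps each category to how many judgments fell into it
```

Keeping the taxonomy closed in this way makes the per-category error rates directly comparable across data sources of differing structure.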
The results of this research, in particular the formal framework and the results of the proof-of-concept study and
evaluation of using RDF as a transport layer for Web-scale mash-ups, will be submitted to first-rate academic conferences
on logical semantics and the Web respectively. We will release to the public for free use any data from our evaluation
corpus that is already in the public domain. This will not include the queries from Microsoft LiveSearch, but will include
our processing of public domain data such as Wikipedia. We will release under a BSD-style license any transformations
from common data sources to RDF (such as vCard, tagging, and social networking formats), as well as the functional
pipelining and provenance framework. Our report on transforming Microsoft Office XML-based data formats to RDF, along
with any associated open-source transformations, will be released on the Web for public use.
8. Other Support:
At this time there is no additional financial support for this project, although we are actively searching.
9. Qualifications of Principal Investigators:
Harry Halpin is a post-graduate researcher in the School of Informatics at the University of Edinburgh, with interests in
natural language processing, information retrieval, formal semantics, and data integration. He is Chair of the W3C GRDDL
(Gleaning Resource Descriptions from Dialects of Language, a methodology for extracting RDF from XML and HTML,
including microformats) Working Group, and is co-editor of the GRDDL Primer, a primer on integrating data ranging
from social networking to Excel data using RDF. He has academic publications in natural language processing in
conferences such as the ACL and EMNLP, and his most recent work is a study of collaborative tagging published in the
WWW 2007, where he has also chaired a workshop (I3) on semantically-aware alternatives to traditional searching. He is a
member of the W3C Semantic Web Co-ordination Group, where he standardizes semantic vocabularies for well-known
formats such as vCard, and just completed a 3-D mash-up of computing history using patents, videos, and Wikipedia data.
Henry S. Thompson divides his time between the School of Informatics at the University of Edinburgh, where he is
Reader in Informatics, based in the Language Technology Group of the Human Communication Research Centre, and the
World Wide Web Consortium (W3C). He received his Ph.D. in Linguistics from the University of California at Berkeley,
and was affiliated with the Natural Language Research Group at the Xerox Palo Alto Research Center. His current
research is focused on the semantics of markup and XML pipelines. He was a member of the SGML Working Group of
the World Wide Web Consortium which designed XML, and is currently a member of the XML Core and XML Processing
Model Working Groups of the W3C. He has been elected twice to the W3C TAG (Technical Architecture Group), chaired
by Berners-Lee to build consensus around principles of Web architecture and to help coordinate cross-technology
architecture developments outside the W3C. This research project will be carried out at the University of Edinburgh.
References:
Da Silva, P., McGuinness, D. and Fikes, R. A Proof Markup Language for Semantic Web Services. Information
Systems, 31:4, 2006, pp. 381-395.
Halpin, H. and Thompson, H.S. One Document to Bind Them. In Proceedings of the WWW Conference, 2006.
Kagal, L., Berners-Lee,T., Connolly, D. and Weitzner, D. "Using Semantic Web Technologies for Open Policy
Management on the Web", 21st National Conference on Artificial Intelligence, 2006.
Marshall, C.C. and Shipman, F.M. Which Semantic Web? Proceedings of ACM Hypertext 2003, Nottingham, pp. 57-66.
Patel-Schneider, P. and Simeon, J. The Yin/Yang Web: A Unified Model for XML Syntax and RDF Semantics. IEEE
Transactions on Knowledge and Data Engineering, 15:3, 2003, pp. 797–812.
Wadler, P. Theorems for free! Proceedings of the Fourth International Conference on Functional Programming
Languages and Computer Architecture, London, UK, 1989, pp. 347–359.
Zaragoza, H., Rode, H., Mika, P., Atserias, J., Ciaramita, M. and Attardi, G. Ranking Very Many
Typed Entities on Wikipedia. In Proceedings of CIKM '07.