Web-scale Semantic Social Mash-Ups with Provenance

1. Problem Statement:
        As the Web grows ever larger and increasing amounts of data become available on it, everyday users want free and unrestricted access to combine data from multiple sources, often in order to discover information about a particular entity, such as a person, place, or organization. For example, a person may want to know the mobile phone number of Dan Connolly, and whether or not they have a social connection in common on LinkedIn or some other social networking site that would help them find a job. This information is spread throughout multiple sites reached via keyword-based search engines, where it is up to the user not only to find the needle in the haystack, but to integrate this knowledge themselves. An alternative approach would have the user identify the entity they want information about, and then let a program find and integrate the data across multiple websites. Websites that make this type of information available do so through specialized APIs, but these have to be integrated on a service-by-service basis using editors such as Microsoft Popfly. Companies like Google are releasing APIs like OpenSocial that claim to be interoperable Web standards, but actually are not, and work only under restricted conditions with a limited number of data sources. The problem is that each web service, such as LinkedIn, has its own "walled garden" of data that is incompatible and not easily "mashed up" with other sources of data.

        One solution to this problem is to base the mash-ups on semantics and open Web standards, by giving overt semantics to standardized data and then doing the "mash-up" based on these semantics. Many attempts to add semantics to data seek to do so in an open-ended manner, by giving semantics to at least sizable fragments of natural language. Given the unreliability of these methods and the inherent ambiguity of natural language, a far less ambitious but potentially Web-scalable and more practical methodology would be to take advantage of common data formats that already have a clear, if informal, meaning associated with them, such as business cards, calendars, social networks, and item reviews. In this case, the user should be able to enter the name of the entity and the desired data about it in terms of a common data format (such as "all business card data for Dan Connolly"), and the mash-up will try to retrieve and "fill in" the data for the requested format, relying on and storing data using open Web standards such as OpenID for identity and Friend-of-a-Friend for social networks.

        This approach will be based on the Resource Description Framework (RDF), a W3C standard for metadata and data integration, and could scale to the Web since each component of the "mash-up" will be given a distinct URI to serve as the "globally unique foreign key" during the mash-up. In this manner, any data that can be mapped to a common RDF vocabulary based around already widely-deployed data formats such as vCard can be "mashed up," or integrated. While much of the data on the Web lacks any semantics and attempts to add them have been viewed as too complex for users, with the spread of microformats an estimated 500 million web pages have had semantics for these common data formats explicitly added to them (http://microformats.org/). Furthermore, most of the APIs and structured HTML associated with "Web 2.0" applications have agreed-upon semantics that can be mapped to vocabularies such as iCal and vCard. In this manner, instead of trying to "bootstrap" semantics first, we take advantage of the semantics already available in large quantities on the Web, and can then, if necessary, supplement them with techniques from information retrieval and natural language processing, such as named-entity recognition and even dependency parsing (Marshall, 2003).
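
The "URI as globally unique foreign key" idea can be sketched in a few lines. This is a minimal illustration using plain Python tuples of (subject, predicate, object) in place of a full RDF library; the example.org URIs and the data values are hypothetical, while the namespace is the W3C vCard vocabulary mentioned above.

```python
# Minimal sketch of URI-keyed data integration: because both sources name the
# entity with the same URI, "mashing up" is just set union over triples.
CARD = "http://www.w3.org/2006/vcard/ns#"  # W3C vCard vocabulary namespace

# Triples extracted from two different sites about the same person
# (illustrative URIs and values, not real data).
from_site_a = {
    ("http://example.org/people/dconnolly", CARD + "fn", "Dan Connolly"),
    ("http://example.org/people/dconnolly", CARD + "tel", "+1-555-0100"),
}
from_site_b = {
    ("http://example.org/people/dconnolly", CARD + "org", "W3C"),
}

def mash_up(*graphs):
    """Integrate triples from several sources into one graph."""
    merged = set()
    for g in graphs:
        merged |= g
    return merged

merged = mash_up(from_site_a, from_site_b)
# All three facts now describe one subject URI, with no per-service glue code.
assert len(merged) == 3
```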

        However, another question that arises is the source and quality of the data. Some data sources, such as LinkedIn, will be of high quality but not easily exportable due to a lack of APIs. MySpace is another example: its invalid HTML makes it difficult even for scrapers to convert the data to RDF. Other data sources, like Wikipedia, may require some natural language processing (or spidering of the structured data Wikipedia gives in its "infoboxes"), a process fraught with error. Without any sort of pre-processing, it is difficult for ordinary users to create trustable mash-ups from this data.

        Our solution is to track the provenance of all the data in the mash-up, which includes not only the source of the data but whatever processing is done to it, with each step of processing tracked in a step-by-step fashion, like the steps of a mathematical proof. This will allow users themselves to be aware of where the data comes from, and so "follow their nose" to the source of the data in the mash-up, as well as to information about the tools used to process the data, and so take into account as much of the contextual elasticity of the semantics as possible by annotating them explicitly as temporally-dated steps in the proof. If the user finds an error in the integration, they should be able to correct it by simply removing the offending source or component from the "mash-up." By using a functional framework with tight ties to a formal logic via the Curry-Howard Isomorphism (Wadler, 1989), provenance-tracking can be built into the very fabric of the mash-up itself, allowing "provenance for free" with no additional work by mash-up creators. The provenance lets other users see whether they trust the results and correct errors, allowing ordinary users, not only experts, to create data with semantics and share it with other users. This framework lets users comment on and correct other people's mash-ups, with these changes also tracked via provenance information attached as "proofs" of the data, allowing the "wisdom of crowds" to be applied to mash-up data in a principled way. We believe that future integration of such work with graphical interfaces such as Popfly would allow users themselves to create mash-ups that extract semantics, combining their social networking and personal data with the vast amount of data in their spreadsheets and documents using open Web standards, a much more productive method for adding semantics to the Web than relying on experts or an API dominated by a single company.
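
The "provenance for free" idea can be sketched as a small functional combinator: each processing step is an ordinary function, and composing them automatically records every application as a proof-like step. The function names and record fields below are our own illustration, not N3 Logic syntax or any standard.

```python
# Minimal sketch: compose processing steps left-to-right while accumulating a
# trace of (step-name, input, output) records, analogous to steps in a proof.
def with_provenance(*steps):
    def pipeline(data):
        trace = []
        for step in steps:
            result = step(data)
            trace.append({"step": step.__name__,
                          "input": data,
                          "output": result})
            data = result
        return data, trace
    return pipeline

def tidy_html(text):
    """Stand-in for running a cleanup component such as HTML Tidy."""
    return text.strip()

def extract_name(text):
    """Stand-in for a named-entity recognition component."""
    return text.split(",")[0]

run = with_provenance(tidy_html, extract_name)
value, proof = run("  Dan Connolly, W3C  ")
assert value == "Dan Connolly"
# The proof trace records every component that touched the data, in order.
assert [p["step"] for p in proof] == ["tidy_html", "extract_name"]
```

A user who distrusts the output can inspect `proof` and simply drop the offending step from the composition, mirroring the correction workflow described above.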

2. Expected Outcome
        The expected outcome is a proof-of-concept of using the Semantic Web as a transport and integration format for powering Web-scale "mash-ups" from heterogeneous sources of data, including Microsoft Office documents, using a logical and functional framework that builds provenance into the very process of extraction and integration. We will also demonstrate how these mash-ups can be inserted into HTML code by showing how this functional framework can be directly embedded into the DOM tree. More concretely, the project will deliver a theory-based deliverable, a practical deliverable that can be used in demonstrations, and guidelines for using our work with Microsoft Office documents.

A formal semantics for N3 Logic and a stabilized N3 syntax: Currently, there is relatively little data (approximately 1 billion instances) capable of being used by the Semantic Web, especially given its Web-scale goals. This can be partly explained by the confusing serialization of the abstract RDF model into XML as RDF/XML, which almost all users and developers find unreadable and overly complex. The main informal alternative to RDF/XML is a more readable syntax for RDF called "N3." However, even this grammar has never been formally defined, and so it has fragmented into MIT N3, Turtle, N-Triples, and a fragment of the SPARQL syntax. N3 Logic extends RDF by adding variables and quantification to provide querying and reasoning capabilities. Despite being used by MIT in their work on policy-aware data mining and even within the Cleveland Clinic to manage patient records, N3 Logic has yet to be given a precisely defined formal semantics (Berners-Lee, 2007). Once it has a formal semantics and a normalized syntax, we will map N3 Logic to a functional framework in a principled manner, allowing it to run a large number of components as functions (such as HTML Tidy, or named-entity and geo-tagging web services), where each step of the process automatically produces a step in a proof, attached to the output produced by the component. To be published as an academic paper.
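
The kind of variable-based rule that N3 Logic adds on top of RDF can be sketched as a naive forward-chaining step over triples. The rule below and the derived `indirectlyKnows` property are hypothetical illustrations, though `foaf:knows` is a real Friend-of-a-Friend property.

```python
# Minimal sketch of applying one N3-style rule with variables:
#   { ?x foaf:knows ?y . ?y foaf:knows ?z } => { ?x indirectlyKnows ?z }
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

facts = {
    ("alice", FOAF_KNOWS, "bob"),
    ("bob", FOAF_KNOWS, "carol"),
}

def apply_rule(triples):
    """One forward-chaining pass: join the two body patterns on ?y."""
    derived = set()
    for (x, p1, y) in triples:
        for (y2, p2, z) in triples:
            if p1 == FOAF_KNOWS and p2 == FOAF_KNOWS and y == y2:
                derived.add((x, "indirectlyKnows", z))
    return derived

# The single derivable conclusion, which would carry its rule application
# as a proof step in the full framework.
assert apply_rule(facts) == {("alice", "indirectlyKnows", "carol")}
```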

A Large-Scale Proof-of-Concept of Semantic Mash-ups with Provenance: This framework will then be demonstrated using a diverse set of heterogeneous sources of data. The structured data-set will be created by selecting named entities and locations from Microsoft Live Search queries, and will be supplemented with unstructured "Web in the wild" data by spidering the click-through results of the selected queries. This data will vary according to the amount of structure already inherent in it. Some of the data will already be structured as RDF gathered from the Linked Open Data project (http://linkeddata.org/), and this data will have a high amount of structure. Another source of data will be RSS feeds and microformat-enabled data containing structured data whose semantics can be extracted automatically via a mechanism like GRDDL, which lets a vocabulary author provide their own transform to RDF in a self-describing manner for XML and HTML documents. This data will be of varying quality but have only some structure, and so will have to be run through multiple components (such as HTML Tidy) of varying reliability in order to extract the semantics. Lastly, we will also try to merge both high-quality natural language data from Wikipedia and unreliable natural language data, using a pipeline of natural language processing tools including part-of-speech taggers, named-entity recognition, dependency parsers, and geo-taggers. Any data from the "Web at large" garnered through the click-through records will likely have no structure, and so will require more pre-processing than the other sources of data. The provenance of exactly which components have processed the data is as important as where precisely the data came from. This large-scale demonstration on real data will be created using N3 Logic and its functional equivalent, showing how this framework can tackle the problems of processing data from a wide variety of heterogeneous sources while tracking and optimizing via provenance information attached as proofs. The results will be evaluated with a user study comparing it to traditional search engines, and released as open source to run via Web Services on live data.
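
A microformat extraction step of the sort described above can be sketched using only the standard library. Real GRDDL applies author-supplied XSLT transforms named by the document itself, so this hand-written hCard parser is only an assumed illustration; the name and phone value in the sample markup are fictional.

```python
# Minimal sketch: pull hCard "fn" (formatted name) and "tel" values out of
# class-annotated HTML, the kind of semantics microformats embed in pages.
from html.parser import HTMLParser

class HCardExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = None   # hCard property we are currently inside, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        for prop in ("fn", "tel"):
            if prop in classes:
                self.current = prop

    def handle_data(self, data):
        if self.current:
            self.fields[self.current] = data.strip()
            self.current = None

doc = ('<div class="vcard"><span class="fn">Dan Connolly</span>'
       '<span class="tel">+1-555-0100</span></div>')
parser = HCardExtractor()
parser.feed(doc)
assert parser.fields == {"fn": "Dan Connolly", "tel": "+1-555-0100"}
```

The extracted fields map directly onto a vCard-style RDF vocabulary, making this data the easiest of the three source types to integrate.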

Guidelines for Integrating Microsoft Office Documents into the Mash-Ups: As the majority of the world's digital knowledge is stored in Microsoft Office, and since Microsoft Office is capable of XML-based output, it should be possible to create a transformation from this XML to RDF in order to integrate the data into a semantic mash-up. A guideline will be produced and made available on the Web for all, including any ideas for possible changes to the format. Furthermore, these transformations will be capable of being integrated into the proof-of-concept framework, so that the demonstration can feature data from Microsoft Excel being mashed up with data from sources such as Facebook and Wikipedia.
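
Such an XML-to-RDF transformation can be sketched as follows. Note that the `<row>`/`<cell>` element names and the email predicate usage are a simplified stand-in for illustration, not the actual Office Open XML (SpreadsheetML) schema, which nests cells inside worksheets and shared-string tables; the names and addresses are fictional sample data.

```python
# Minimal sketch: turn spreadsheet-like XML rows into RDF-style triples,
# treating column 1 as the entity label and column 2 as its email.
import xml.etree.ElementTree as ET

XML = """<table>
  <row><cell>Dan Connolly</cell><cell>dan@example.org</cell></row>
  <row><cell>Tim Berners-Lee</cell><cell>timbl@example.org</cell></row>
</table>"""

EMAIL = "http://www.w3.org/2006/vcard/ns#email"  # W3C vCard vocabulary

def rows_to_triples(xml_text, predicate=EMAIL):
    triples = []
    for row in ET.fromstring(xml_text).iter("row"):
        name, email = [c.text for c in row.iter("cell")]
        triples.append((name, predicate, email))
    return triples

triples = rows_to_triples(XML)
assert triples[0] == ("Dan Connolly", EMAIL, "dan@example.org")
```

Once in triple form, the spreadsheet data merges with Facebook or Wikipedia data by the same set-union integration used for every other source.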

3. Schedule:
Jan-March: Create the formal semantics for N3 Logic needed to track provenance. If possible, show its correspondence to a functional framework via a Curry-Howard Isomorphism. At the same time, gather the needed data-set for demonstration by processing the Live Search data-set and spidering, based on click-through results, Wikipedia, Semantic Web, and microformat-based data. Also, create and disseminate a Web-based survey to determine other sources to concentrate on.

April-June: Program the functional provenance-tracking mash-up framework based on the formal semantics created in the previous step. Convert the data-set collected previously to RDF. Produce the first academic publication, on the formal semantics of proof-based provenance tracking on the Semantic Web.

July-September: Produce a proof-of-concept demonstration, with evaluation, of how a Semantic Web-enhanced functional framework with rules can create "mash-ups" from the data-set. Produce the second academic publication, detailing the practical use of the Semantic Web as a transport layer for "mash-ups."

Oct-November: Set up web services to allow users to dynamically integrate new data with our data-set. Run experiments to see how adding new data to the data-set affects performance.

December: Investigate the integration of data from Microsoft Office into the semantically-enabled data-set, and produce a series of guidelines on how current Microsoft Office documents can be integrated into Semantic Web "mash-ups." Finished.

4. Use of Funds:
Half-time pay for Harry Halpin: $29,250. Note that Harry Halpin is a postgraduate researcher whose income consists

entirely of research grants. If he is funded through this proposal, he will devote at least half-time to this proposal.

10%-time pay for Henry Thompson: $10,000. This allows Henry Thompson to participate substantially in the project.

Development Server Costs: $1,820. This is the standard cost of a high-performance development server needed for developing, unit-testing, and model-testing the mash-up framework.

Web Services Costs: $1,340. This is the cost of the virtual servers needed to host the web services for named-entity

recognition, dependency parsing, tokenization, and GRDDL-transformations needed for live demonstrations.

Evaluation Costs: $1,100. The cost of paying the evaluators who will judge the results of the proof of concept.

Total Requested: $43,510

5. Use of Microsoft Technologies:
        This proposal will rely critically on Microsoft technologies. First, for the evaluation corpus we will rely on the Live Search results and click-through logs, as well as the improved access to Live Search given to us via the SDK, as detailed in Section 7 on "Dissemination and Evaluation." Second, in order to facilitate easy integration with, and possibly take advantage of, Silverlight-based applications such as Popfly, we will do as much programming as possible in .NET; if for some reason this presents difficulties with regard to our schedule, we will try to use a programming framework that is also compatible with Silverlight. As demonstrated by Halpin in the GRDDL Primer (http://www.w3.org/TR/grddl-primer/), it is possible to integrate Microsoft Office-based document formats into our mash-ups, so our demonstration of this will rely critically on Microsoft-supported XML formats for Office applications.

6. Related Research:
There has been steady commercial interest in the creation of "mash-ups," as exemplified by Microsoft Popfly, Yahoo Pipes, and the Google Mashup Editor, as well as academic work in this area (Halpin and Thompson, 2006). This work would converge with that of the W3C XML Processing Model Working Group (http://www.w3.org/XML/Processing/) and with work by commercial companies such as NetKernel (http://www.1060research.com/netkernel/) on distributed pipeline processing. It would further build on combining the operational semantics of functional XML processing with logic-based semantics, as pursued by researchers such as Patel-Schneider and Simeon (2003). Furthermore, this naturally results in a proof-based approach to provenance, as has recently been explored by the Decentralized Information Group at MIT (Kagal, 2006) and in McGuinness's Proof Markup Language (Da Silva et al., 2006), although compared to McGuinness's approach ours would be based on N3 Logic, whose formal properties are currently unknown, although they should be discoverable and will likely be a subset of first-order logic with desirable properties (Berners-Lee et al., 2007). This contrasts with options for provenance-tracking using full first-order-logic variants such as SWRL (http://www.w3.org/Submission/SWRL/) and WSML (http://www.w3.org/Submission/WSML/), which already have a number of undesirable (undecidable and perhaps intractable) properties. As regards data sources, a dependency-parsed version of the English Wikipedia has just been released, and an increasing number of Web 2.0 sites are making their data available via APIs (Zaragoza, 2007). This research is also a natural continuation of GRDDL (http://www.w3.org/2001/sw/grddl-wg/), whose tutorial at WWW2007 was fully attended, as well as of recent workshops at the WWW conference focused on going beyond traditional search paradigms, such as the I3 Workshop (http://okkam.dit.unitn.it/i3/).

7. Dissemination and Evaluation:
        For evaluation, we will conduct a survey (using the Web-based SurveyMonkey) targeted towards both "Web 2.0" developers and users, to determine the relevance both of the queries that we selected from Live Search for our demo and of the sources of data to integrate with semantics. A user study with selected judges, who will be paid for their evaluation, will assess the quality of our integration by displaying for them the initial query containing a person, reviewed product, or location, and then displaying the results. There will be four types of outcome. The first, an absolute error, occurs when the proof-of-concept fails to identify relevant information for a query. An ambiguity error occurs when it integrates data about two distinct referents into a single object, as it might with ambiguous names such as "Patrick Hayes": Patrick Hayes is both a Canadian politician and a British computer scientist, so mixing their personal information in a single vCard would be an ambiguity error, although steps will be taken to reduce this possibility through statistical methods. An incompleteness error occurs when the data integration correctly integrates data about the entity specified by the user, but the user's information need is not satisfied. A correct answer occurs when the data integration finds information the judge feels is useful. Furthermore, we will keep track of results where the judge feels that the data integration has revealed "surprising" facts about the entity they want information about. The judge will also be asked to discover the same information via traditional searching, and a comparison of searching time and results between traditional search and semantic-based data integration will be made.

        The results of this research, in particular the formal framework and the results of the proof-of-concept study and the evaluation of using RDF as a transport layer for Web-scale mash-ups, will be submitted to first-rate academic conferences on logical semantics and the Web, respectively. We will release to the public for free use any data from our evaluation corpus that is already in the public domain. This will not include the queries from Microsoft Live Search, but will include our processing of public-domain data such as Wikipedia. We will release under a BSD-style license any transformations from common data sources to RDF (such as vCard, tagging, and social networking formats), as well as the functional pipelining and provenance framework. Our report on transforming Microsoft Office XML-based data formats to RDF, along with any associated open-source transformations, will be released on the Web for public use.

8. Other Support:
At this time there is no additional financial support for this project, although we are actively seeking it.

9. Qualifications of Principal Investigator:

Harry Halpin is a post-graduate researcher in the School of Informatics at the University of Edinburgh, with interests in natural language processing, information retrieval, formal semantics, and data integration. He is Chair of the W3C GRDDL (Gleaning Resource Descriptions from Dialects of Languages, a methodology for extracting RDF from XML and HTML, including microformats) Working Group, and is co-editor of the GRDDL Primer, a primer on integrating data ranging from social networking to Excel data using RDF. He has academic publications in natural language processing at conferences such as ACL and EMNLP, and his most recent work is a study of collaborative tagging published at WWW 2007, where he also chaired a workshop (I3) on semantically-aware alternatives to traditional searching. He is a member of the W3C Semantic Web Coordination Group, where he standardizes semantic vocabularies for well-known formats such as vCard, and he recently completed a 3-D mash-up of computing history using patents, videos, and Wikipedia data.

Henry S. Thompson divides his time between the School of Informatics at the University of Edinburgh, where he is Reader in Informatics, based in the Language Technology Group of the Human Communication Research Centre, and the World Wide Web Consortium (W3C). He received his Ph.D. in Linguistics from the University of California at Berkeley, and was affiliated with the Natural Language Research Group at the Xerox Palo Alto Research Center. His current research is focused on the semantics of markup and XML pipelines. He was a member of the SGML Working Group of the World Wide Web Consortium, which designed XML, and is currently a member of the XML Core and XML Processing Model Working Groups of the W3C. He has twice been elected to the W3C TAG (Technical Architecture Group), chaired by Berners-Lee, which builds consensus around principles of Web architecture to help coordinate cross-technology architecture developments outside the W3C. This research project will be carried out under the auspices of the University of Edinburgh.

10. Bibliography:
Da Silva, P., McGuinness, D. and Fikes, R. A Proof Markup Language for Semantic Web Services. Information Systems, 31:4, 2006, pp. 381-395.

Halpin, H. and Thompson, H.S. One Document to Bind Them. In Proceedings of the WWW Conference, 2006.

Kagal, L., Berners-Lee, T., Connolly, D. and Weitzner, D. Using Semantic Web Technologies for Open Policy Management on the Web. In Proceedings of the 21st National Conference on Artificial Intelligence, 2006.

Marshall, C.C. and Shipman, F.M. Which Semantic Web? In Proceedings of ACM Hypertext 2003, Nottingham, pp. 57-66.

Patel-Schneider, P. and Simeon, J. The Yin/Yang Web: A Unified Model for XML Syntax and RDF Semantics. IEEE Transactions on Knowledge and Data Engineering, 15:3, 2003, pp. 797-812.

Wadler, P. Theorems for Free! In Proceedings of the Fourth International Conference on Functional Programming Languages and Computer Architecture, London, UK, 1989, pp. 347-359.

Zaragoza, H., Rode, H., Mika, P., Atserias, J., Ciaramita, M. and Attardi, G. Ranking Very Many Typed Entities on Wikipedia. In Proceedings of CIKM '07.
