Populating the Semantic Web by Macro-Reading Internet Text
T.M. Mitchell, J. Betteridge, A. Carlson, E. Hruschka, R. Wang
Presented by: Will Darby
Problem
• The Semantic Web promises:
– Standardized ontologies
– Vast machine readable data repositories
– Intelligent agents
• Internet currently contains unstructured documents
– Markup used for visual presentation
– Little or no valuable metadata
• How to migrate the WWW to the Semantic Web?
– Author future documents with explicit ontologies?
– Publish existing databases with semantic web services?
– Augment existing Internet data with automatically extracted semantics?
Approach
• Initially, extract common, redundant data
• “Macro-Read” Internet
– Extract most prevalent facts from large text collection
– Natural language processing to identify simple facts
– Statistically combine evidence to select most likely information
• Ontology-based analysis
– Focus on relevant subset of text corpus
– Ontology defines categories and relations to guide learning
• Semi-supervised machine learning
– Bootstrapped from seed examples for the ontology's categories and relations
– Separate learning systems for HTML and text
– Greater ontology complexity yields higher accuracy
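
Illustration (not from the paper): a minimal sketch of the macro-reading idea in the Approach bullets, assuming a toy corpus and two hand-written "such as" patterns. The names CORPUS, PATTERNS, and macro_read are hypothetical, and simple vote counting stands in for whatever statistical evidence combination the authors' system actually uses.

```python
import re
from collections import Counter

# Hypothetical toy corpus. Macro-reading relies on the same fact
# being stated redundantly across many sentences.
CORPUS = [
    "Cities such as Pittsburgh are growing.",
    "We visited cities such as Pittsburgh and Boston.",
    "Companies such as Pittsburgh Steel were founded here.",
    "Companies such as Google hire engineers.",
    "Many companies such as Google and Microsoft compete.",
]

# Hypothetical ontology: each category is tied to one textual pattern.
# Each pattern captures only the first capitalized word after "such as".
PATTERNS = {
    "City": re.compile(r"[Cc]ities such as ([A-Z]\w+)"),
    "Company": re.compile(r"[Cc]ompanies such as ([A-Z]\w+)"),
}


def macro_read(corpus, patterns, min_count=2):
    """Count pattern matches per (entity, category) pair, then keep the
    best-supported category for each entity if it has enough votes."""
    votes = Counter()
    for sentence in corpus:
        for category, pattern in patterns.items():
            for match in pattern.finditer(sentence):
                votes[(match.group(1), category)] += 1

    best = {}
    for (entity, category), count in votes.items():
        current_count = best.get(entity, (None, 0))[1]
        if count >= min_count and count > current_count:
            best[entity] = (category, count)
    return best


if __name__ == "__main__":
    for entity, (category, count) in macro_read(CORPUS, PATTERNS).items():
        print(f"{entity}: {category} (support={count})")
```

In this sketch a single misleading sentence ("Companies such as Pittsburgh Steel") is outvoted by two sentences supporting Pittsburgh as a City, which is the point of extracting only prevalent, redundantly stated facts.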