Efficient Processing of Semantic Information on the Web
Processing of Semantic Information on the Web
• The amount of information available on the Web is still increasing rapidly.
• (Semi-)Automatic Data Extraction.
• Resource Description Framework (RDF).
• SPARQL is the standard query language for RDF.
• Efficiency and Scalability of query processing.
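To make the RDF and SPARQL bullets concrete, here is a minimal sketch in plain Python, not a real RDF store. It assumes an RDF graph is a list of (subject, predicate, object) triples and evaluates a basic graph pattern, the core of a SPARQL query, with nested loops. The data, the `ex:` and `foaf:` names, and the helper functions `match_pattern` and `evaluate_bgp` are all invented for illustration.

```python
# A tiny in-memory RDF graph: each triple is (subject, predicate, object).
# All data here is made up for illustration.
TRIPLES = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:carol", "foaf:name", "Carol"),
]


def is_var(term):
    """Terms starting with '?' are variables, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Return `binding` extended so that `pattern` matches `triple`, or None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if result.setdefault(p_term, t_term) != t_term:
                return None  # variable already bound to a different value
        elif p_term != t_term:
            return None  # constant term does not match
    return result


def evaluate_bgp(patterns, triples):
    """Evaluate a basic graph pattern (a list of triple patterns)."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Roughly: SELECT ?name WHERE { ex:alice foaf:knows ?x . ?x foaf:name ?name }
QUERY = [("ex:alice", "foaf:knows", "?x"), ("?x", "foaf:name", "?name")]
RESULTS = evaluate_bgp(QUERY, TRIPLES)
print([b["?name"] for b in RESULTS])
```

The nested loops scan every triple for every pattern, which is exactly the kind of cost that makes efficiency and scalability a concern once the graph no longer fits on one machine.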
Efficiency and Scalability: A Variety of Approaches
• Single machine RDF stores
• Parallel Database Approach: Vertica and others
• Approaches based on Hadoop (MapReduce Paradigm)
  – Integration of databases: HadoopDB
  – Language translation
    • Mapping SPARQL to Hadoop/HBase directly
    • Mapping SPARQL to Pig Latin
• Non-Hadoop clusters
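As an illustration of the MapReduce paradigm behind the Hadoop-based approaches, the following sketch joins two triple patterns with a reduce-side join. It is plain Python standing in for a Hadoop job and is not the method of HadoopDB, PigSPARQL, or any other system named above. The data and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are assumptions made for this example.

```python
from collections import defaultdict

# Made-up example data: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:carol", "foaf:name", "Carol"),
]


def map_phase(triples):
    """Emit (join_key, tagged_value) pairs for two triple patterns.

    Pattern A: ?x foaf:knows ?y    (join key is the object, ?y)
    Pattern B: ?y foaf:name ?name  (join key is the subject, ?y)
    """
    for s, p, o in triples:
        if p == "foaf:knows":
            yield o, ("A", s)
        elif p == "foaf:name":
            yield s, ("B", o)


def shuffle(pairs):
    """Group mapper output by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Join all pattern-A matches with all pattern-B matches for one key."""
    left = [v for tag, v in values if tag == "A"]
    right = [v for tag, v in values if tag == "B"]
    for x in left:
        for name in right:
            yield {"?x": x, "?y": key, "?name": name}


def run_job(triples):
    """Run map, shuffle, and reduce in sequence on a single machine."""
    groups = shuffle(map_phase(triples))
    results = []
    for key in sorted(groups):
        results.extend(reduce_phase(key, groups[key]))
    return results


JOINED = run_job(TRIPLES)
for row in JOINED:
    print(row)
```

On a real cluster the mappers and reducers would run in parallel on different nodes, and each additional join in the query would typically cost another pass over the data.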
Cluster-based Parallelism vs Parallel Database/Single Machine RDF-Store
Each technology has its own advantages and problems.
Parallel Database / Single Machine RDF-Store    +    -
Cluster-based Parallelism                       -    +
Loading in the context of Web research follows the Extract-Transform-Load (ETL) scheme.
SPARQL provides a declarative way to specify both the transformation and the queries.
ETL and Querying in the context of Web research
Web documents → Initial RDF graph → RDF store
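The three stages of this pipeline can be sketched as three small functions. This is a toy under stated assumptions, not the tooling used in the talk: the documents are invented, already-parsed records, the `ex:` predicates are hypothetical, and the transformation is hard-coded in Python where the slides propose expressing it declaratively in SPARQL (the comment shows a roughly equivalent CONSTRUCT query).

```python
from collections import defaultdict

# Hypothetical "Web documents": already-parsed records, invented for illustration.
DOCUMENTS = [
    {"url": "http://example.org/a", "title": "Page A",
     "links": ["http://example.org/b"]},
    {"url": "http://example.org/b", "title": "Page B", "links": []},
]


def extract(documents):
    """Extract: turn each document into triples of the initial RDF graph."""
    for doc in documents:
        yield (doc["url"], "ex:title", doc["title"])
        for target in doc["links"]:
            yield (doc["url"], "ex:linksTo", target)


def transform(triples):
    """Transform: derive inverse link triples, roughly
    CONSTRUCT { ?b ex:linkedFrom ?a } WHERE { ?a ex:linksTo ?b }.
    Unlike a bare CONSTRUCT, the original triples are kept as well.
    """
    for s, p, o in triples:
        yield (s, p, o)
        if p == "ex:linksTo":
            yield (o, "ex:linkedFrom", s)


def load(triples):
    """Load: index triples by predicate in a toy in-memory store."""
    store = defaultdict(set)
    for s, p, o in triples:
        store[p].add((s, o))
    return store


STORE = load(transform(extract(DOCUMENTS)))
print(sorted(STORE["ex:linkedFrom"]))
```

Because the stages are generators, triples stream from extraction through transformation into the store without the whole initial graph being materialized first.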
PigSPARQL: Mapping SPARQL to Pig Latin; to appear in Semantic Web Information Management (SWIM 2011)