Distributed RDF Triple Store using DHT based P2P Network

					MapReduce – An overview

        Medha Atre
            (May 7, 2008)
      Dept of Computer Science
    Rensselaer Polytechnic Institute
   Motivation
   MapReduce by Google
   Tools
       Hadoop
       Hbase
   MapReduce for SPARQL
       HRdfStore
   References
Motivation

   MapReduce is inspired by the map and reduce
    primitives present in Lisp and other functional
    programming languages.
   Managing large amounts of data on clusters of machines.
   Processing this data in a distributed fashion without
    aggregating it at a single point.
   Minimal user expertise required to carry out the tasks
    in parallel on a cluster of machines.
     MapReduce by Google
        Input: a set of key-value pairs
        Output: a set of key-value pairs (not
         necessarily same as input)!

map(String key, String value):
   // key: document name
   // value: document contents
   for each word w in value:
      EmitIntermediate(w, "1");

reduce(String key, Iterator values):
   // key: a word
   // values: a list of counts
   int result = 0;
   for each v in values:
      result += ParseInt(v);
   Emit(AsString(result));

                     Example of Word-Count
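
The word-count pseudocode can be tried on a single machine. The following is a minimal sketch in Python, not Google's implementation: the names `map_fn`, `reduce_fn` and `run_mapreduce` are made up for illustration, and the "shuffle" step is simply an in-memory grouping of intermediate pairs by key.

```python
from collections import defaultdict

def map_fn(key, value):
    """key: document name, value: document contents."""
    for word in value.split():
        yield word, "1"

def reduce_fn(key, values):
    """key: a word, values: an iterable of counts (as strings)."""
    return key, str(sum(int(v) for v in values))

def run_mapreduce(documents):
    # Map phase: collect intermediate pairs, grouped by key (the "shuffle").
    groups = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            groups[k].append(v)
    # Reduce phase: one reduce call per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

counts = run_mapreduce({"doc1": "a rose is a rose", "doc2": "a daisy"})
print(counts)
```

In a real deployment the map and reduce calls would run on different machines; only the two user-supplied functions would stay the same.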
MapReduce by Google                (contd..)

   Architecture has a master server
       Map task workers
       Reduce task workers
   Map input split into M splits and distributed to
    Map workers
   Reduce invocations distributed to R nodes by
    partitioning the intermediate key space.
       E.g. Hash(key) mod R
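
A small sketch of the Hash(key) mod R partitioning rule, assuming CRC32 as the hash function (the slides do not say which hash is used). The function name `partition` and the sample words are illustrative only.

```python
import zlib

def partition(key: str, R: int) -> int:
    """Return the index of the reduce task responsible for `key`."""
    # crc32 is a stand-in for Hash(); it is stable across runs,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % R

R = 4
buckets = {r: [] for r in range(R)}
for word in ["rose", "daisy", "a", "is", "rose"]:
    buckets[partition(word, R)].append(word)
print(buckets)
```

Because the partition depends only on the key, every intermediate pair for the same word ends up at the same reduce task, regardless of which map worker produced it.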
MapReduce by Google                   (contd..)

   Uses Google File System (GFS)
   Provides fault-tolerance
   Preserves locality by scheduling jobs on
    machines in the same cluster and keeping
    input data replicated.
   Trade-off in selection of M and R values.
       Master makes O(M + R) scheduling decisions, and
        keeps O(M*R) states in memory
       Typically M = 200,000 and R = 5,000 with 2,000
        worker machines.
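
To make the trade-off concrete, the typical values can be plugged into the two bounds. This is only back-of-the-envelope arithmetic: the size of one state entry (`bytes_per_entry`) is an assumption chosen for illustration, not a figure from the slides.

```python
M, R, workers = 200_000, 5_000, 2_000

scheduling_decisions = M + R      # O(M + R) decisions made by the master
state_entries = M * R             # O(M * R) map/reduce task pairs tracked
bytes_per_entry = 1               # assumption: one byte of state per pair
state_gib = state_entries * bytes_per_entry / 2**30
map_tasks_per_worker = M / workers

print(scheduling_decisions, state_entries, round(state_gib, 2), map_tasks_per_worker)
```

Under that assumption the master holds roughly 1 GB of state, which is why M and R cannot be increased without limit even though finer-grained tasks balance load better.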
Tools

   Hadoop (http://hadoop.apache.org/core/) (This talk does not
    detail the Hadoop APIs)
        Uses Hadoop Distributed File System (HDFS) – specifically
         meant for large distributed data intensive applications
         running on commodity hardware.
        Inspired by Google File System (GFS)
        For MapReduce operations
             One master JobTracker and one slave TaskTracker per cluster node
             Applications specify input/output locations
             Applications supply map and reduce functions by implementing the
              appropriate interfaces and abstract classes.
        Implemented in Java, but applications can be written in other
         languages using Hadoop Streaming
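
With Hadoop Streaming the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below shows the logic of a word-count mapper and reducer as Python generators so that it can be tried locally; the function names are illustrative, and in an actual streaming job each function would be wrapped in a script that loops over `sys.stdin`.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum counts per word; input must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation: sorted() plays the role of the framework's sort phase.
intermediate = sorted(mapper(["a rose is a rose", "a daisy"]))
result = list(reducer(intermediate))
print(result)
```

The reducer relies on its input being sorted by key, which is why the local simulation sorts the mapper output before passing it on.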
Tools           (contd..)

   Hbase (http://wiki.apache.org/hadoop/Hbase)
       Inspired by Google’s Bigtable architecture for distributed
        data storage using sparse tables
       It’s like a multidimensional sorted map, indexed by a row
        key, column key, and a timestamp.
            A column name has the form <family>:<label>
            A single table enforces a set of column families.
            Data in a column family is stored physically close together on
             disk to improve locality while searching.
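
The "multidimensional sorted map" idea can be modelled in a few lines. This is a toy in-memory sketch, not the Hbase API: the class `SparseTable` and its methods are hypothetical, and timestamps are plain integers.

```python
import bisect

class SparseTable:
    """Toy model of a sorted map keyed by (row, column, timestamp)."""

    def __init__(self, families):
        self.families = set(families)  # the table enforces its column families
        self.keys = []                 # sorted list of (row, column, -timestamp)
        self.cells = {}

    def put(self, row, column, timestamp, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        key = (row, column, -timestamp)  # negated so the newest sorts first
        if key not in self.cells:
            bisect.insort(self.keys, key)
        self.cells[key] = value

    def get(self, row, column):
        """Return the most recent value for (row, column), or None."""
        i = bisect.bisect_left(self.keys, (row, column, float("-inf")))
        if i < len(self.keys) and self.keys[i][:2] == (row, column):
            return self.cells[self.keys[i]]
        return None

t = SparseTable(["contents", "anchor", "mime"])
t.put("com.cnn.www", "contents:", 3, "<html>v3")
t.put("com.cnn.www", "contents:", 6, "<html>v6")
t.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
latest = t.get("com.cnn.www", "contents:")
print(latest)
```

Cells that were never written take no space at all, which is what makes the representation suitable for sparse tables.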
   Hbase storage

Row key       Timestamp   contents   anchor                                 mime
                                     anchor:cnnsi.com   anchor:my.look.ca
com.cnn.www   t9                     CNN
              t8                                        CNN.com
              t6          <html..>                                          text/html
              t5          <html..>
              t3          <html..>

                          Hbase table view
   Hbase storage

Row key       Timestamp   Contents
com.cnn.www   t6          <html..>
              t5          <html..>
              t3          <html..>

Row key       Timestamp   Anchor
                          cnnsi.com   my.look.ca
com.cnn.www   t9          CNN
              t8                      CNN.com

Row key       Timestamp   Mime
com.cnn.www   t6          text/html
    Hbase architecture
   Table is a list of data tuples sorted by the row key.
   Physically broken into HRegions, each identified by table name, start
    key and end key.
   Each HRegion is served by an HRegionServer.
   One HStore for each column group.
       HStoreFiles have a B-Tree like structure
   HMaster to control HRegionServers
       META table to store meta info about HRegions and
        HRegionServer locations.
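
Since rows are sorted and each HRegion covers a key range, finding the region for a row key amounts to a binary search over region start keys. The sketch below illustrates that idea only; the contents of `META`, the split points and the function `locate` are made up and do not reflect the real META table layout.

```python
import bisect

# Hypothetical META contents: (start_key, region, server), sorted by start key.
# An empty start key means "from the beginning of the table".
META = [
    ("",  "HRegion1", "HRegionServer1"),
    ("g", "HRegion2", "HRegionServer1"),
    ("n", "HRegion3", "HRegionServer2"),
    ("t", "HRegion4", "HRegionServer2"),
]
START_KEYS = [start for start, _, _ in META]

def locate(row_key):
    """Return (region, server) for the region whose key range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before it is the one that covers row_key.
    i = bisect.bisect_right(START_KEYS, row_key) - 1
    _, region, server = META[i]
    return region, server

print(locate("com.cnn.www"))
print(locate("org.apache"))
```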
  Hbase architecture


  HRegionServer1                         HRegionServer2

HRegion1   HRegion2                  HRegion3    HRegion4


                      HRegion5   HRegion6
MapReduce for SPARQL (HRdfStore)
   Use HRdfStore Data Loader (HDL) to read RDF files
    and organize data in HBase.
   Sparsity of RDF data makes it particularly suited to storage in Hbase
       Hbase’s compression techniques useful
   HRdfStore Query Processor (HQP) executes RDF
    queries on HBase tables.
       SPARQL Query -> Parse tree -> Logical operator tree ->
        Physical operator tree -> Execution
MapReduce for SPARQL
(some more thoughts)

   How to organize RDF data in Hbase?
       Subjects/Objects as row keys
       “Predicates” column family
            Each predicate as “label” e.g. “Predicates-rdf:type”.
       Or predicates as row keys
            Subjects/Objects as column families.
       Convert each SPARQL query into associated query
        for Hbase.
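
One way to picture the first layout (subjects as row keys, one column per predicate) is with nested dictionaries. This is a sketch under assumptions: the column naming `Predicates:<predicate>` follows the <family>:<label> form, the helper names are hypothetical, and the sample triples are invented.

```python
from collections import defaultdict

def load_triples(triples):
    """Build rows of the form {subject: {"Predicates:<predicate>": [objects]}}."""
    table = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        table[s][f"Predicates:{p}"].append(o)
    return table

def match(table, subject, predicate):
    """Answer the triple pattern (subject, predicate, ?o) with one row lookup."""
    return table.get(subject, {}).get(f"Predicates:{predicate}", [])

table = load_triples([
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:carol"),
    ("ex:bob", "rdf:type", "foaf:Person"),
])
print(match(table, "ex:alice", "foaf:knows"))
```

A pattern with a bound subject becomes a single row lookup in this layout, whereas a pattern with only the predicate bound would need a scan over all rows; the alternative layout with predicates as row keys reverses that trade-off.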
MapReduce for SPARQL
(some more thoughts)

   Each RDF triple is mapped to one or more keys and
    stored in Hbase according to these keys.
   Each cluster node is responsible for the triples
    associated with one or more particular keys.
   Map each triple pattern in the SPARQL query to a key
    with associated restrictions e.g. FILTERs.
   Execute the query by mapping the triple patterns to
    cluster nodes associated with those keys.
       This is essentially a Distributed Hash Table (DHT) like system.
       Can employ a different hashing scheme to avoid skew in
        triple distribution as experienced in conventional DHT based
        P2P systems.
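
The scheme above can be sketched as follows, assuming each triple is indexed under its subject, predicate and object, and that a key is assigned to a node by CRC32 modulo the number of nodes. All names, the cluster size and the sample data are illustrative.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size

def node_for(key):
    """Map a key to the cluster node responsible for it."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES

def store(triples):
    """Index each triple under its subject, predicate and object keys."""
    nodes = defaultdict(lambda: defaultdict(set))
    for triple in triples:
        for key in triple:
            nodes[node_for(key)][key].add(triple)
    return nodes

def lookup(nodes, pattern):
    """Route a triple pattern to the node owning its first bound term.

    None marks a variable; at least one term must be bound.
    """
    key = next(term for term in pattern if term is not None)
    candidates = nodes[node_for(key)][key]
    return {t for t in candidates
            if all(p is None or p == v for p, v in zip(pattern, t))}

nodes = store([
    ("ex:a", "rdf:type", "ex:Person"),
    ("ex:b", "rdf:type", "ex:Person"),
    ("ex:a", "ex:knows", "ex:b"),
])
print(lookup(nodes, (None, "rdf:type", "ex:Person")))
```

Frequent keys such as rdf:type concentrate many triples on a single node in this sketch, which is the kind of skew a different hashing scheme would need to address.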
Map-Reduce-Merge (an application)
   Map-Reduce does not work well with heterogeneous datasets
       It does not directly support join.
   Map-Reduce-Merge (as proposed by Yahoo! and UCLA
    researchers) supports features of Map-Reduce while
    providing relational algebra to the list of database
References

   MapReduce - http://labs.google.com/papers/mapreduce.html
   BigTable - http://labs.google.com/papers/bigtable.html
   Hadoop - http://hadoop.apache.org/core/docs/r0.15.3/index.html
   Hbase - http://hadoop.apache.org/hbase/docs/r0.1.1/
   HRdfStore - http://wiki.apache.org/incubator/HRdfStoreProposal
   IBM MapReduce tool for Eclipse -
   Map-Reduce-Merge: Simplified Relational Data
    Processing on Large Clusters, SIGMOD’07.
