Distributed RDF Triple Store using DHT based P2P Network

					MapReduce – An overview

        Medha Atre
            (May 7, 2008)
      Dept of Computer Science
    Rensselaer Polytechnic Institute
   Motivation
   MapReduce by Google
   Tools
       Hadoop
       Hbase
   MapReduce for SPARQL
       HRdfStore
   References
Motivation
   MapReduce is inspired by the map and reduce
    primitives present in Lisp and other functional
    programming languages.
   Managing large amounts of data on clusters of machines.
   Processing this data in a distributed fashion without
    aggregating it at a single point.
   Minimal user expertise required to carry out the tasks
    in parallel on a cluster of machines.
     MapReduce by Google
        Input: a set of key-value pairs
        Output: a set of key-value pairs (not
         necessarily the same as the input)

map(String key, String value):
   // key: document name
   // value: document contents
   for each word w in value:
      EmitIntermediate(w, "1");

reduce(String key, Iterator values):
   // key: a word
   // values: a list of counts
   int result = 0;
   for each v in values:
      result += ParseInt(v);
   Emit(AsString(result));

                     Example of Word-Count
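
The word-count pseudocode above can be tried out on a single machine. The following is a minimal Python sketch, not Google's implementation: `run_mapreduce` is a hypothetical helper that stands in for the framework, and the "shuffle" is just an in-memory dictionary.

```python
from collections import defaultdict

def map_fn(key, value):
    """key: document name, value: document contents."""
    for word in value.split():
        yield word, "1"

def reduce_fn(key, values):
    """key: a word, values: an iterable of counts (as strings)."""
    return str(sum(int(v) for v in values))

def run_mapreduce(documents):
    # Shuffle step: group every intermediate value under its key.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            intermediate[word].append(count)
    # Reduce step: one reduce call per distinct intermediate key.
    return {word: reduce_fn(word, counts) for word, counts in intermediate.items()}

counts = run_mapreduce({"doc1": "a rose is a rose", "doc2": "a daisy"})
```

Here `counts` maps each word to its total as a string, mirroring the string-typed values in the pseudocode.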
MapReduce by Google                (contd..)

   Architecture has a master server
       Map task workers
       Reduce task workers
   Map input is partitioned into M splits, which are
    distributed to Map workers
   Reduce invocations distributed to R nodes by
    partitioning the intermediate key space.
       E.g. Hash(key) mod R
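
As an illustration of the `Hash(key) mod R` rule, here is a small Python sketch. The choice of CRC32 as the hash function is an assumption made for this example; the slide does not say which hash is used.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

R = 4  # number of reduce tasks, chosen arbitrarily for the example
buckets = {r: [] for r in range(R)}
for word in ["apple", "banana", "cherry", "apple"]:
    buckets[partition(word, R)].append(word)
```

Every occurrence of the same key lands in the same bucket, which is what lets a single reduce invocation see all values for that key.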
MapReduce by Google                   (contd..)

   Uses Google File System (GFS)
   Provides fault-tolerance
   Preserves locality by scheduling map tasks on (or
    near) machines that hold a replica of the
    input data.
   Trade-off in selection of M and R values.
       Master makes O(M + R) scheduling decisions and
        keeps O(M*R) state in memory.
       Typically M = 200,000 and R = 5,000, with 2,000
        worker machines.
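
To get a feel for the trade-off, the typical values above can be plugged into the two growth terms. This is only arithmetic on the slide's numbers; the constant factors hidden by the O-notation are not modelled.

```python
M = 200_000        # map tasks
R = 5_000          # reduce tasks
WORKERS = 2_000    # worker machines

scheduling_decisions = M + R           # grows as O(M + R)
state_entries = M * R                  # grows as O(M * R)
map_tasks_per_worker = M // WORKERS    # average number of map tasks per machine

print(scheduling_decisions)   # 205000
print(state_entries)          # 1000000000
print(map_tasks_per_worker)   # 100
```

Having many more tasks than machines helps load balancing, while the M*R term is what eventually limits how large M and R can be made.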
Tools

   Hadoop (this talk does not detail the Hadoop APIs)
        Uses Hadoop Distributed File System (HDFS) – specifically
         meant for large distributed data intensive applications
         running on commodity hardware.
        Inspired by Google File System (GFS)
        For MapReduce operations
             Master JobTracker and one slave TaskTracker per cluster-node
             Applications specify input/output locations
             Supply map and reduce functions implementing appropriate interfaces
              and abstract classes.
        Implemented in Java, but applications can be written in other
         languages
             Using Hadoop Streaming
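
Hadoop Streaming runs the map and reduce steps as external programs that read lines on standard input and write tab-separated key/value lines on standard output. The sketch below shows a word-count mapper and reducer in Python in that style. To keep it runnable without a cluster, both are written as functions over iterables of lines and the framework's sort between the two phases is simulated with `sorted`; in a real job each function would sit in its own script wired to `sys.stdin` and `sys.stdout`.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation of map -> sort -> reduce.
map_output = list(mapper(["a rose is a rose", "a daisy"]))
local_output = list(reducer(sorted(map_output)))
```

The reducer relies on equal keys being adjacent, which is why the sort step between the phases matters.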
Tools           (contd..)

   HBase
       Inspired by Google’s Bigtable architecture for distributed
        data storage using sparse tables
       It’s like a multidimensional sorted map, indexed by a row
        key, column key, and a timestamp.
            A column name has the form <family>:<label>
            Each table enforces a fixed set of column families.
            Column families stored physically close on the disk to improve
             locality while searching.
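
The "multidimensional sorted map" idea can be sketched with nested Python dictionaries. This is a toy model, not the HBase client API: `get` and `scan` are hypothetical helpers, timestamps are plain integers, and column labels are left empty because the example table in this talk shows only family names.

```python
# table[row_key][column][timestamp] -> value, columns named "<family>:<label>"
table = {
    "com.cnn.www": {
        "contents:": {6: "<html..>", 5: "<html..>", 3: "<html..>"},
        "anchor:": {9: "CNN"},
        "mime:": {6: "text/html"},
    }
}

def get(table, row_key, column, timestamp=None):
    """Return (timestamp, value) for the newest version at or before `timestamp`."""
    versions = table.get(row_key, {}).get(column, {})
    candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
    if not candidates:
        return None
    newest = max(candidates)
    return newest, versions[newest]

def scan(table, start_row, stop_row):
    """Yield rows in sorted row-key order, start inclusive, stop exclusive."""
    for row_key in sorted(table):
        if start_row <= row_key < stop_row:
            yield row_key, table[row_key]
```

Cells that were never written simply have no entry, which is how the sparse table avoids storing empty values.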
   Hbase storage

Row key       Timestamp   Contents   Anchor   Mime

com.cnn.www   t9                     CNN
              t6          <html..>            text/html
              t5          <html..>
              t3          <html..>

                          HBase table view
   Hbase storage

Row key       Timestamp   Contents

com.cnn.www   t6          <html..>
              t5          <html..>
              t3          <html..>

Row key       Timestamp   Anchor

com.cnn.www   t9          CNN

Row key       Timestamp   Mime

com.cnn.www   t6          text/html
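
The step from the single table view to the per-family storage view can be sketched as a regrouping of cells. This is an illustration only: `split_by_family` is a hypothetical helper, and the newest-first ordering of timestamps is an assumption taken from the order shown in the tables above.

```python
from collections import defaultdict

table = {
    "com.cnn.www": {
        "contents:": {6: "<html..>", 5: "<html..>", 3: "<html..>"},
        "anchor:": {9: "CNN"},
        "mime:": {6: "text/html"},
    }
}

def split_by_family(table):
    """Regroup table[row][column][ts] into stores[family] = list of cells."""
    stores = defaultdict(list)
    for row_key, columns in table.items():
        for column, versions in columns.items():
            family = column.split(":", 1)[0]
            for ts, value in versions.items():
                stores[family].append((row_key, column, ts, value))
    # Within a family: sort by row key, then column, then newest timestamp first.
    for cells in stores.values():
        cells.sort(key=lambda c: (c[0], c[1], -c[2]))
    return dict(stores)

stores = split_by_family(table)
```

Only cells that exist are copied, so the empty positions of the table view take no space in any family's store.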

    Hbase architecture
   Table is a list of data tuples sorted by the row key.
   Physically split into HRegions, each identified by table
    name, start key, and end key.
   Each HRegion is served by an HRegionServer.
   One HStore for each column family.
       HStoreFiles have a B-tree-like structure.
   HMaster to control HRegionServers
       META table to store meta info about HRegions and
        HRegionServer locations.
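
Because regions cover contiguous, sorted row-key ranges, finding the region for a row key amounts to a binary search over start keys. The sketch below assumes a made-up region list and server assignment; it only illustrates the kind of lookup the META information enables and is not HBase code.

```python
import bisect

# Hypothetical regions of one table: (start_key, end_key, server), sorted by start_key.
# "" as start means "from the beginning"; None as end means "to the end of the table".
regions = [
    ("", "g", "HRegionServer1"),
    ("g", "p", "HRegionServer2"),
    ("p", None, "HRegionServer1"),
]
start_keys = [start for start, _, _ in regions]

def locate(row_key):
    """Return the (start, end, server) entry whose key range contains row_key."""
    index = bisect.bisect_right(start_keys, row_key) - 1
    return regions[index]

start, end, server = locate("com.cnn.www")
```

Start keys are inclusive and end keys exclusive in this sketch, so a row key equal to a boundary belongs to the region that starts there.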
  Hbase architecture


  [Figure: HRegionServer1 and HRegionServer2 serving HRegion1 through HRegion6]
MapReduce for SPARQL (HRdfStore)
   Use HRdfStore Data Loader (HDL) to read RDF files
    and organize data in HBase.
   The sparsity of RDF data makes it well suited to storage in
    HBase.
       HBase's compression techniques are useful here.
   HRdfStore Query Processor (HQP) executes RDF
    queries on HBase tables.
       SPARQL Query -> Parse tree -> Logical operator tree ->
        Physical operator tree -> Execution
MapReduce for SPARQL
(some more thoughts)

   How to organize RDF data in Hbase?
       Subjects/objects as row keys
       “Predicates” column family
            Each predicate as “label” e.g. “Predicates-rdf:type”.
       Or predicates as row keys
            Subjects/Objects as column families.
       Convert each SPARQL query into associated query
        for Hbase.
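
The first layout (subjects as row keys, one "Predicates" column family) can be tried out with plain dictionaries. This is a sketch under assumptions: column names follow the `<family>:<label>` form from the HBase slides, the sample triples are invented, and `match` handles a single triple pattern rather than a full SPARQL query.

```python
from collections import defaultdict

def load_triples(triples):
    """Build rows[subject]["Predicates:<predicate>"] -> set of objects."""
    rows = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        rows[s]["Predicates:" + p].add(o)
    return rows

def match(rows, s=None, p=None, o=None):
    """Answer one triple pattern; None plays the role of a SPARQL variable."""
    # A bound subject is a single-row lookup; otherwise every row is scanned.
    subjects = [s] if s is not None else sorted(rows)
    for subj in subjects:
        for column, objects in sorted(rows.get(subj, {}).items()):
            pred = column.split(":", 1)[1]
            if p is not None and pred != p:
                continue
            for obj in sorted(objects):
                if o is None or obj == o:
                    yield subj, pred, obj

rows = load_triples([
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "foaf:Person"),
])
people = list(match(rows, p="rdf:type", o="foaf:Person"))
```

Patterns with an unbound subject need a full scan in this layout, which is one reason to consider the alternative of predicates as row keys.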
MapReduce for SPARQL
(some more thoughts)

   Each RDF triple is mapped to one or more keys and
    stored in HBase according to these keys.
   Each cluster node being responsible for triples
    associated with one or more particular keys.
   Map each triple pattern in the SPARQL query to a key
    with associated restrictions e.g. FILTERs.
   Execute the query by mapping the triple patterns to
    cluster nodes associated with those keys.
       This is essentially a Distributed Hash Table (DHT)-like system.
       Can employ a different hashing scheme to avoid skew in
        triple distribution as experienced in conventional DHT based
        P2P systems.
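
The key-to-node idea can be sketched as follows. Everything specific here is an assumption made for illustration: the node names, the use of SHA-1 with a modulo as the hashing scheme, indexing each triple under all three of its terms, and routing a pattern by its first bound term.

```python
import hashlib

NODES = ["node0", "node1", "node2", "node3"]  # hypothetical cluster

def node_for(key: str) -> str:
    """Map a key to the node responsible for it."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

def store(triple, placement):
    # Index under subject, predicate, and object so any bound term can route.
    for term in triple:
        placement.setdefault(node_for(term), set()).add(triple)

def route(pattern):
    """Pick the node for the first bound term of a triple pattern (None = variable)."""
    for term in pattern:
        if term is not None:
            return node_for(term)
    return None  # fully unbound pattern: every node would have to be consulted

placement = {}
triple = ("ex:alice", "foaf:knows", "ex:bob")
store(triple, placement)
target = route((None, "foaf:knows", None))
```

With this naive scheme, a very common key such as `rdf:type` sends a large share of triples to one node, which is the skew the last bullet refers to.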
Map-Reduce-Merge (an application)
   Map-Reduce does not work well with heterogeneous datasets.
       It does not directly support joins.
   Map-Reduce-Merge (as proposed by Yahoo! and UCLA
    researchers) supports the features of Map-Reduce while
    adding relational algebra operators.
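
A rough way to picture the extra merge step: run map and reduce separately over two datasets, then join the two key-sorted outputs. The sketch below is a single-machine illustration with invented data and hypothetical helper names; it is not the API from the Map-Reduce-Merge paper, and the `merge` shown is a plain sort-merge join that assumes unique keys on each side.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run map then reduce over one dataset; output is sorted by key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return sorted((key, reduce_fn(key, values)) for key, values in groups.items())

def merge(left, right):
    """Sort-merge join of two key-sorted outputs with unique keys."""
    i = j = 0
    while i < len(left) and j < len(right):
        (lk, lv), (rk, rv) = left[i], right[j]
        if lk == rk:
            yield lk, (lv, rv)
            i += 1
            j += 1
        elif lk < rk:
            i += 1
        else:
            j += 1

employees = [("d1", 100), ("d2", 50), ("d1", 25)]      # (dept_id, bonus)
departments = [("d1", "Research"), ("d3", "Sales")]    # (dept_id, name)

bonus_by_dept = map_reduce(employees, lambda r: [(r[0], r[1])], lambda k, vs: sum(vs))
dept_names = map_reduce(departments, lambda r: [(r[0], r[1])], lambda k, vs: vs[0])
joined = list(merge(bonus_by_dept, dept_names))
```

Only keys present in both outputs survive, so `joined` behaves like an inner join on the department id.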
References
   MapReduce
   BigTable
   Hadoop
   HBase
   HRdfStore
   IBM MapReduce tool for Eclipse
   Map-Reduce-Merge: Simplified Relational Data
    Processing on Large Clusters, SIGMOD'07.
