BitMat Main memory Bit Matrix of RDF Triples

Matrix “Bit”loaded: A Scalable Lightweight Join Query Processor for RDF Data

Medha Atre (1), Vineet Chaoji (2), Mohammed J. Zaki (1), and James A. Hendler (1)
   (1) Dept. of Computer Science, Rensselaer Polytechnic Institute, Troy NY, USA
   (2) Yahoo! Labs, Bangalore, India




                                April 29, 2010
                           WWW 2010, Raleigh NC, USA
Overview
•   Introduction
•   Challenges
•   Motivation
•   BitMat structure
    – Construction & operations
• Query processing algorithm
• Experimental evaluation
• Future roadmap



 Introduction
• RDF (Resource Description Framework) for representing any
  information
   – triple form – [<subject> <predicate> <object>]
   – Depicted as a directed edge
   [Figure: a Subject node joined to an Object node by a directed edge labeled Predicate]




• RDF graphs of hundreds of millions of triples to a few billion
  triples are common nowadays
   –   DBpedia (103 million triples)
   –   UniProt (845 million triples)
   –   US Census (1 billion triples)
   –   Bio2RDF (2.3 billion triples)
   –   Data.gov (5+ billion triples)


 Challenges – Storing RDF Data
• RDF graphs of more than a billion triples (400 GB+ on-disk size).

• Traditional DB-based efforts
   – Jena-TDB (custom indexes and storage)
   – C-store
   – MonetDB (open-source DB system)


• Exploit RDF data characteristics on top of DB storage
   – Vertical partitioning: create separate predicate table for each predicate.


• Compression-based techniques
   – MonetDB and RDF-3X

 Challenges – Querying RDF Data
• Limited main memory compared to disk space
• Large intermediate join tables
• Scans over large percentage of indexes
   – Even with aggressive indexing + compression, e.g., Hexastore, RDF-3X


• Optimizations
   – Selectivity estimation in case of multiple level joins, left deep join tree
   – Sideways (parallel) information passing for several merge-joins
   – Semi-joins: reduce the database for a given join query




 Motivation for this work
• SPARQL join queries can be broadly classified into 3 types:
   1) Queries having highly selective triple patterns,
      e.g., (?s :residesIn USA)(?s :hasSSN “123-45-6789”)
       •   Existing techniques handle these queries very efficiently


   2) Queries with low-selectivity triple patterns but highly selective results,
      e.g., (?s :residesIn China)(?s :citizenOf India)

   3) Queries with low-selectivity triple patterns and low-selectivity results,
      e.g., (?s :residesIn USA)(?s :hasSSN ?y)
       •   Such queries involving multi-level joins can lead to large intermediate results




 Our Contribution
• A compressed data structure, BitMat, to store the RDF data



• A join query algorithm which operates directly on the
  compressed data:
   – No intermediate join tables; instead, use a 2-phase query algorithm
       • First phase: prune the candidate RDF triples
       • Second phase: stitch the final results directly from the pruned triples
   – Can guarantee memory requirements at the beginning of the query
   – Online/streaming result generation




 BitMat Construction
• Conceptually construct a bit-cube of subject (S), predicate (P),
  object (O) dimensions
• Mapping dictionary:
   – Vs: Set of subjects, Vp: Set of predicates, Vo: Set of objects, Vso = Vs ∩ Vo
   – Common subject and object URIs mapped to same integer IDs 1 to |Vso|
   – Subject only URIs mapped to integer IDs |Vso|+1 to |Vs|
   [Figure: conceptual bit-cube with S-, P-, and O-dimensions. The S-axis runs from 1 to |Vs| and the O-axis from 1 to |Vo|; IDs 1 to |Vso| are shared by both axes.]
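The mapping dictionary above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `build_dictionary` and the use of sorted order to make IDs deterministic are assumptions, and so is the rule that object-only values take IDs |Vso|+1 to |Vo| (mirroring the subject-only rule on the slide). The six input triples are the movie example that appears on the next slide.

```python
def build_dictionary(triples):
    """Assign integer IDs so that values used as both subject and object share one ID."""
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    objects = {o for _, _, o in triples}
    shared = sorted(subjects & objects)          # Vso = Vs ∩ Vo

    # Shared values take IDs 1..|Vso| in both the S and the O dimension.
    s_ids = {v: i for i, v in enumerate(shared, start=1)}
    o_ids = dict(s_ids)
    # Subject-only values continue from |Vso|+1 up to |Vs|.
    for v in sorted(subjects - objects):
        s_ids[v] = len(s_ids) + 1
    # Assumption: object-only values likewise continue from |Vso|+1 up to |Vo|.
    for v in sorted(objects - subjects):
        o_ids[v] = len(o_ids) + 1
    p_ids = {v: i for i, v in enumerate(sorted(predicates), start=1)}
    return s_ids, p_ids, o_ids


TRIPLES = [
    (":the_matrix", ":releasedIn", '"1999"'),
    (":the_thirteenth_floor", ":releasedIn", '"1999"'),
    (":the_matrix", ":similar_to", ":the_matrix_reloaded"),
    (":the_thirteenth_floor", ":similar_to", ":the_matrix"),
    (":the_matrix", "rdf:type", ":movie"),
    (":the_thirteenth_floor", "rdf:type", ":movie"),
]
s_ids, p_ids, o_ids = build_dictionary(TRIPLES)
```

Here `:the_matrix` is the only value in Vso, so it receives ID 1 on both the S and the O axis, which is what lets a subject-object join compare bit positions directly.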
    BitMat Construction (continued)
      Subject               Predicate     Object
      :the_matrix           :releasedIn   “1999”
      :the_thirteenth_floor :releasedIn   “1999”
      :the_matrix           :similar_to   :the_matrix_reloaded
      :the_thirteenth_floor :similar_to   :the_matrix
      :the_matrix           rdf:type      :movie
      :the_thirteenth_floor rdf:type      :movie

      [Figure: bit-cube for these triples, with S-dimension labels SO1 and SO2, P-dimension labels P1 to P3, and O-dimension labels SO1, SO2, O3, O4]



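•    Slice along P dimension and store S-O and O-S BitMats (one pair per predicate); the storage count below also implies one P-O BitMat per subject and one P-S BitMat per object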
•    Apply gap-encoding to each row of the BitMat before storing it
•    Storage: 2 |Vp| + |Vs| + |Vo| BitMats
•    Additionally store condensed representation of rows and
     columns and number of triples in each of the 4 types of BitMats
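As a rough illustration of slicing and gap-encoding, the Python sketch below builds the S-O BitMat for one predicate and run-length encodes each row. The slides do not give the on-disk layout, so the encoding used here (the value of the first bit followed by the lengths of alternating runs) is an assumption, as are the function names and the integer-encoded sample triples.

```python
def gap_encode(bits):
    """Encode a bit row as (first_bit, [run lengths]); an empty row gives (0, [])."""
    if not bits:
        return 0, []
    runs, count = [], 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return bits[0], runs


def gap_decode(first_bit, runs):
    """Invert gap_encode: expand alternating runs back into a bit row."""
    bits, value = [], first_bit
    for n in runs:
        bits.extend([value] * n)
        value ^= 1
    return bits


def so_bitmat(triples, predicate, num_subjects, num_objects):
    """Build the S-O BitMat (rows = subject IDs, columns = object IDs) for one predicate."""
    mat = [[0] * num_objects for _ in range(num_subjects)]
    for s, p, o in triples:
        if p == predicate:
            mat[s - 1][o - 1] = 1          # IDs are 1-based
    return mat


# Illustrative (subject ID, predicate ID, object ID) triples.
id_triples = [(1, 1, 2), (2, 1, 2), (1, 2, 4), (2, 2, 1), (1, 3, 3), (2, 3, 3)]
mat = so_bitmat(id_triples, predicate=2, num_subjects=2, num_objects=4)
encoded = [gap_encode(row) for row in mat]     # [(0, [3, 1]), (1, [1, 3])]
```

The O-S BitMat for the same predicate is the transpose of `mat`, encoded row by row in the same way. Long runs of zeros, which dominate sparse RDF data, collapse to a single integer under this scheme.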
 Operations on BitMat
• Join algorithm uses two basic operations: fold & unfold
• fold(BitMat, dimension) returns bitArray
   – Folds the input BitMat onto the retained dimension: a bit in the result is 1 if the corresponding row/column of the BitMat has any bit set

   [Figure: a BitMat folded into a single bit-array]




• unfold(BitMat, MaskBitArray, dimension)
   – For every bit set to 0 in MaskBitArray, clears all bits in the corresponding row/column of the BitMat along dimension


• Fold & unfold operate by doing bitwise AND/OR operations on
  gap compressed bit-vectors
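A minimal Python sketch of the two operations follows. It works on an uncompressed list-of-lists BitMat purely for readability, which is an assumption made here; per the slide, the actual operations run on gap-compressed bit-vectors. The dimension names "row" and "col" are also illustrative.

```python
def fold(bitmat, dimension):
    """Project the BitMat onto `dimension`: bit i is 1 if that row/column has any set bit."""
    if dimension == "row":
        return [int(any(row)) for row in bitmat]
    if dimension == "col":
        return [int(any(col)) for col in zip(*bitmat)]
    raise ValueError("dimension must be 'row' or 'col'")


def unfold(bitmat, mask, dimension):
    """Clear, in place, every row/column of the BitMat whose bit in `mask` is 0."""
    if dimension not in ("row", "col"):
        raise ValueError("dimension must be 'row' or 'col'")
    for i, row in enumerate(bitmat):
        for j in range(len(row)):
            keep = mask[i] if dimension == "row" else mask[j]
            if not keep:
                row[j] = 0
    return bitmat


bitmat = [[0, 1, 1, 0],
          [0, 0, 0, 0],
          [1, 0, 1, 0]]
row_bits = fold(bitmat, "row")            # [1, 0, 1]
col_bits = fold(bitmat, "col")            # [1, 1, 1, 0]
unfold(bitmat, [1, 0, 1, 0], "col")       # clears columns 1 and 3 (0-based)
```

After the `unfold` call only the bits in columns 0 and 2 survive, so the first row becomes `[0, 0, 1, 0]` and the last row is unchanged.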
 Query Processing Algorithm
• Build a constraint graph

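• E.g., the query (?m rdf:type :movie)(?n rdf:type :movie)(?m :similar_to ?n) has
  the following constraint graph

  [Figure: constraint graph. Gjvar has the join-variable nodes ?m and ?n. Gtp has the triple-pattern nodes (?m rdf:type :movie), (?m :similar_to ?n), and (?n rdf:type :movie), with edges labeled SS and SO.]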

• Each triple pattern has a BitMat containing only triples matching that triple
  pattern
• Propagate the constraints on join variable bindings imposed by each triple
  pattern
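One way to derive such a constraint graph from a list of triple patterns is sketched below in Python. The edge rules are a reading of the slide's figure rather than a definition taken from it: two join variables are linked in Gjvar when they co-occur in a triple pattern, and two triple patterns are linked in Gtp when they share a join variable, with the label giving the positions (S, P, or O) of that variable. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_constraint_graph(patterns):
    """Return (join variables, Gjvar edges, Gtp edges) for a list of (s, p, o) patterns."""
    occurrences = defaultdict(list)               # variable -> [(pattern index, position)]
    for idx, tp in enumerate(patterns):
        for pos, term in zip("SPO", tp):
            if term.startswith("?"):
                occurrences[term].append((idx, pos))

    # A join variable appears in more than one triple pattern.
    jvars = {v: occ for v, occ in occurrences.items()
             if len({i for i, _ in occ}) > 1}

    # Gtp: patterns sharing a join variable, labeled by its positions (e.g. "SS", "SO").
    tp_edges = {}
    for v, occ in jvars.items():
        for (i, pi), (j, pj) in combinations(occ, 2):
            if i != j:
                tp_edges[(i, j, v)] = "".join(sorted(pi + pj, key="SPO".index))

    # Gjvar: join variables that co-occur in the same triple pattern.
    jvar_edges = set()
    for tp in patterns:
        present = sorted(t for t in set(tp) if t in jvars)
        jvar_edges.update(combinations(present, 2))
    return jvars, jvar_edges, tp_edges


query = [("?m", "rdf:type", ":movie"),
         ("?m", ":similar_to", "?n"),
         ("?n", "rdf:type", ":movie")]
jvars, jvar_edges, tp_edges = build_constraint_graph(query)
```

For the example query this yields one Gjvar edge between `?m` and `?n`, an SS edge between the first two patterns, and an SO edge between the last two, matching the labels in the figure.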



Phase 1 -- Pruning phase
1. Embed a tree on the subgraph Gjvar
2. Walk over this tree from root to leaves and back in BFS order
3. At each node in the tree over Gjvar, collect all the variable
   bindings from the BitMats of the triple patterns containing that
   variable (fold operation)
4. Do a bitwise AND of all folded bit-arrays obtained
5. Relay back the results of bitwise AND on the BitMats (unfold
   operation)
•   Simple optimizations:
    – Tree root selection: select the join variable with the fewest
      triples in its BitMats as the root of the tree over Gjvar
    – Early stopping: if at any point the bitwise AND of the folded
      bit-arrays is all zeros, the query has no results and processing stops
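The steps above can be sketched in Python as follows. This is a simplified illustration under several assumptions: BitMats are uncompressed lists of lists, all join variables share one ID space of the same length, the BFS order of the tree is supplied by the caller (root selection is not shown), and the sample data is invented. `fold` and `unfold` are compact restatements of the two BitMat operations.

```python
def fold(bitmat, dim):
    """Bit i is 1 if row (or column) i of the BitMat has any set bit."""
    rows = bitmat if dim == "row" else zip(*bitmat)
    return [int(any(r)) for r in rows]


def unfold(bitmat, mask, dim):
    """Clear, in place, every row (or column) whose mask bit is 0."""
    for i, row in enumerate(bitmat):
        for j in range(len(row)):
            if not (mask[i] if dim == "row" else mask[j]):
                row[j] = 0


def prune(order, patterns):
    """order: list of join variables in BFS order over the tree on Gjvar, root first.
    patterns: list of (bitmat, {join variable: 'row' or 'col'}).
    Returns False if early stopping fired, i.e. the query has no results."""
    for var in order + order[-2::-1]:             # root to leaves, then back to the root
        touching = [(bm, dims[var]) for bm, dims in patterns if var in dims]
        folded = [fold(bm, dim) for bm, dim in touching]
        mask = [int(all(bits)) for bits in zip(*folded)]   # bitwise AND of the folds
        if not any(mask):
            return False                          # early stopping
        for bm, dim in touching:
            unfold(bm, mask, dim)
    return True


# Four entities (IDs 0..3); entity 3 is not a movie.
type_m = [[1], [1], [1], [0]]                     # ?m rdf:type :movie, rows = ?m
similar = [[0, 1, 0, 0],                          # ?m :similar_to ?n, rows = ?m, cols = ?n
           [0, 0, 0, 1],
           [0, 0, 0, 0],
           [0, 0, 1, 0]]
type_n = [[1], [1], [1], [0]]                     # ?n rdf:type :movie, rows = ?n
patterns = [(type_m, {"?m": "row"}),
            (similar, {"?m": "row", "?n": "col"}),
            (type_n, {"?n": "row"})]
has_results = prune(["?m", "?n"], patterns)
```

After pruning, the only bit left in `similar` is the edge from entity 0 to entity 1, and the two type BitMats retain only those two entities. No intermediate join table is built at any point; the BitMats are only ever cleared in place.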

   Pruning phase
   [Figure: fold and unfold applied over ?m and then ?n across the BitMats of the triple patterns (?m rdf:type :movie), (?m :similar_to ?n), and (?n rdf:type :movie)]



In the reverse traversal, while propagating the effect of the join over “?n”, folding the 2nd BitMat yields the same
bit-array as the earlier mask bit-array for ?m, so there is no need to fold/unfold the first BitMat again

Phase 2 -- Final result generation
• Resembles a multi-way join
• Start with the triple pattern that has the fewest triples left in its
  BitMat
• Generate bindings for variables in that triple pattern
• Next, select another triple pattern which shares a join variable with any
  of the previously selected triple patterns
• Check if it can generate the same bindings for the shared join variable
  and generate bindings for its other variables
• Continue this and at the end of one round when all triple patterns are
  processed and all variables have consistent bindings, output the result
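The procedure above resembles a backtracking multi-way join, which the Python sketch below illustrates. It is an approximation under stated assumptions: each pruned BitMat is represented by the set of binding tuples for its set bits rather than by compressed bit rows, and the function names and sample values are invented for the example.

```python
def order_patterns(patterns):
    """Start with the smallest pattern, then keep picking one that shares a variable."""
    remaining = sorted(patterns, key=lambda p: len(p[1]))
    ordered = [remaining.pop(0)]
    seen = set(ordered[0][0])
    while remaining:
        nxt = next((p for p in remaining if seen & set(p[0])), remaining[0])
        remaining.remove(nxt)
        ordered.append(nxt)
        seen |= set(nxt[0])
    return ordered


def generate_results(patterns):
    """patterns: list of (variables, rows), rows being the bindings left after pruning.
    Yields one dict of consistent variable bindings per result, one at a time."""
    ordered = order_patterns(patterns)

    def extend(i, bindings):
        if i == len(ordered):
            yield dict(bindings)                  # every pattern agreed: output the result
            return
        variables, rows = ordered[i]
        for row in sorted(rows):
            # Keep the row only if it agrees with every variable bound so far.
            if all(bindings.get(v, val) == val for v, val in zip(variables, row)):
                yield from extend(i + 1, {**bindings, **dict(zip(variables, row))})

    yield from extend(0, {})


patterns = [
    (("?a",), {(":s1",), (":s2",)}),                          # ?a rdf:type :Person
    (("?a", "?b"), {(":s1", ":o2"), (":s2", ":o3")}),         # ?a :worksFor ?b
    (("?c", "?b"), {(":t4", ":o2"), (":t3", ":o3")}),         # ?c :departmentOf ?b
]
results = list(generate_results(patterns))
```

Because `generate_results` is a generator, each result is emitted as soon as all triple patterns agree on it, in the spirit of the online/streaming result generation mentioned earlier in the slides.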



    Final result generation
 Sample query:
   ?a rdf:type :Person
   ?a :worksFor ?b
   ?c :departmentOf ?b

 [Figure: BitMats of the three triple patterns, indexed by ?a, by ?a and ?b, and by ?c and ?b, beside a Var/Val table of the bindings being tried (?a: :s1; ?b: :o2, :o3; ?c: :t4, :t3), with the note "Output this result"]




 Evaluation setup
• Competing RDF stores:
   – MonetDB
   – RDF-3X
• Datasets:
   – UniProt: Protein dataset with ~845M triples, ~147M subjects, 95 predicates, and
     ~128M objects
   – LUBM: Synthetic university dataset with ~1.33B triples, ~217M subjects, 18
     predicates, and ~161M objects
• Queries:
   – UniProt: Queries published by UniProt dataset owners and RDF-3X
   – LUBM: Queries published by OpenRDF
• Development environment:
   – Dell Optiplex 755 PC, 3.0 GHz Intel E6850 Core 2 Duo Processor, 4 GB memory.
   – 7 GB swap space on 7200 rpm 1 TB disk.
   – 64-bit 2.6.28-15 Linux kernel (Ubuntu 9.04 distribution).


 Results
• For queries with low-selectivity triple patterns, BitMat
  outperformed MonetDB and RDF-3X by 2-3 orders of magnitude

• For highly selective triple patterns, RDF-3X gave superior
  performance, especially for queries that could benefit from
  sideways information passing (SIP)

• BitMat’s shortcomings in case of highly selective queries:
   – The 2-phase query processing can create additional overheads for highly
     selective queries
   – No cache memory optimization
   – No memory mapping of disk files


UniProt 845 million triples (time in sec)

                   Q1            Q2           Q3           Q4           Q5           Q6           Q7           Q8
                   (4)           (7)          (8)          (4)          (3)          (7)          (2)          (12)
Cold cache
BitMat             451.365       269.526      173.324      9.396        78.35        1.34         9.33         13.06
MonetDB            548.21        303.213      124.356      9.63         97.28        11.28        9.91         15.93
RDF-3X             Aborted       525.125      224.58       1.38         4.636        0.902        0.892        1.353
Warm cache
BitMat             440.868       263.071      168.673      8.305        77.442       0.448        8.36         10.87
MonetDB            495.64        267.53       113.818      0.584        96.02        0.822        0.861        0.362
RDF-3X             Aborted       487.182      226.05       0.077        1.008        0.0064       0.003        0.03

#Results           160,198,689   90,981,843   50,192,929   0            179,316      0            0            19
#Initial triples   92,965,468    73,618,481   78,840,372   16,626,073   60,260,006   15,408,126   16,625,901   53,677,336


                                                                                               More results in the paper



LUBM 1.33 billion triples (time in sec)

                   Q1 (Circ)     Q2 (Star)     Q3 (Circ)     Q4 (Star)     Q5 (Star)     Q6
Cold cache
BitMat             51.21         2.71          6.56          2.45          0.503         3.81
MonetDB            548.21        27.17         455.23        34.12         18.89         14.6
RDF-3X             Aborted       34.868        2324.753      0.588         0.425         1.129
Warm cache
BitMat             48.57         2.11          1.94          0.686         0.27          2.85
MonetDB            96.65         6.56          398.46        3.209         0.566         0.542
RDF-3X             Aborted       29.033        2028.685      0.0024        0.0029        0.1814

#Results           2528          10,799,863    0             10            10            125
#Initial triples   165,397,764   224,805,759   219,416,877   438,912,513   3,000,966     9,100,649




Comparison of index storage space


           BitMat (including LZ77-compressed   RDF-3X   MonetDB   Raw triples
           dictionary mapping)                                    (uncompressed)
UniProt    51.2 GB                             42 GB    16 GB     205 GB
LUBM       68.8 GB                             70 GB    25 GB     451 GB




 Future Roadmap
• Does not yet allow a subset of variables to be specified by the
  SELECT clause

• Cannot yet process other classes of SPARQL queries,
  e.g., OPTIONAL, UNION, FILTER

• S-P or P-O dimensional joins are not handled
   – Rare in assertional RDF data

• Cannot yet perform addition/deletion/update of triples

• Incorporate lazy-loading of BitMats to avoid overheads for highly
  selective queries



Thank you!



