MD-HBase_MDM2011

MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services


     Shoji Nishimura (NEC Service Platforms Labs.),
     Sudipto Das, Divyakant Agrawal, Amr El Abbadi
        (University of California, Santa Barbara)


     * Work done as a visiting researcher at UCSB

Overview

▐ A Motivating Story
▐ Existing Technologies
▐ Our proposal
▐ Evaluation
▐ Conclusion




Motivating Scenario: Mobile Coupon Distribution

   [Figure: a Mobile Coupon Distributor takes users' current locations and a distribution policy as input and issues coupons.]

   Distribution Policy
      • Area
      • # of coupons




Motivating Scenario: Mobile Coupon Distribution
 System Scalability
      Large amounts of Data
      High Throughput
 Efficient Complex Queries
      Multi-Dimensional Query
      Nearest Neighbors Query

   [Figure: many users reporting current locations; coupons issued under the distribution policy (area, # of coupons).]

   125,000,000 subscribers in Japan


Existing Technologies
   Criteria: Multi-dimensional Queries, Scalability

   Relational DBs / Spatial DBs
       Commercial products (but expensive)
       Open source products
   Key-Value Stores
   What We Want
       Multi-dimensional queries and scalability at a reasonable price

Ordered Key-Value Stores

   [Figure: an index (key00, key11, ..., keynn) pointing to buckets of key-value pairs sorted by key (key00 to key0X, key11 to key1Y, ..., keynn).]

   Good at 1-D Range Query
   But, our target is multi-dimensional… (Latitude, Longitude, Time)
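
To make the 1-D range query concrete, here is a minimal sketch of an ordered key-value store kept in memory. The class `OrderedKVStore` and its methods are hypothetical names used for illustration only; they are not the API of HBase or BigTable.

```python
import bisect


class OrderedKVStore:
    """Toy in-memory ordered key-value store (hypothetical, for illustration)."""

    def __init__(self):
        self._keys = []      # keys kept in sorted order
        self._values = {}    # key -> value

    def put(self, key, value):
        # Insert the key at its sorted position the first time it is seen.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key <= end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = OrderedKVStore()
for k in ["key12", "key00", "key11", "key01"]:
    store.put(k, "value" + k[3:])

result = store.scan("key01", "key11")
```

Because the keys are sorted, a range scan touches only the contiguous slice between the two bounds. That works for a single dimension, but not directly for latitude, longitude, and time together.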
Naïve Solution: Linearization
   Projects n-D space to 1-D space
   Apply a Z-ordering curve…

        5    7    13   15
        4    6    12   14
        1    3     9   11
        0    2     8   10

   [Figure: the linearized keys stored in the ordered key-value store (index and buckets).]

   Simple, but problematic…
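
The Z-order values in the 4 x 4 grid can be reproduced by interleaving the bits of the two coordinates. The sketch below assumes 2 bits per dimension and that the horizontal coordinate contributes the higher bit of each pair, which is what the grid implies; the function name `z_value` is an illustrative choice.

```python
def z_value(x, y, bits=2):
    """Interleave the bits of x and y (x bit first) into one Z-order value."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)   # next bit from x
        z = (z << 1) | ((y >> i) & 1)   # next bit from y
    return z


# Rows are listed from the top (y = 3) down to the bottom (y = 0).
grid = [[z_value(x, y) for x in range(4)] for y in reversed(range(4))]
for row in grid:
    print(row)
```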
Problem: False positive scans

▐ MD-query on Linearized space
      Translate an MD-query to a linearized range query.
          • Ex. Query from 2 to 9.
      Scan the queried linearized range.
      Filter out points outside the queried area.
          • Ex. blue-hatched area (4 to 7)

        5   7   13   15
        4   6   12   14
        1   3   9    11
        0   2   8    10

     Requires the boundary information of
     the original space.
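
The "2 to 9" example can be replayed in a few lines. This is a sketch under the same assumptions as the grid (2 bits per dimension, horizontal bit first); the helper names are hypothetical. The query rectangle is the one whose corners map to Z-values 2 and 9.

```python
def z_encode(x, y, bits=2):
    """Interleave the bits of x and y (x bit first)."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def z_decode(z, bits=2):
    """Recover (x, y) from a Z-order value."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def linearized_range_query(x_min, x_max, y_min, y_max, bits=2):
    """Scan the linearized range, then filter cells outside the rectangle."""
    z_lo = z_encode(x_min, y_min, bits)
    z_hi = z_encode(x_max, y_max, bits)
    scanned = list(range(z_lo, z_hi + 1))
    hits = []
    for z in scanned:
        x, y = z_decode(z, bits)
        if x_min <= x <= x_max and y_min <= y <= y_max:
            hits.append(z)
    false_positives = [z for z in scanned if z not in hits]
    return scanned, hits, false_positives


scanned, hits, false_positives = linearized_range_query(1, 2, 0, 1)
```

Half of the scanned cells (4 to 7) lie outside the queried area. They are read and then discarded, which is the cost the slide calls false positive scans.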



Our Approach: MD-HBase
     Build a Multi-dimensional Index Layer on top of an Ordered Key-Value store

     [Figure: an Ordered Key-Value Store (ex. BigTable, HBase, …) with a Single Dimensional Index, next to MD-HBase, which adds a Multi-Dimensional Index on top.]




Introduce Multi-dimensional Index

▐ Multi-dimensional Index (ex. the K-d tree, the Quad tree)
      Divide a space into subspaces containing almost the same # of points
      Organize the subspaces as a tree

     Efficient subspace pruning → avoids false positive scans

     [Figure: a space divided into subspaces, then organized as a tree.]




Space Partition By the K-d tree

   [Figure, two panels over the same grid: "Binary Z-ordering space" (bitwise interleaving) and "Partitioned space by the K-d tree".]

   11   0101   0111   1101   1111
   10   0100   0110   1100   1110
   01   0001   0011   1001   1011
   00   0000   0010   1000   1010
        00     01     10     11

   How do we represent these subspaces?

Key Idea: The longest common prefix naming scheme

 Subspaces represented as the longest common prefix of keys!

    11   0101   0111   1101   1111
    10   0100   0110   1100   1110
    01   0001   0011   1001   1011
    00   0000   0010   1000   1010
         00     01     10     11

    Example subspaces: 000*, 1***

 Remarkable Property
    • Preserve boundary information of the original space
    • Ex. 1***
         * → 0: 1000 = (10, 00), left-bottom corner
         * → 1: 1111 = (11, 11), right-top corner
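
The boundary reconstruction can be sketched as follows: fill the wildcards with 0 to get the left-bottom corner, fill them with 1 to get the right-top corner, then undo the bit interleaving. The function names are hypothetical, and the code assumes keys are bit strings in which the horizontal coordinate supplies the first bit of each pair, as in the grid.

```python
def deinterleave(key):
    """Split an interleaved bit string into (x, y) integer coordinates."""
    x_bits = key[0::2]   # bits at even positions belong to x
    y_bits = key[1::2]   # bits at odd positions belong to y
    return int(x_bits, 2), int(y_bits, 2)


def subspace_bounds(prefix):
    """Return (left-bottom corner, right-top corner) of a subspace name."""
    low = prefix.replace("*", "0")
    high = prefix.replace("*", "1")
    return deinterleave(low), deinterleave(high)


bounds_1 = subspace_bounds("1***")   # the example from the slide
bounds_2 = subspace_bounds("000*")
```

For `1***` this yields corners (2, 0) and (3, 3), that is (10, 00) and (11, 11) in binary, matching the slide.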
Build an index with the longest common prefix of keys

   11   0101   0111   1101   1111
   10   0100   0110   1100   1110
   01   0001   0011   1001   1011
   00   0000   0010   1000   1010
        00     01     10     11

   Subspaces: 000*, 001*, 01**, 1***

   Index → Buckets (one bucket allocated per subspace)
        000*
        001*
        01**
        1***
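
A minimal sketch of such an index, assuming the subspace names tile the whole key space and are kept sorted by their lowest key. The functions `make_index`, `find_subspace`, and `split_subspace` are hypothetical names for illustration, and splitting on the first wildcard bit is an assumption modeled on a K-d style binary split, not necessarily the exact procedure used in MD-HBase.

```python
import bisect


def make_index(prefixes):
    """Sort subspace names by the lowest key they cover (wildcards as 0)."""
    return sorted(prefixes, key=lambda p: p.replace("*", "0"))


def find_subspace(index, key):
    """Find the subspace containing key: last entry whose lowest key <= key."""
    lows = [p.replace("*", "0") for p in index]
    return index[bisect.bisect_right(lows, key) - 1]


def split_subspace(index, prefix):
    """Split a subspace in two by fixing its first wildcard bit to 0 and 1.

    Raises ValueError if the prefix is missing or has no wildcard left.
    """
    i = index.index(prefix)
    pos = prefix.index("*")
    children = [prefix[:pos] + bit + prefix[pos + 1:] for bit in "01"]
    return index[:i] + children + index[i + 1:]


index = make_index(["1***", "01**", "000*", "001*"])
owner = find_subspace(index, "0110")
after_split = split_subspace(index, "1***")
```

Since the names sort in the same order as the keys they cover, the index itself can live in the ordered key-value store, and finding the bucket for a point is a single 1-D lookup.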


Multi-dimensional Range Query

   11   0101   0111   1101   1111
   10   0100   0110   1100   1110
   01   0001   0011   1001   1011
   00   0000   0010   1000   1010
        00     01     10     11

   Index: 000*, 001*, 01**, 10**, 11**

   1. Scan 0010 - 1001 on the index (subspace pruning)
   2. Filter: reconstruct the boundary info. and
      check whether each subspace intersects the queried area
   3. Scan the remaining buckets: 001* and 10**
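
The three steps can be sketched on the slide's example. The helper names are hypothetical, keys are assumed to be bit strings with the horizontal bit first, and the queried area is taken to be the rectangle whose corners map to 0010 and 1001.

```python
def z_key(x, y, bits=2):
    """Interleave x and y into a bit-string key (x bit first)."""
    xb, yb = format(x, f"0{bits}b"), format(y, f"0{bits}b")
    return "".join(a + b for a, b in zip(xb, yb))


def subspace_bounds(prefix):
    """Corners of a subspace: wildcards filled with 0 (low) and 1 (high)."""
    def corner(key):
        return int(key[0::2], 2), int(key[1::2], 2)
    return corner(prefix.replace("*", "0")), corner(prefix.replace("*", "1"))


def select_buckets(index, x_min, x_max, y_min, y_max, bits=2):
    """Return (candidates from the index scan, buckets left after filtering)."""
    z_lo, z_hi = z_key(x_min, y_min, bits), z_key(x_max, y_max, bits)
    # Step 1: keep subspaces whose key range overlaps [z_lo, z_hi].
    candidates = [p for p in index
                  if p.replace("*", "1") >= z_lo and p.replace("*", "0") <= z_hi]
    # Step 2: rebuild each bounding box and test rectangle intersection.
    selected = []
    for p in candidates:
        (bx0, by0), (bx1, by1) = subspace_bounds(p)
        if bx0 <= x_max and bx1 >= x_min and by0 <= y_max and by1 >= y_min:
            selected.append(p)
    return candidates, selected


index = ["000*", "001*", "01**", "10**", "11**"]
candidates, selected = select_buckets(index, 1, 2, 0, 1)
```

Only the buckets in `selected` need to be scanned (step 3). In this example `01**` falls inside the linearized range but is pruned because its box does not intersect the query, so the false positives of the naïve approach are avoided at bucket granularity.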


K Nearest Neighbors Query

▐ The best-first algorithm can be applied.
      The most efficient technique in practical cases
▐ See our paper for details

     [Figure: grid of subspaces labeled 1 to 5.]
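
The slides defer the details to the paper, so the following is only a generic best-first sketch, not the authors' exact algorithm. It assumes an in-memory dictionary from subspace name to points (a hypothetical stand-in for the buckets) and squared Euclidean distance; all function names are illustrative.

```python
import heapq


def subspace_bounds(prefix):
    """Corners of a subspace: wildcards filled with 0 (low) and 1 (high)."""
    def corner(key):
        return int(key[0::2], 2), int(key[1::2], 2)
    return corner(prefix.replace("*", "0")), corner(prefix.replace("*", "1"))


def min_dist2(box, q):
    """Squared distance from point q to the nearest point of a box."""
    (x0, y0), (x1, y1) = box
    dx = max(x0 - q[0], 0, q[0] - x1)
    dy = max(y0 - q[1], 0, q[1] - y1)
    return dx * dx + dy * dy


def knn(buckets, q, k):
    """Best-first search: always expand the closest subspace or point."""
    # Heap entries: (distance, kind, item); kind 0 = subspace, 1 = point.
    heap = [(min_dist2(subspace_bounds(p), q), 0, p) for p in buckets]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        _, kind, item = heapq.heappop(heap)
        if kind == 0:
            # Expand the subspace: queue its points by their true distance.
            for pt in buckets[item]:
                d = (pt[0] - q[0]) ** 2 + (pt[1] - q[1]) ** 2
                heapq.heappush(heap, (d, 1, pt))
        else:
            # No remaining subspace can hold a closer point.
            result.append(item)
    return result


buckets = {
    "000*": [(0, 0), (0, 1)],
    "001*": [(1, 1)],
    "01**": [(0, 3), (1, 2)],
    "1***": [(2, 0), (3, 3), (2, 2)],
}
neighbors = knn(buckets, q=(1, 1), k=3)
```

A subspace is expanded only when its minimum possible distance is smaller than that of every point already queued, so buckets far from the query point are never read.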


Variations of Storage Layer
   Table Share Model
       Uses a single table; maintains bucket boundaries
       Most space-efficient
       Monitor

   Table per Bucket Model
       Allocates a table per bucket
       Most flexible mapping
            One-to-one, one-to-many, many-to-one
       Bucket split is expensive
            Copies all points to the new buckets

   Region per Bucket Model
       Allocates a region per bucket
       Most efficient bucket split
            Asynchronous bucket split
       Requires modification of HBase

Experimental Results: Multi-dimensional Range Query
     Dataset: 400,000,000 points
     Queries: select objects within MD ranges, varying the selectivity
     Cluster size: 16 nodes
     MD-HBase responds 10 to 100 times faster than the others,
      with response time proportional to selectivity.

     [Figure: Response Time (Sec), log scale from 1 to 1000, vs. Selectivity (%), from 0.01 to 10, for MD-HBase, HBase (ZOrder), and MapReduce.]

Experimental Results: k Nearest Neighbors Query

     Dataset: 400,000,000 points
     Queries: choose a point and vary the number of neighbors
     Cluster size: 16 nodes
     MD-HBase responds in 1.5 sec where k ≦ 100,
      and in 11 sec even if k = 10,000

     [Figure: Response Time (Sec), from 0 to 12, vs. k: Number of Neighbors, from 1 to 10000.]

Experimental Results: Insert

     Dataset: spatially skewed data generated by a Zipfian distribution
     MD-HBase shows good scalability without significant overhead.

     [Figure: Throughput (records/sec), from 0 to 250,000, vs. Number of nodes, from 0 to 20, for MD-HBase and HBase (ZOrder).]

Conclusions
     Designed a scalable multi-dimensional data store.
             Scalability & efficient multi-dimensional queries
             Key Idea: indexing the longest common prefix of keys
             Easily extends general ordered key-value stores.
     Demonstrated scalable insert throughput and excellent query
      performance.
             Range Query: 10-100 times faster than existing technologies.
             kNN Query: 1.5 s when k ≦ 100.
             Insert: 220K inserts/sec on a 16-node cluster without significant overhead



       Thank you.
       Any Questions?
