Concurrent Cache-Oblivious B-trees Using Transactional Memory

Jim Sukha
Bradley Kuszmaul
MIT CSAIL
June 10, 2006

           Thought Experiment
Imagine that, one day, you are assigned the
  following task:
 Enclosed is code for a serial, cache-
 oblivious B-tree. We want a reasonably
 efficient parallel implementation that
 works for disk-resident data.
 Attach:   COB-tree.tar.gz

 PS. We want to be able to restore the data
 to a consistent state after a crash too.
 PPS. Our deadline is next week. Good luck!
       Concurrent COB-tree?
Question:
 How can one program a concurrent,
 cache-oblivious B-tree?


Approach:
 We employ transactional memory. What
 complications does I/O introduce?
  Potential Pitfalls Involving I/O
Suppose our data structure resides on disk.
1. We might need to make explicit I/O calls to
   transfer blocks between memory and disk.
   But a cache-oblivious algorithm doesn’t know
   the block size B!
2. We might need buffer management code if the
   data doesn’t fit into main memory.
3. We might need to unroll I/O if we abort a
   transaction that has already written to disk.
        Our Solution: Libxac
• We have implemented Libxac, a page-
  based transactional memory system that
  operates on disk-resident data. Libxac
  supports ACID transactions on a memory-
  mapped file.
• Using Libxac, we are able to implement a
  complex data structure that operates on
  disk-resident data, e.g. a cache-oblivious
  B-tree.
     Libxac Handles Transaction I/O
1.    We might need to make explicit I/O calls to transfer blocks
      between memory and disk.
      Similar to mmap, Libxac provides a function xMmap. Thus,
      we can operate on disk-resident data without knowing
      block size.

2.    We might need buffer management code if the data
      doesn’t fit into main memory.
      Like mmap, the OS automatically buffers pages in memory.

3.    We might need to unroll I/O if we abort a transaction that
      has already written to disk.
      Since Libxac implements multiversion concurrency
      control, we still have the original version of a page even if
      a transaction aborts.
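
For comparison, here is a minimal sketch of point 1 using plain POSIX mmap rather than Libxac. The helper name increment_first_int is hypothetical, and the 4096-byte length simply mirrors the example on the slides. Note that the block size B never appears: the OS moves pages between disk and memory on demand.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of Libxac): increment the first int stored
   in the file at `path`, creating the file if needed. Returns the new value,
   or -1 on error. There is no concurrency control or atomicity here. */
int increment_first_int(const char *path) {
    const size_t len = 4096;              /* map the first 4096 bytes */
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;

    /* Grow the file to `len` bytes if it is shorter; new bytes read as 0. */
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if ((size_t)st.st_size < len && ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    int *x = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                            /* the mapping stays valid */
    if (x == MAP_FAILED) return -1;

    x[0]++;                               /* ordinary memory access, no explicit I/O */
    int result = x[0];

    msync(x, len, MS_SYNC);               /* flush the dirty page to the file */
    munmap(x, len);
    return result;
}
```

Calling the helper twice on a fresh file yields 1 and then 2, because the update persists in the file between calls.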
                Outline
• Programming with Libxac



• Cache-Oblivious B-trees
    Example Program with Libxac
int main(void) {
  int* x;
  int status = FAILURE;

  // Runtime initialization function. For durable transactions,
  // logs are stored in the specified directory. (See note * below.)
  xInit("/logs", DURABLE);

  // Transactionally maps the first page of the input file.
  x = xMmap("input.db", 4096);

  // Transaction body. The body can be a complex function
  // (e.g., a cache-oblivious B-tree insert!).
  while (status != SUCCESS) {
    xbegin();
      x[0]++;
    status = xend();
  }

  xMunmap(x);       // Unmap the region.
  xShutdown();      // Shut down the runtime.
  return 0;
}

* Currently Libxac logs the transaction commits, but we haven't
  implemented the recovery program yet.
            Libxac Memory Model
int main(void) {
  int* x;
  int status = FAILURE;
  xInit("/logs", DURABLE);

  x = xMmap("input.db", 4096);

  while (status != SUCCESS) {
    xbegin();
      x[0]++;
    status = xend();
  }

  xMunmap(x);
  xShutdown();
  return 0;
}

1. Aborted transactions are visible to the programmer (thus, the
   programmer must explicitly retry the transaction). Control flow always
   proceeds from xbegin() to xend(). Thus, the transaction body can
   contain system/library calls.
2. At xend(), all changes to the xMmap'ed region are discarded on
   FAILURE, or committed on SUCCESS.
3. Aborted transactions always see consistent state. Read-only
   transactions can always succeed.

* Libxac supports concurrent transactions on multiple processes, not threads.
        Implementation Sketch
• Libxac detects memory accesses by using a
  SIGSEGV handler to catch a memory protection
  violation on a page that has been mmap’ed.
• This mechanism is slow for normal transactions:
   – Time for mmap, SIGSEGV handler: ~ 10 µs


• Efficient if we must perform disk I/O to log
  transaction commits.
   – Time to access disk:            ~ 10 ms
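
The detection mechanism in the first bullet can be sketched with standard POSIX calls. This is not Libxac's code: the names track_pages, on_fault, and the counters are hypothetical, the region is anonymous memory rather than a mapped file, and the sketch assumes Linux, where a protection violation is delivered as SIGSEGV. Calling mprotect from a signal handler is common practice for this technique but is not guaranteed by POSIX.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

enum { MAX_PAGES = 16 };

static char *region;                         /* start of the tracked region */
static size_t region_pages;                  /* number of tracked pages */
static size_t page_bytes;                    /* system page size */
static unsigned char page_state[MAX_PAGES];  /* 0 = none, 1 = read, 2 = read/write */
static volatile sig_atomic_t read_faults, write_faults;

/* First fault on a page grants read access and records a read;
   a second fault on the same page grants write access and records a write. */
static void on_fault(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    char *addr = (char *)info->si_addr;
    if (addr < region || addr >= region + region_pages * page_bytes) {
        signal(SIGSEGV, SIG_DFL);            /* not ours: let the real crash happen */
        return;
    }
    size_t page = (size_t)(addr - region) / page_bytes;
    char *base = region + page * page_bytes;
    if (page_state[page] == 0) {
        mprotect(base, page_bytes, PROT_READ);
        page_state[page] = 1;
        read_faults++;
    } else {
        mprotect(base, page_bytes, PROT_READ | PROT_WRITE);
        page_state[page] = 2;
        write_faults++;
    }
}

/* Allocate `npages` zero-filled pages with no access rights and install
   the handler. Returns the region, or NULL on error. */
int *track_pages(size_t npages) {
    if (npages == 0 || npages > MAX_PAGES) return NULL;
    page_bytes = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, npages * page_bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    region = p;
    region_pages = npages;
    for (size_t i = 0; i < npages; i++) page_state[i] = 0;

    struct sigaction sa;
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) return NULL;
    return (int *)p;
}
```

After track_pages returns, the first read of a page takes one fault, and the first write to a page that has only been read takes another. The two counters therefore approximate a transaction's read set and write set at page granularity.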
             Is xMmap practical?
Experiment on a 4-processor AMD Opteron, performing 100,000 insertions of
elements with random keys into a B-tree. Each insert is a separate
transaction. Libxac and BDB both implement group commit.

[Figure: plot of the experimental results; not preserved in this copy.]

B-tree and COB-tree both use Libxac. Note that none of the three data
structures has been properly tuned.
Conclusion: We should achieve good performance.
                Outline
• Programming with Libxac



• Cache-Oblivious B-trees
What is a Cache-Oblivious B-tree?
• A cache-oblivious B-tree (e.g. [BDFC00]) is a
  dynamic dictionary data structure that supports
  searches, insertions/deletions, and range-
  queries.
• A cache-oblivious algorithm/data structure
  does not know system parameters (e.g., the
  block size B).
• Theorem [FLPR99]: a cache-oblivious algorithm
  that is optimal for a two-level memory hierarchy
  is also optimal for a multi-level hierarchy.
       Cache-Oblivious B-Tree Example
  Static Cache-                                               21
  Oblivious
  Tree                      10                                                              45


            4                              16                                38                                54


   4              10                16            21                  38            45              54                83


1 -- -- 4       6 7 10 --        13 15 -- 16    -- -- 21 --        23 24 31 38    39 40 -- 45    -- 48 54 --        56 59 70 83

 Packed Memory Array (PMA)

The COB-tree can be divided into two pieces:
1. A packed memory array that stores the data in order, but contains gaps.
2. A static cache-oblivious binary-tree that indexes the packed memory array.
       Cache-Oblivious B-Tree Insert
  Static Cache-                                               21
  Oblivious
  Tree                      10                                                              45


            4                              16                                38                                54


   4              10                16            21                  38            45              54                83


1 -- -- 4       6 7 10 --        13 15 -- 16    -- -- 21 --        23 24 31 38    39 40 -- 45    -- 48 54 --        56 59 70 83


To insert a key of 37:                          37
1. Find correct section of PMA location using static tree.
2. Insert into PMA. This step may cause a rebalance of the PMA.
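
Step 1 can be illustrated with a simplified stand-in for the static tree. The sketch below assumes the maximum key of each PMA section is kept in a sorted array (the leaf row of the tree in the figure) and uses an ordinary binary search; it does not use the cache-oblivious layout. The function name find_section is hypothetical.

```c
#include <stddef.h>

/* Return the index of the first section whose maximum key is >= key.
   If key exceeds every maximum, return the last section.
   Requires nsections >= 1 and section_max sorted in increasing order. */
size_t find_section(const int *section_max, size_t nsections, int key) {
    size_t lo = 0, hi = nsections - 1;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (section_max[mid] >= key) {
            hi = mid;          /* answer is mid or to its left */
        } else {
            lo = mid + 1;      /* answer is strictly to the right */
        }
    }
    return lo;
}
```

With the section maxima from the figure, {4, 10, 16, 21, 38, 45, 54, 83}, a key of 37 maps to index 4, the section holding 23 24 31 38.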
       Cache-Oblivious B-Tree Insert
  Static Cache-                                               21
  Oblivious
  Tree                      10                                                                  45


            4                              16                                38                                54


   4              10                16            21                  38             45                 54             83


1 -- -- 4       6 7 10 --        13 15 -- 16    -- -- 21 --        23 24 31 37    38 39 40 --        45 48 54 56    59 70 83 --


To insert a key of 37:
1. Find correct section of PMA location using static tree.
2. Insert into PMA. This step possibly requires a rebalance.
3. Fix the static tree.
       Cache-Oblivious B-Tree Insert
  Static Cache-                                               21
  Oblivious
  Tree                      10                                                                  40


            4                              16                                37                                56


   4              10                16            21                  37             40                 56             83


1 -- -- 4       6 7 10 --        13 15 -- 16    -- -- 21 --        23 24 31 37    38 39 40 --        45 48 54 56    59 70 83 --


To insert a key of 37:
1. Find correct section of PMA location using static tree.
2. Insert into PMA. This step possibly requires a rebalance.
3. Fix the static tree.
       Cache-Oblivious B-Tree Insert
  Static Cache-                                               21
  Oblivious
  Tree                      10                                                                  40


            4                              16                                37                                56


   4              10                16            21                  37             40                 56             83


1 -- -- 4       6 7 10 --        13 15 -- 16    -- -- 21 --        23 24 31 37    38 39 40 --        45 48 54 56    59 70 83 --



  Insert is a complex operation. If we wanted to use locks,
  what is the locking protocol? What is the right (cache-
  oblivious?) lock granularity?
            Conclusions
A page-based TM system such as Libxac
• Represents a good match for disk-resident
  data structures.
  – The per-page overheads of TM are small
    compared to cost of I/O.
• Is easy to program with.
  – Libxac allows us to program a concurrent,
    disk-resident data structure with ACID
    properties, as though it was stored in
    memory.
      Semantics of Local Variables
int main(void) {
  int y=0, z=0, a=0, b=0;
  int* x;
  int status = FAILURE;
  xInit("/logs", DURABLE);
  x = xMmap("input.db", 4096);

  while (status != SUCCESS) {
    a++;
    xbegin();
      b = x[0];
      y++;
      x[0]++;
      z = x[0] - 1;
    status = xend();
  }

  xMunmap(x);
  xShutdown();
  return 0;
}

In this system, Libxac guarantees that after the loop completes:
• a == y. The value of a is the number of times the transaction is
  attempted.
• b == z. We always have b == z because aborted transactions always see
  consistent state, even if other programs concurrently access the
  first page of input.db.
   TM System Improvements?
Possible improvements to Libxac:
  – Provide more efficient support for non-durable
    transactions by modifying the OS to track and
    report pages accessed?
  – Integrate Libxac with another TM system to
    provide concurrency control on both multiple
    threads and multiple processes?
                Implementation Sketch
int a;
xbegin();
  a = x[0];
  x[1024] += a+1;
xend();

[Figure: animation of one transaction, with columns Memory Map, input.txt,
Buffer File, and Log File. The mapped file has four pages, starting at x[0],
x[1024], x[2048], and x[3072], which initially hold 1, 7, 2, and 9.]

1. Initially, every page of the memory map is protected with PROT_NONE.
2. Reading x[0] causes a segmentation fault. The handler changes the
   protection of the first page to PROT_READ.
3. Reading x[1024] causes a segmentation fault on the second page, which
   also becomes PROT_READ.
4. Writing x[1024] causes a second segmentation fault on the same page.
   The handler copies the page contents into the buffer file and changes
   the protection to PROT_READ|PROT_WRITE.
5. At xend(), the modified page is written to the log on disk.
6. The new page contents are then copied into the original file, and all
   pages return to PROT_NONE.
        Focus on PMA Rebalance
insert (tree, key, value) {
   xbegin();
   x=find_location_in_pma(tree->static_index,
                                  key);
   insert_into_pma(tree->pma, key, value, x);
   fix_static_index(tree->static_index);
   xend();
}


 rebalance(tree->pma);


In this talk, we illustrate the problems of transaction
I/O considering a transactional rebalance of the
packed memory array.
    Rebalance of an In-Memory Array
#include <math.h>   // for floor()

void rebalance(int* x, int n) {
   // Before:                0 -- 1 2 3 -- -- 4 -- 5 6 --
   int i;
   int count = 0;
   // Slide everything left
   for (i = 0; i < n; i++) {
    if (x[i] != EMPTY_SLOT) {
        x[count] = x[i];
        count++;
    }
   }
   // After sliding left:    0 1 2 3 4 5 6 4 -- 5 6 --
   if (count == 0) return;   // nothing to redistribute
   // Redistribute items from right
   int j = count-1;
   double spacing = 1.0*n/count;
   for (i = n-1; i >= 0; i--) {
     if (j >= 0 && floor(j*spacing) == i) {
         x[i] = x[j];
         j--;                // next item to place
     }
     else {
         x[i] = EMPTY_SLOT;
     }
   }
   // After redistributing:  0 1 -- 2 -- 3 4 -- 5 -- 6 --
}
         Rebalance with Explicit I/O
void rebalance_with_IO(int n) {
   int i;
   int count = 0;
   int* y; int* z;

   y = read_block(0);
   z = read_block(0);

   for (i = 0; i < n; i++) {
     if (i % B == 0) {
         y = read_block(i/B);
     }
     if (y[i%B] != EMPTY_SLOT) {
         if (count % B == 0) {
             z = read_block(count/B);
         }
         z[count%B] = y[i%B];
         count++;
     }
   }
   ...
   write_block(…)
}

Why do we want to avoid performing explicit I/O to read/write data blocks?

Issues:
1. What if the data does not fit into memory?
   We must have buffer management code somewhere.
2. A cache-oblivious algorithm does not know the value of B!
     Rebalance using Memory Mapping
void rebalance_with_mmap(int n) {
   int i;
   int count = 0;
   x = mmap("input.db", n*sizeof(int));
   for (i = 0; i < n; i++) {
     if (x[i] != EMPTY_SLOT) {
         x[count] = x[i];
         count++;
     }
   }
   ...
   munmap(x, n*sizeof(int));
}

1. What if the data does not fit into memory?
   If we use memory mapping, then the OS automatically buffers pages
   that are accessed.
2. What value of B do we choose for a cache-oblivious algorithm?
   I/O is transparent to the user. B does not appear in the
   application code.

Using mmap, the code looks like the in-memory rebalance.
But we still need concurrency control.
           Concurrent Rebalance?
void rebalance_with_mmap(int n) {
   int i;
   int count = 0;
   x = mmap("input.db", n*sizeof(int));
   for (i = 0; i < n; i++) {
     if (x[i] != EMPTY_SLOT) {
         x[count] = x[i];
         count++;
     }
   }
   ...
   munmap(x, n*sizeof(int));
}

What happens if we want the rebalance to occur concurrently?

If we use locks, what do we choose as the locking granularity?

If we use transactions, will the system need to unroll I/O when a
transaction aborts to ensure the data on disk is consistent?
    Transactional Memory Mapping
void rebalance_with_xMmap(int n) {
   int i;
   int count = 0;
   x = xMmap("input.db", n*sizeof(int));
   xbegin();
   for (i = 0; i < n; i++) {
     if (x[i] != EMPTY_SLOT) {
         x[count] = x[i];
         count++;
     }
   }
   ...
   xend();
   xMunmap(x, n*sizeof(int));
}

Our solution:
Replace mmap with xMmap, and use transactions for concurrency control.
The transaction system maintains multiple versions of a page to avoid
needing to unroll I/O.

Transactional memory mapping simplifies the code for
a concurrent disk-resident data structure.
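
Why multiple page versions avoid unrolling I/O can be shown with a small in-memory simulation. This is an illustrative sketch only, not Libxac's implementation: the type versioned_page and the vp_* functions are hypothetical, and the page size of 1024 ints is an assumption. Writes go to a private copy, so an abort only has to discard that copy.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { PAGE_INTS = 1024 };   /* assumed page size: 1024 ints */

typedef struct {
    int *committed;   /* last committed version of the page */
    int *shadow;      /* private working copy, NULL until the first write */
} versioned_page;

/* Start with a zero-filled committed page. Returns 0 on success. */
int vp_init(versioned_page *p) {
    p->committed = calloc(PAGE_INTS, sizeof(int));
    p->shadow = NULL;
    return p->committed ? 0 : -1;
}

/* Reads see the working copy if one exists, else the committed page. */
int vp_read(const versioned_page *p, size_t i) {
    return p->shadow ? p->shadow[i] : p->committed[i];
}

/* The first write copies the committed page; the original stays intact. */
int vp_write(versioned_page *p, size_t i, int value) {
    if (!p->shadow) {
        p->shadow = malloc(PAGE_INTS * sizeof(int));
        if (!p->shadow) return -1;
        memcpy(p->shadow, p->committed, PAGE_INTS * sizeof(int));
    }
    p->shadow[i] = value;
    return 0;
}

/* Commit: the working copy becomes the committed version. */
void vp_commit(versioned_page *p) {
    if (p->shadow) {
        free(p->committed);
        p->committed = p->shadow;
        p->shadow = NULL;
    }
}

/* Abort: drop the working copy; nothing has to be undone. */
void vp_abort(versioned_page *p) {
    free(p->shadow);
    p->shadow = NULL;
}

void vp_free(versioned_page *p) {
    free(p->shadow);
    free(p->committed);
    p->shadow = p->committed = NULL;
}
```

For example, after committing a value of 7, writing 9 and then aborting leaves the page reading 7 again, with no undo step.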
       Cache-Oblivious B-Tree Insert
  Static Cache-                                               21
  Oblivious
  Tree                      10                                                              45


            4                              16                                38                                54


   4              10                16            21                  38            45              54                83


1 -- -- 4       6 7 10 --        13 15 -- 16    -- -- 21 --        23 24 31 37    38 39 40 45    -- 48 54 --        56 59 70 83



  2(a) Add 37 to packed memory array.                                   2.5 ≤ n ≤ 7.5


                                                                                         6 ≤ n ≤ 14
                                                                                         Packed Memory Array
                                                                                         Density Thresholds
       Cache-Oblivious B-Tree Insert
  Static Cache-                                               21
  Oblivious
  Tree                      10                                                                  45


            4                              16                                38                                54


   4              10                16            21                  38             45                 54             83


1 -- -- 4       6 7 10 --        13 15 -- 16    -- -- 21 --        23 24 31 37    38 39 40 --        45 48 54 56    59 70 83 --



  2(a) Add 37 to packed memory array.                                   2.5 ≤ n ≤ 7.5

  2(b) Rebalance the PMA.
                                                                                          6 ≤ n ≤ 14
                                                                                          Packed Memory Array
                                                                                          Density Thresholds
        Dictionary Operations using a B+-Tree
                        4 10 20 42

                             B


1   3   4    5   8 10   12 15 20     21 33 35 41 42   51 70 77 85


The branching factor of the tree and the size of a block on
disk are both Θ(B). For a B+-tree, the data is stored at the
leaves. The keys at an interior node represent the maximum
key of that node’s subtree.

But to build the tree, we need to know the value of B…
    Cache-Oblivious B-tree [BDFC00]
Operations                    Cost in Block Transfers
Search(key)                   O(log_B N)
Insert(key, value)            O(log_B N)**
Delete(key, value)            O(log_B N)**
RangeQuery(start, end)        O(log_B N + k/B)*
*  Bound assumes the range query finds k items with keys between start and end.
** Amortized bound.

It is possible to support dictionary operations cache-
obliviously, i.e., with a data structure that does not know the
value of B. The cache-oblivious B-tree (COB-tree) achieves
the same asymptotic (amortized) bounds as a B+-tree.
   Cache-Oblivious B-Tree Overview
Static Cache-Oblivious
Binary Tree




The static tree is used as an index into a packed memory array.
To perform an insert, insert into the packed memory array, and update the static
tree. When the packed memory array becomes too full (empty), rebuild and grow
(shrink) the entire data structure.
 Static Cache-Oblivious Binary Tree
[Figure: static cache-oblivious binary tree. The top of the tree is one
subtree of size Θ(N^(1/2)); below it hang subtrees numbered 1, 2, 3, …,
N^(1/2). Each of these is in turn split into subtrees of size Θ(N^(1/4)),
numbered 1, 2, 3, …, N^(1/4).]

 Divide the tree into Θ(N^(1/2)) subtrees of size Θ(N^(1/2)). Lay out
 each subtree contiguously in memory, recursively.
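
A minimal sketch of this recursive layout for a complete binary tree follows. It assumes nodes are named by breadth-first index (root = 1, children of node v are 2v and 2v+1) and that the top part takes the floor of half the height; other presentations round the other way. The function names veb_layout and veb_rec are hypothetical.

```c
#include <stddef.h>

/* Append the nodes of the subtree rooted at `root` with `height` levels
   to out[], in recursive (van Emde Boas style) layout order. */
static void veb_rec(unsigned root, unsigned height, unsigned *out, size_t *pos) {
    if (height == 1) {
        out[(*pos)++] = root;
        return;
    }
    unsigned top = height / 2;        /* levels in the top subtree */
    unsigned bottom = height - top;   /* levels in each bottom subtree */

    veb_rec(root, top, out, pos);     /* top subtree first, contiguously */

    /* The bottom subtrees are rooted at the 2^top descendants of `root`
       that sit `top` levels down, visited left to right. */
    unsigned nsub = 1u << top;
    for (unsigned k = 0; k < nsub; k++) {
        veb_rec((root << top) + k, bottom, out, pos);
    }
}

/* Fill out[] with the layout order of a complete binary tree that has
   `height` levels (2^height - 1 nodes). Returns the number of nodes written. */
size_t veb_layout(unsigned height, unsigned *out) {
    size_t pos = 0;
    if (height == 0) return 0;
    veb_rec(1, height, out, &pos);
    return pos;
}
```

For the 15-node tree used in the figures (4 levels), this order is 1 2 3, 4 8 9, 5 10 11, 6 12 13, 7 14 15: the three-node top subtree, followed by each three-node bottom subtree stored contiguously.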
                 Packed Memory Array
  A packed memory array uses a contiguous section of
  memory to store elements in order, but with gaps.

  For sections of size 2^k, gaps are spaced to maintain
  specified density thresholds that become arithmetically
  stricter as k increases.
1 -- -- 4   6 7 10 --   13 15 -- 16   -- -- 21 --     23 24 31 38   39 40 -- 45   -- 48 54 --   56 59 70 83


[4/16, 16/16]               [5/16, 15/16]                                [6/16, 14/16]


                                         [7/16, 13/16]

   Density Thresholds Example:
     2^4:   Density between 4/16 and 16/16.
     2^5:   Density between 5/16 and 15/16.
     2^6:   Density between 6/16 and 14/16.
     2^7:   Density between 7/16 and 13/16.
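
The arithmetic threshold rule can be sketched as a small check. It assumes the bracketed thresholds shown in the diagram: the smallest sections allow densities in [4/16, 16/16], and each doubling of the section size raises the lower bound and lowers the upper bound by 1/16. The function name density_ok and the level parameter (0 for the smallest section, 1 for the next size up, and so on) are hypothetical.

```c
#include <stddef.h>

/* Return 1 if a section with `slots` slots holding `count` elements is
   within the density thresholds for `level`, else 0.
   Level 0 uses [4/16, 16/16]; each higher level tightens both bounds
   by 1/16. Integer arithmetic avoids rounding: the test is
   (4 + level)/16 <= count/slots <= (16 - level)/16.
   Requires level <= 6 so that the interval is non-empty. */
int density_ok(size_t count, size_t slots, unsigned level) {
    size_t lower = 4 + level;    /* numerator of the lower bound, over 16 */
    size_t upper = 16 - level;   /* numerator of the upper bound, over 16 */
    return 16 * count >= lower * slots && 16 * count <= upper * slots;
}
```

For an 8-slot section at level 1 the allowed range is 2.5 to 7.5 elements, so a section holding 8 elements is too dense and a rebalance over a larger window is needed. A 16-slot window at level 2 accepts between 6 and 14 elements.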
       Example Cache-Oblivious B-Tree
  Static Cache-                                               21
  Oblivious
  Tree                      10                                                              45


            4                              16                                38                                54


   4              10                16            21                  38            45              54                83


1 -- -- 4       6 7 10 --        13 15 -- 16    -- -- 21 --        23 24 31 38    39 40 -- 45    -- 48 54 --        56 59 70 83


[4/16, 16/16]                        [5/16, 15/16]                                     [6/16, 14/16]

  Packed Memory Array                              [7/16, 13/16]
  Density Thresholds
       Cache-Oblivious B-Tree Insert
  Static Cache-                                               21
  Oblivious
  Tree                      10                                                              45


            4                              16                                38                                54


   4              10                16            21                  38            45              54                83


1 -- -- 4       6 7 10 --        13 15 -- 16    -- -- 21 --        23 24 31 38    39 40 -- 45    -- 48 54 --        56 59 70 83


Insert 37:                                      37
1. Find correct section of PMA location using static tree.
2. Insert into PMA. This step possibly requires a rebalance.
       Example Cache-Oblivious B-Tree
  Static Cache-                                               21
  Oblivious
  Tree                      10                                                              45


            4                              16                                38                                54


   4              10                16            21                  38            45              54                83


1 -- -- 4       6 7 10 --        13 15 -- 16    -- -- 21 --        23 24 31 37    38 39 40 45    -- 48 54 --        56 59 70 83


                                                                       [5/16, 15/16]
  2(a) Add 37 to packed memory array.
                                                                                       [6/16, 14/16]
       Example Cache-Oblivious B-Tree
  Static Cache-                                               21
  Oblivious
  Tree                      10                                                                  45


            4                              16                                38                                54


   4              10                16            21                  38             45                 54             83


1 -- -- 4       6 7 10 --        13 15 -- 16    -- -- 21 --        23 24 31 37    38 39 40 --        45 48 54 56    59 70 83 --


                                                                       [5/16, 15/16]
  2(a) Add 37 to packed memory array.
  2(b) Rebalance the PMA.                                                              [6/16, 14/16]
       Cache-Oblivious B-Tree Insert
  Static Cache-                                               21
  Oblivious
  Tree                      10                                                                  40


            4                              16                                37                                56


   4              10                16            21                  37             40                 56             83


1 -- -- 4       6 7 10 --        13 15 -- 16    -- -- 21 --        23 24 31 37    38 39 40 --        45 48 54 56    59 70 83 --


Insert 37:
1. Find correct section of PMA location using static tree.
2. Insert into PMA. This step possibly requires a rebalance.
3. Fix the static tree.

				