Computer Science 360 Assignment 6 Information on B+ Trees by guym13

            Computer Science 360, Assignment 7, Information on B+ Trees


Background

    1. You can think of B+ trees as being the hard-coded equivalent of binary (or in
       general, base n) search.

    2. The B in the name means balanced. This signifies the idea that the nodes in the
       tree may vary in how many entries they contain, but all of the leaves are the same
       distance from the root.

    3. The balance of the tree is desirable because it places an upper bound on the
       number of pages that have to be read in order to get any value. The bound is
       O(log n).

    4. The + in the name of the data structure signifies that in addition to providing
       indexed access to file records, links are provided which allow the records to be
       accessed in sequential order without traversing the index tree.


Example

     Here is an example of a B+ tree at a certain stage of development. It is taken from
page 4 of part 1 of the assignment keys. The question of how insertions and deletions are
made will be addressed later. At this point it is simply desirable to see a tree and explain
its contents.


                                       (19, _, _)


                       (5, 9, 11)                     (29, _, _)


(2, 3, _)      (5, 7, _)       (9, 10, _)      (11, 17, _)    (19, 23, _)    (29, 31, _)




    The tree structure represents an index on a field in a table. The tree consists of nodes
which each fit on a single page of memory. In this diagram, the pairs of parentheses and
their contents represent the nodes in the tree. The integers are values of the field that is
being indexed. This field may not be a key field in the table, but in general, when
indexing, the field that is being indexed on can be referred to as the key. In the Korth and
Silberschatz handout they use the term key and refer to these values as the K_i. The nodes
also contain pointers. In this diagram the pointers are represented by arrows. In reality,
the pointers would be stored in the nodes as addresses. In Korth and Silberschatz they
refer to the pointers as the P_i.

    The top two rows of this tree form the index set. The bottom row forms the sequence
set. The pointers in the index nodes point to internal or leaf nodes of the tree. From the
sequence set it is possible to point to the pages containing the actual table records
containing those key values. This is indicated by the vertical arrows pointing down from
the leaf nodes. The horizontal arrows between the leaf nodes represent the linkage that
makes it possible to access the key values in sequential order using this index.
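
    As a rough illustration of the layout just described, the example tree can be written down as linked objects. This is only a sketch: the class names Leaf and Index and the helper functions are hypothetical, and the pointers from the leaves to the table records are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    """Sequence set node: key values plus a link to the next leaf."""
    keys: List[int]
    next: Optional["Leaf"] = None  # the horizontal arrow in the diagram


@dataclass
class Index:
    """Index set node: key values and one more child pointer than keys."""
    keys: List[int]
    children: list


def build_example():
    """Build the example tree; return (root, leftmost leaf)."""
    leaves = [Leaf([2, 3]), Leaf([5, 7]), Leaf([9, 10]),
              Leaf([11, 17]), Leaf([19, 23]), Leaf([29, 31])]
    # Chain the leaves so the keys can be read in sequential order.
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    left_index = Index([5, 9, 11], leaves[:4])
    right_index = Index([29], leaves[4:])
    return Index([19], [left_index, right_index]), leaves[0]


def scan(first_leaf):
    """Read every key in order by following leaf links only."""
    keys = []
    leaf = first_leaf
    while leaf is not None:
        keys.extend(leaf.keys)
        leaf = leaf.next
    return keys
```

    Calling scan on the leftmost leaf visits the keys in sorted order without touching any index set node, which is the point of the horizontal links.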

    Observe that in this example, each index node can contain up to n = 4 pointers, and it
can contain up to n – 1 = 3 key values. If every index node were completely full, there
would be 4 pointers in each. That means that the total number of sequence set nodes
possible would be 4 * 4 = 16, and with up to 3 key values in each they could hold up to
16 * 3 = 48 key values. All sequence set nodes are exactly 2 levels down from the root.
The bound on the number of page reads to get through the index tree is log_4 16 = 2,
where log_4 is the logarithm to base 4.
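
    The arithmetic above can be checked with a few lines of code. This is a sketch under the assumption that every index node is completely full and that each leaf holds up to n - 1 key values; the function names are made up for illustration.

```python
def max_leaf_nodes(n, index_levels):
    """Leaves reachable through `index_levels` levels of full index nodes."""
    return n ** index_levels


def max_key_values(n, index_levels):
    """Key values held if every leaf is full (n - 1 values per leaf)."""
    return max_leaf_nodes(n, index_levels) * (n - 1)


def index_levels_needed(n, leaf_nodes):
    """Smallest number of index levels that can reach `leaf_nodes` leaves."""
    levels = 0
    reach = 1
    while reach < leaf_nodes:
        reach *= n  # each extra level multiplies the fan-out by n
        levels += 1
    return levels
```

    With n = 4, two index levels reach 16 leaves, and 16 leaves need exactly two index levels, matching the logarithm in the text.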

   There are additional rules governing the formation of trees of this sort. Counting by
pointers, internal and leaf nodes are not allowed to fall below half full. If n is even, that
means that you are allowed to have no fewer than n / 2 pointers in a node. If n is odd,
you round up, and the minimum is (n + 1) / 2. The book uses the notation of the ceiling
function, ⌈n/2⌉, which means the same thing. Of course, this means that if you look at
how many key values are in a node, it is possible for it to appear less than half full.
Finally, it is permissible in general for the root node to fall below half full.
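
    The half-full rule can be written as a small calculation. This is a sketch; the function names are hypothetical, and integer arithmetic is used so that the rounding up for odd n is explicit.

```python
def min_pointers(n):
    """Ceiling of n / 2: the fewest pointers a non-root node may hold."""
    return (n + 1) // 2


def min_keys_in_index_node(n):
    """An index node always has one fewer key value than pointers."""
    return min_pointers(n) - 1
```

    For n = 4 the minimum is 2 pointers, which is only 1 key value out of a possible 3. This is why an index node can look less than half full when you count key values.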

    Another thing becomes apparent about B+ trees from looking at the example. In each
node the key values are in order. There is also a relationship between the order of the key
values in one node, the pointers coming from it, and the values in the nodes pointed to by
these pointers. This relationship is intrinsic to the meaning of the contents of the tree and
will be explained further below when covering the rules for inserting and deleting entries.

    It is also apparent that the index set is sparse while the sequence set is dense. In other
words, the leaves contain all key values occurring in the table being indexed. Some of
these key values occur in the index set, but the majority do not. If a key value does occur
in the index set, it can only occur there once. It will become evident when looking at the
rules for inserting values how this situation comes about. When the tree is growing, a
value in a sequence set node can be copied into the index set node above it. However,
when values are promoted from one index set node to another they are not copied; they
are moved.

    A final remark can be made in this vein. The example shows creating a B+ tree on the
primary key of a table, in other words, a field that is unique. All of the example problems
on this topic will do the same. If the index were on a non-unique field, the difference
would show up only in the sequence set. It would be necessary at the leaf level to
arrange for multiple pointers from a single key value, pointing to the multiple records that
contained that key value.
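
    One way to picture the non-unique case is to let each leaf entry carry a list of record pointers instead of a single one. The sketch below is only an illustration: the function name is made up, and record positions in a list stand in for record addresses.

```python
from collections import defaultdict


def build_leaf_entries(rows, field_index):
    """Group record positions by the value of the indexed field.

    Returns a sorted list of (key value, [record positions]) pairs, which
    is what the sequence set would have to store for a non-unique field.
    """
    entries = defaultdict(list)
    for record_id, row in enumerate(rows):
        entries[row[field_index]].append(record_id)
    return sorted(entries.items())
```

    For rows [("a", 5), ("b", 3), ("c", 5)] indexed on the second field, the key value 5 ends up with two record pointers while 3 has one. The index set above the leaves would be unaffected.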

    Some authors present the rules for creating and maintaining B+ trees as a set of
mathematical algorithms. Others give pseudo-code or code for implementations. There
is also a certain degree of choice in both the algorithm and its implementation. What will
be given here are sets of rules of thumb that closely parallel Korth and Silberschatz. The
kinds of test questions you should be able to answer about B+ trees would be like the
assignment questions. In other words, given the number of key values and pointers that a
node can contain, and given a sequence of unique key values to insert and delete, you
need to be able to create and update the corresponding B+ tree index.


Summary of the Characteristics of a Correctly Formed Tree

   Here are some general rules of thumb that explain the contents of a tree. More
specific rules for insertion and deletion are given in the lists that follow. At the outset,
however, it’s helpful to have a few overall observations.

   1.      At the very beginning the whole tree structure would consist of only one node,
           which would be both the index set and the sequence set at the same time.
           After the first node is split there is a distinction. The meaning of pointers
           coming from and between sequence set nodes has already been given above
           and no further explanation is needed. The remaining remarks below address
           the considerations of index set nodes specifically.

   2.      If a key value appears in a node, it has to have pointers on each side of it. In
           other words, the existence of a value in a node fundamentally signals “branch
           left” or “branch right”. In the algorithm for the insertion of values it will
           become apparent that as the tree grows, a new value in an index set node is
           promoted from a lower node to indicate branching to the left or right.

   3.      The pointer to the left of a key value points to the subtree where all of the
           entries are strictly less than that key value. The pointer to the right of a key
           value points to the subtree where all of the entries are greater than or equal to
           that key value. The “greater than or equal to” is part of the logic of the tree
           that allows sequence set values to appear in the index set, thereby creating the
           index.

   4.      As insertions are made, it is possible for a node to become full. If it is
           necessary to insert another value into a full node, that node has to be split in
           two. The detailed rules for splitting are given below.

   5.      Deletions can reduce a node to less than half full. If this happens, sibling
           nodes have to be merged. The detailed rules for merging are given below.
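
    Rules 2 and 3 above amount to a search procedure: at each index set node, branch left of a key value when the search value is strictly less, and right when it is greater than or equal. The sketch below applies this to the example tree. The dictionary layout and the names leaf, index and find_leaf are assumptions made for illustration.

```python
from bisect import bisect_right


def leaf(*keys):
    """A sequence set node: key values only, no children."""
    return {"keys": list(keys), "children": None}


def index(keys, children):
    """An index set node: key values and one more child than keys."""
    return {"keys": list(keys), "children": children}


def find_leaf(node, value):
    """Return (leaf where `value` belongs, number of nodes read)."""
    pages_read = 1
    while node["children"] is not None:
        # Count the keys <= value: equal keys send the search to the right.
        child_pos = bisect_right(node["keys"], value)
        node = node["children"][child_pos]
        pages_read += 1
    return node, pages_read


example = index([19], [
    index([5, 9, 11], [leaf(2, 3), leaf(5, 7), leaf(9, 10), leaf(11, 17)]),
    index([29], [leaf(19, 23), leaf(29, 31)]),
])
```

    Searching for 19 branches right at the root and left at (29, _, _), ending at the leaf (19, 23, _). Every search reads the same number of nodes because all leaves are at the same depth.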


Inserting and Deleting

    There is an important conceptual difference between balanced trees and other tree
structures you might be familiar with. In other trees you work from the root down when
inserting and deleting. This leads to the characteristic that different branches of the tree
may be of different length.

    In order to maintain balance in a tree, it’s necessary to work from the leaves up. You
use the tree to search downward to the leaf node where a value either would fall, or is.
You then either insert or delete accordingly, and adjust the index set above to correspond
to the new situation in the leaves. Enforcing the requirements on the fullness of nodes
leads to either splitting or merging. As a consequence of the adjustment to the index set,
the depth of the whole tree might grow or shrink depending on whether the
inserting/splitting or deleting/merging propagate all the way back up to the current root
node of the tree.


Rules of Thumb for Inserting

   Here is a list of the rules of thumb involved in inserting a new value into the tree.

   1.      Search through the tree as it exists until you find the sequence set node where
           the key value belongs.

   2.      If there is room in the node, simply insert the key value in order. Such an
           insertion has no effect upwards in the index set.

   3.      If the destination leaf node is full, split it into 2 nodes and divide the key
           values evenly between them.

   4.      Notice that in all of the examples the nodes hold an odd number of values,
           n – 1 = 3. This makes it easy to split the values evenly when one more
           value than the node can hold (the 4th, in the examples) is to be added. A
           real implementation would have to deal with the possibility of uneven
           splits, but you do not.

   5.      When a node is split, the two resulting nodes remain at the same level in the
           tree and become siblings.

   6.      The critical outcome of a split is that the new siblings’ parent node, its values,
           and its pointers have to be updated to correctly refer to the two new children.

   7.      In general, when a node is split, the leftmost value in the new right sibling is
           promoted to the parent. The fact that it is always the leftmost value that is
           promoted is explained by the fact that after promotion its right pointer points to
           a subtree containing values greater than or equal to that value. Promoting
           itself takes on two different meanings. When a value is inserted into a
           sequence set node and is promoted from there into the index set, what is
           promoted is a copy of that value. This explains how sequence set values
           appear in the index set. However, if further up a value is promoted from one
           index set node into another, it is moved, not copied. This explains why a value
           can appear at most twice in the tree, once in the sequence set and only once in
           the index set.

   8.      The splitting and promoting process is recursive. If the parent is already full
           and a value is to be added to it, the parent is split into two siblings and its
           parent is adjusted accordingly.

   9.      When you split and promote, if the promotion causes another split in the
           parent, you end up with the following situation: The leftmost pointer in the
           new right parent appears to be able to point to the same child as the rightmost
           pointer of the new left parent. In other words, when the parent is split, 2 new
           pointers arise when the number of children only rises by one. However, the
           problem is resolved because the split in the parent requires that the leftmost
           value in the new right parent be promoted, and this promotion is a move,
           not a copy.

   10.     If the splitting and promoting process trickles all of the way back up to the root
           and the root is split, then a new root node is created. The last value to promote
           is put into this new root. This growth at the root explains why balance is
           maintained in the tree and no branches become longer than any others. It also
           explains why it is necessary to allow the root to be less than half full: A brand
           new root node will only contain the single value that is promoted to it.
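
    The insertion rules above can be sketched in code. This is a minimal illustration, not a full implementation: it assumes n = 4 (3 key values per node) as in the examples, the names Node, _insert and insert are hypothetical, and the links between leaves and the pointers to records are left out.

```python
from bisect import bisect_right, insort

MAX_KEYS = 3  # n - 1 with n = 4 pointers, as in the handout's examples


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children  # None marks a sequence set (leaf) node


def _insert(node, value):
    """Insert below `node`; return (promoted key, new right sibling) or None."""
    if node.children is None:                  # rules 1-2: at the leaf
        insort(node.keys, value)
        if len(node.keys) <= MAX_KEYS:
            return None
        mid = len(node.keys) // 2              # rule 3: split evenly
        right = Node(node.keys[mid:])
        node.keys = node.keys[:mid]
        return right.keys[0], right            # rule 7: a COPY goes up
    pos = bisect_right(node.keys, value)
    result = _insert(node.children[pos], value)
    if result is None:
        return None
    key, new_child = result                    # rule 6: update the parent
    node.keys.insert(pos, key)
    node.children.insert(pos + 1, new_child)
    if len(node.keys) <= MAX_KEYS:
        return None
    mid = len(node.keys) // 2                  # rule 8: split the parent too
    promoted = node.keys[mid]                  # rule 7: MOVED, not copied
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return promoted, right


def insert(root, value):
    """Insert `value` and return the (possibly new) root."""
    result = _insert(root, value)
    if result is None:
        return root
    key, right = result                        # rule 10: grow at the root
    return Node([key], [root, right])
```

    Starting from an empty leaf, Node([]), and inserting 1 through 10 in order gives a root holding only 7, with index nodes (3, 5) and (9) below it. The value 7 appears in the root and in a leaf but not in the second-level index nodes, which shows the difference between copying out of a leaf and moving between index nodes.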


Deleting

    As described above, regardless of the number of children a node might have, the
splitting of nodes is binary, resulting in two new sibling nodes. This is a reasonable
approach to managing an insertion algorithm. Deletion and merging introduce a slight
complication. If a deletion causes a node to fall below half full, it needs to be merged
with another node, but which one? It will have at least one sibling, but it may have more
than one, on one side or on both. Should it be merged only with an immediate neighbor,
and if so, should it be the one on the left or the right? The rules of thumb below embody
the arbitrary decision to merge with the sibling on the immediate right, if there is one, and
otherwise take the one on the immediate left.

    In developing rules of thumb for this there is another consideration with deletion that
leads to more complication than with insertion. It may be that the sibling that you merge
with has the minimum permissible number of values in it. If this is the case the total
number of values would fit into one node and you would truly merge. If, however, the
sibling to be merged with is over half full, merging alone would not result in the loss of a
node. The values would simply have to be redistributed between the nodes. The
situation where the two nodes would actually merge into one would be rare in practice.
However, it is quite possible with examples where the nodes can only contain a small
number of values and pointers.
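
    The choice between a true merge and a redistribution can be sketched as follows. This is only an illustration for sequence set nodes, assuming 3 key values per node as in the examples; the function name is hypothetical.

```python
MAX_KEYS = 3  # n - 1 with n = 4 pointers, as in the handout's examples


def merge_or_redistribute(left_keys, right_keys):
    """Combine two sibling leaves; return the list of resulting leaves.

    `left_keys` must be the sibling on the left so that order is preserved.
    """
    combined = left_keys + right_keys
    if len(combined) <= MAX_KEYS:
        return [combined]                      # true merge: one node is lost
    mid = len(combined) // 2                   # otherwise share the values
    return [combined[:mid], combined[mid:]]
```

    Merging (2) with (5, 7) yields the single node (2, 5, 7), while combining (2) with (5, 7, 8) yields (2, 5) and (7, 8). In the second case no node is lost, but the parent's key value still has to change from 5 to 7.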

     Just as with splitting, merging can trickle all of the way back up to the root. If it
reaches the point where the immediate children of the root are merged into a single node,
then the original root is no longer needed. This is how the tree shrinks in a balanced way.
Situations where nodes are merged and the values are redistributed between them will
still require that the values and pointers in their parent be adjusted. Finally, a simple
deletion from the sequence set which does not even cause a merge can have an effect on
the index set. This is because values in the index set have to be values that exist in the
sequence set. If the value disappears from the sequence set, then it also has to be
replaced in the index set. This is as true for the root node as for any other.

     Here is one final note of explanation that is directly related to the examples given. In
order to make the examples more interesting, the following assumption has been made:
You measure the fullness of a sequence set node strictly according to the same standard
as an index node. In a node that can contain 3 key values and 4 pointers, if a sequence set
node falls to one value, then technically it only has one pointer in it, the pointer to the
record. Thus, this node has to be merged with a sibling. This is in contrast to an index
set node, which might have only one key value in it, but is considered half full as long as
it still has two pointers in it.
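
     The assumption just described can be stated as two small checks. This is a sketch with made-up function names: a sequence set node is counted by its record pointers (one per key value), while an index set node is counted by its child pointers (one more than its key values).

```python
def min_pointers(n):
    """Ceiling of n / 2."""
    return (n + 1) // 2


def leaf_is_underfull(keys, n):
    """A leaf has one record pointer per key value."""
    return len(keys) < min_pointers(n)


def index_is_underfull(keys, n):
    """An index node has one more child pointer than key values."""
    return len(keys) + 1 < min_pointers(n)
```

     With n = 4, a leaf holding a single value is underfull and has to be merged, whereas an index node holding a single value still has two pointers and is acceptable.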

Rules of Thumb for Deleting

   Here is a list of the rules of thumb involved in deleting a value from the tree.

   1. Search through the tree as it exists until you find the sequence set node where the
      key value exists.

   2. Delete the value. If the value can be deleted without having the node drop below
      half full, no merging is needed. However, if the deleted value was the leftmost in
      a sequence set node (other than the leftmost sequence set node), that value
      appears in the index set and has to be replaced there. Its replacement will end up
      being the new leftmost value in the sequence set node from which the value was
      deleted.

   3. If the deletion causes the node to drop below half full, merge it with a sibling,
      taking the sibling immediately on the right if there is one. Otherwise take the one
      on the left.

   4. If the total number of values merged together can fit into a single node, then leave
      them in a single node and adjust the values and the pointers in the parent
      accordingly.

5. If the total number of values merged together still have to be put into two nodes,
   then redistribute the values evenly between the two nodes and adjust the values
   and the pointers in the parent accordingly.

6. Now check the parent to see whether due to the adjustments it has fallen below
   half full. Recall that the measure of fullness has to do with the number of
   pointers. In most of the small scale examples given, the sure sign of trouble is
   when a parent has only one child. A tree which doesn’t branch at each level is by
   definition not balanced.

7. If the parent is no longer half full, repeat the process described above, and merge
   at the parent level. This is the recursive part of the process.

8. Deletions can be roughly grouped into four categories with corresponding
   concerns:
   a. A deletion of a value that doesn’t appear in the index set and which doesn’t
      cause a merge: This requires no further action.
   b. A deletion of a value that appears in the index set and which doesn’t cause a
      merge: Promote another value into its spot in the index set.
   c. A deletion which causes a redistribution of values between nodes: This will
      affect the immediate parent; this may also be a value that appeared higher in
      the index set, requiring the promotion of a replacement.
   d. A deletion which causes the merging of two nodes: Work back up the tree,
      recursively merging as necessary; also promote a value if necessary to replace
      the deleted one in the index set.

9. If the merging process trickles all of the way back up to the root and the children
   of the current root are merged into one node, then the current root is replaced with
   this new node. This illustrates how balance is maintained when deleting, because
   the length of all branches of the tree is decreased at the same time when the root is
   replaced in this way.
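
The leaf-level part of these rules (1 through 5) can be sketched for a deliberately simplified tree that has a single index node above the leaves. Everything here is an assumption made for illustration: the function name, the representation of the tree as a list of parent key values plus a list of leaves, and the limits for n = 4. The recursive merging of index nodes in rules 6, 7 and 9 is not covered.

```python
from bisect import bisect_right

MAX_KEYS = 3       # n - 1 with n = 4 pointers
MIN_LEAF_KEYS = 2  # leaf fullness measured like index node pointers


def delete_value(parent_keys, leaves, value):
    """Delete `value`; return the new (parent_keys, leaves).

    `parent_keys` are the key values of the one index node and `leaves`
    is the ordered list of leaf key lists below it.
    """
    leaves = [list(keys) for keys in leaves]
    pos = bisect_right(parent_keys, value)     # rule 1: find the leaf
    leaves[pos].remove(value)                  # rule 2: delete the value
    if len(leaves[pos]) < MIN_LEAF_KEYS and len(leaves) > 1:
        # Rule 3: take the right sibling if there is one, else the left.
        left = pos if pos + 1 < len(leaves) else pos - 1
        combined = leaves[left] + leaves[left + 1]
        if len(combined) <= MAX_KEYS:          # rule 4: true merge
            leaves[left:left + 2] = [combined]
        else:                                  # rule 5: redistribute
            mid = len(combined) // 2
            leaves[left:left + 2] = [combined[:mid], combined[mid:]]
    # With one index level, each parent key is the leftmost value of the
    # leaf to its right, so the parent can simply be rebuilt.
    return [keys[0] for keys in leaves[1:]], leaves
```

Applied to the left half of the example tree, with parent (5, 9, 11), deleting 7 merges (5) with (9, 10) and leaves the parent as (5, 11). If the parent key list comes back empty, only one leaf remains and it would become the new root, as in rule 9.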

								