Computer Science 360, Assignment 7, Information on B+ Trees

Background

1. You can think of B+ trees as being the hard-coded equivalent of binary (or in general, base n) search.
2. The B in the name means balanced. This signifies that the nodes in the tree may vary in how many entries they contain, but all of the leaves are the same distance from the root.
3. The balance of the tree is desirable because it places an upper bound on the number of pages that have to be read in order to get any value. The bound is O(log n).
4. The + in the name of the data structure signifies that in addition to providing indexed access to file records, links are provided which allow the records to be accessed in sequential order without traversing the index tree.

Example

Here is an example of a B+ tree at a certain stage of development. It is taken from page 4 of part 1 of the assignment keys. The question of how insertions and deletions are made will be addressed later. At this point it is simply desirable to see a tree and explain its contents.

                                             (19, _, _)
                                        /                   \
                        (5, 9, 11)                                (29, _, _)
                /       /        \        \                       /         \
    (2, 3, _) -> (5, 7, _) -> (9, 10, _) -> (11, 17, _) -> (19, 23, _) -> (29, 31, _)
        |            |            |              |              |              |
        v            v            v              v              v              v

The tree structure represents an index on a field in a table. The tree consists of nodes which each fit on a single page of memory. In this diagram, the pairs of parentheses and their contents represent the nodes in the tree. The integers are values of the field that is being indexed. This field may not be a key field in the table, but in general, when indexing, the field that is being indexed on can be referred to as the key. In the Korth and Silberschatz handout they use the term key and refer to these values as the K_i.

The nodes also contain pointers. In this diagram the pointers are represented by connecting lines and arrows. In reality, the pointers would be stored in the nodes as addresses. In Korth and Silberschatz they refer to the pointers as the P_i. The top two rows of this tree form the index set. The bottom row forms the sequence set.
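
As a rough illustration only, the example tree can be written down as a small in-memory structure. The class names `Leaf` and `Index`, the helper `link`, and the function `scan` are hypothetical names chosen for this sketch, and the pointers from the leaves to the table records are left out. The sketch assumes Python dataclasses stand in for pages.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Leaf:
    """Sequence set node: sorted key values plus a link to the next leaf."""
    keys: List[int]
    next: Optional["Leaf"] = None  # the horizontal link between leaves


@dataclass
class Index:
    """Index set node: k key values separate k + 1 child pointers."""
    keys: List[int]
    children: list = field(default_factory=list)


def link(leaves):
    """Chain the leaves left to right so they can be read sequentially."""
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves


# The six sequence set nodes of the example, in order.
leaves = link([Leaf([2, 3]), Leaf([5, 7]), Leaf([9, 10]),
               Leaf([11, 17]), Leaf([19, 23]), Leaf([29, 31])])

# The two index levels of the example.
root = Index([19], [Index([5, 9, 11], leaves[:4]),
                    Index([29], leaves[4:])])


def scan(first_leaf):
    """Yield every key in order by following leaf links, never the index."""
    node = first_leaf
    while node is not None:
        yield from node.keys
        node = node.next
```

Calling `list(scan(leaves[0]))` walks the sequence set from left to right, which is the sequential access that the + in the name refers to.
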
The pointers in the index nodes point to internal or leaf nodes of the tree. From the sequence set it is possible to point to the pages containing the actual table records containing those key values. This is indicated by the vertical arrows pointing down from the leaf nodes. The horizontal arrows between the leaf nodes represent the linkage that makes it possible to access the key values in sequential order using this index.

Observe that in this example, each index node can contain up to n = 4 pointers, and it can contain up to n - 1 = 3 key values. If every index node were completely full, there would be 4 pointers in each. That means that the total number of sequence set nodes possible would be 4 * 4 = 16. All sequence set nodes are exactly 2 levels down from the root. The bound on the number of page reads to get through the index tree is log_4 16 = 2.

There are additional rules governing the formation of trees of this sort. Counting by pointers, internal and leaf nodes are not allowed to fall below half full. If n is even, that means that you are allowed to have no fewer than n / 2 pointers in a node. If n is odd, you round up, and the minimum is (n + 1) / 2. The book uses the notation of the ceiling function, ⌈n/2⌉, which means the same thing. Of course, this means that if you look at how many key values are in a node, it is possible for it to appear less than half full. Finally, it is permissible in general for the root node to fall below half full.

Another thing becomes apparent about B+ trees from looking at the example. In each node the key values are in order. There is also a relationship between the order of the key values in one node, the pointers coming from it, and the values in the nodes pointed to by these pointers. This relationship is intrinsic to the meaning of the contents of the tree and will be explained further below when covering the rules for inserting and deleting entries.
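
The capacity and fullness arithmetic above is easy to check mechanically. The following is a minimal sketch; the function names are hypothetical, and n is the maximum number of pointers per node, as in the text.

```python
def max_keys(n):
    """A node with room for n pointers holds at most n - 1 key values."""
    return n - 1


def min_pointers(n):
    """Half full, counted by pointers: n / 2 rounded up."""
    return (n + 1) // 2


def max_fanout(n, levels):
    """Nodes reachable `levels` levels below the root if every index node is full."""
    return n ** levels


def index_levels(n, node_count):
    """Smallest number of full index levels that can reach node_count nodes."""
    levels, reach = 0, 1
    while reach < node_count:
        reach *= n
        levels += 1
    return levels
```

For the example, `max_keys(4)` is 3, `min_pointers(4)` is 2, `max_fanout(4, 2)` is the 4 * 4 = 16 from the text, and `index_levels(4, 16)` is the bound of 2 page reads through the index.
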
It is also apparent that the index set is sparse while the sequence set is dense. In other words, the leaves contain all key values occurring in the table being indexed. Some of these key values occur in the index set, but the majority do not. If a key value does occur in the index set, it can only occur there once. It will become evident when looking at the rules for inserting values how this situation comes about. When the tree is growing, a value in a sequence set node can be copied into the index set node above it. However, when values are promoted from one index set node to another they are not copied; they are moved.

A final remark can be made in this vein. The example shows creating a B+ tree on the primary key of a table, in other words, a field that is unique. All of the example problems on this topic will do the same. If the index were on a non-unique field, the difference would show up only in the sequence set. It would be necessary at the leaf level to arrange for multiple pointers from a single key value, pointing to the multiple records that contained that key value.

Some authors present the rules for creating and maintaining B+ trees as a set of mathematical algorithms. Others give pseudo-code or code for implementations. There is also a certain degree of choice in both the algorithm and its implementation. What will be given here are sets of rules of thumb that closely parallel Korth and Silberschatz.

The kinds of test questions you should be able to answer about B+ trees would be like the assignment questions. In other words, given the number of key values and pointers that a node can contain, and given a sequence of unique key values to insert and delete, you need to be able to create and update the corresponding B+ tree index.

Summary of the Characteristics of a Correctly Formed Tree

Here are some general rules of thumb that explain the contents of a tree. More specific rules for insertion and deletion are given in following lists.
At the outset, however, it’s helpful to have a few overall observations.

1. At the very beginning the whole tree structure would consist of only one node, which would be both the index set and the sequence set at the same time. After the first node is split there is a distinction. The meaning of pointers coming from and between sequence set nodes has already been given above and no further explanation is needed. The remaining remarks below address the considerations of index set nodes specifically.
2. If a key value appears in a node, it has to have pointers on each side of it. In other words, the existence of a value in a node fundamentally signals “branch left” or “branch right”. In the algorithm for the insertion of values it will become apparent that as the tree grows, a new value in an index set node is promoted from a lower node to indicate branching to the left or right.
3. The pointer to the left of a key value points to the subtree where all of the entries are strictly less than that key value. The pointer to the right of a key value points to the subtree where all of the entries are greater than or equal to that key value. The “greater than or equal to” is part of the logic of the tree that allows sequence set values to appear in the index set, thereby creating the index.
4. As insertions are made, it is possible for a node to become full. If it is necessary to insert another value into a full node, that node has to be split in two. The detailed rules for splitting are given below.
5. Deletions can reduce a node to less than half full. If this happens, sibling nodes have to be merged. The detailed rules for merging are given below.

Inserting and Deleting

There is an important conceptual difference between balanced trees and other tree structures you might be familiar with. In other trees you work from the root down when inserting and deleting. This leads to the characteristic that different branches of the tree may be of different length.
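
Both inserting and deleting begin with a downward search, and that search is just the branching rule from observation 3 applied at each index node. Here is a minimal sketch, assuming a hypothetical representation in which an index node is a dict with "keys" and "children" and a leaf is a plain list of key values. The name `find_leaf` is made up for this illustration.

```python
from bisect import bisect_right


def find_leaf(node, key):
    """Follow index pointers down to the leaf where key is, or would be."""
    while isinstance(node, dict):  # still in the index set
        # Count the key values <= key. Branching to that child puts
        # strictly smaller values to the left of each key value and
        # greater-or-equal values to its right.
        i = bisect_right(node["keys"], key)
        node = node["children"][i]
    return node


# The example tree from the diagram above, in the assumed representation.
tree = {"keys": [19], "children": [
    {"keys": [5, 9, 11],
     "children": [[2, 3], [5, 7], [9, 10], [11, 17]]},
    {"keys": [29],
     "children": [[19, 23], [29, 31]]},
]}
```

Searching for 9 branches left at the root (9 < 19) and then to the right of the 9 in the node (5, 9, 11), arriving at the leaf (9, 10). Searching for a value that is absent, such as 4, arrives at the leaf where it would be inserted.
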
In order to maintain balance in a tree, it’s necessary to work from the leaves up. You use the tree to search downward to the leaf node where a value either would fall, or is. You then either insert or delete accordingly, and adjust the index set above to correspond to the new situation in the leaves. Enforcing the requirements on the fullness of nodes leads to either splitting or merging. As a consequence of the adjustment to the index set, the depth of the whole tree might grow or shrink depending on whether the inserting/splitting or deleting/merging propagates all the way back up to the current root node of the tree.

Rules of Thumb for Inserting

Here is a list of the rules of thumb involved in inserting a new value into the tree.

1. Search through the tree as it exists until you find the sequence set node where the key value belongs.
2. If there is room in the node, simply insert the key value in order. Such an insertion has no effect upwards in the index set.
3. If the destination leaf node is full, split it into 2 nodes and divide the key values evenly between them.
4. Notice that in all of the examples the nodes hold an odd number of values. This makes it easy to split the values evenly when one more value is to be added to a full node, because the total is then even. A real implementation would have to deal with the possibility of uneven splits, but you do not.
5. When a node is split, the two resulting nodes remain at the same level in the tree and become siblings.
6. The critical outcome of a split is that the new siblings’ parent node, its values, and its pointers have to be updated to correctly refer to the two new children.
7. In general, when a node is split, the leftmost value in the new right sibling is promoted to the parent. The fact that it is always the leftmost value that is promoted is explained by the fact that after promotion its right pointer points to a subtree containing values greater than or equal to that value. Promoting itself takes on two different meanings.
When a value is inserted into a sequence set node and is promoted from there into the index set, what is promoted is a copy of that value. This explains how sequence set values appear in the index set. However, if further up a value is promoted from one index set node into another, it is moved, not copied. This explains why a value can appear at most twice in the tree, once in the sequence set and only once in the index set.
8. The splitting and promoting process is recursive. If the parent is already full and a value is to be added to it, the parent is split into two siblings and its parent is adjusted accordingly.
9. When you split and promote, if the promotion causes another split in the parent, you end up with the following situation: The leftmost pointer in the new right parent appears to be able to point to the same child as the rightmost pointer of the new left parent. In other words, when the parent is split, 2 new pointers arise when the number of children only rises by one. However, the problem is resolved because the split in the parent requires that the leftmost value in the new right parent also be promoted, and this promotion is a move, not a copy.
10. If the splitting and promoting process trickles all of the way back up to the root and the root is split, then a new root node is created. The last value to promote is put into this new root. This growth at the root explains why balance is maintained in the tree and no branches become longer than any others. It also explains why it is necessary to allow the root to be less than half full: A brand new root node will only contain the single value that is promoted to it.

Deleting

As described above, regardless of the number of children a node might have, the splitting of nodes is binary, resulting in two new sibling nodes. This is a reasonable approach to managing an insertion algorithm. Deletion and merging introduce a slight complication.
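
Before going further into deletion, the binary split and the two meanings of promotion from the insertion rules can be made concrete. This is only a sketch under stated assumptions: nodes are plain Python lists, the overfull node holds an even number of values as in the handout's examples, and the names `split_leaf` and `split_index` are hypothetical.

```python
def split_leaf(keys):
    """Split an overfull sequence set node evenly.

    Returns (left, right, promoted). The promoted value is a COPY:
    it remains the leftmost value of the new right leaf.
    """
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]


def split_index(keys, children):
    """Split an overfull index set node.

    Returns (left, right, promoted), where left and right are
    (keys, children) pairs. The promoted value is MOVED: it appears
    in neither new sibling, so each sibling again has exactly one
    more pointer than it has key values.
    """
    mid = len(keys) // 2
    promoted = keys[mid]
    left = (keys[:mid], children[:mid + 1])
    right = (keys[mid + 1:], children[mid + 1:])
    return left, right, promoted
```

For instance, `split_leaf([5, 7, 8, 9])` gives the leaves [5, 7] and [8, 9] and promotes a copy of 8. Splitting an index node with key values [5, 9, 11, 15] and five children moves 11 up, leaving [5, 9] with three children and [15] with two, which is how the apparent extra pointer in rule 9 is resolved.
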
If a deletion causes a node to fall below half full, it needs to be merged with another node, but which one? It will have at least one sibling, but it may have more than one, with one or more on each side. Should it be merged only with an immediate neighbor, and if so, should it be the one on the left or the right? The rules of thumb below embody the arbitrary decision to merge with the sibling on the immediate right, if there is one, and otherwise take the one on the immediate left.

In developing rules of thumb for this there is another consideration with deletion that leads to more complication than with insertion. It may be that the sibling that you merge with has the minimum permissible number of values in it. If this is the case the total number of values would fit into one node and you would truly merge. If, however, the sibling to be merged with is over half full, merging alone would not result in the loss of a node. The values would simply have to be redistributed between the nodes. The situation where the two nodes would actually merge into one would be rare in practice. However, it is quite possible with examples where the nodes can only contain a small number of values and pointers.

Just as with splitting, merging can trickle all of the way back up to the root. If it reaches the point where the immediate children of the root are merged into a single node, then the original root is no longer needed. This is how the tree shrinks in a balanced way.

Situations where nodes are merged and the values are redistributed between them will still require that the values and pointers in their parent be adjusted.

Finally, a simple deletion from the sequence set which does not even cause a merge can have an effect on the index set. This is because values in the index set have to be values that exist in the sequence set. If the value disappears from the sequence set, then it also has to be replaced in the index set. This is as true for the root node as for any other.
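
The choice between a true merge and a redistribution can be sketched for two sibling sequence set nodes. This is a minimal illustration, not a full deletion routine: the function name `merge_or_redistribute` is hypothetical, nodes are plain lists of key values, and the adjustment of the parent is only indicated in the comments.

```python
def merge_or_redistribute(left, right, max_keys):
    """Combine two adjacent sibling leaves after one has become underfull.

    Returns a list holding one node (a true merge) or two nodes
    (a redistribution). `left` and `right` are in tree order,
    whichever of the two was the underfull node.
    """
    combined = left + right  # already sorted, since left precedes right
    if len(combined) <= max_keys:
        # Everything fits in one node: the parent loses a value and a pointer.
        return [combined]
    # Too many values for one node: share them out evenly. The parent
    # keeps both pointers, but the value separating the two siblings
    # becomes the new leftmost value of the right node.
    mid = len(combined) // 2
    return [combined[:mid], combined[mid:]]
```

With nodes that hold at most 3 values, `merge_or_redistribute([9], [11, 17], 3)` truly merges into [9, 11, 17], while `merge_or_redistribute([9], [11, 17, 18], 3)` redistributes into [9, 11] and [17, 18], so 17 would replace 11 as the separating value in the parent.
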
Here is one final note of explanation that is directly related to the examples given. In order to make the examples more interesting, the following assumption has been made: You measure the fullness of a sequence set node strictly according to the same standard as an index node. In a node that can contain 3 key values and 4 pointers, if a sequence set node falls to one value, then technically it only has one pointer in it, the pointer to the record. Thus, this node has to be merged with a sibling. This is in contrast to an index set node, which might have only one key value in it, but is considered half full as long as it still has two pointers in it.

Rules of Thumb for Deleting

Here is a list of the rules of thumb involved in deleting a value from the tree.

1. Search through the tree as it exists until you find the sequence set node where the key value exists.
2. Delete the value. If the value can be deleted without having the node drop below half full, no merging is needed. However, if the deleted value was the leftmost in a sequence set node (other than the leftmost sequence set node), that value appears in the index set and has to be replaced there. Its replacement will end up being the new leftmost value in the sequence set node from which the value was deleted.
3. If the deletion causes the node to drop below half full, merge it with a sibling, taking the sibling immediately on the right if there is one. Otherwise take the one on the left.
4. If the total number of values merged together can fit into a single node, then leave them in a single node and adjust the values and the pointers in the parent accordingly.
5. If the total number of values merged together still has to be put into two nodes, then redistribute the values evenly between the two nodes and adjust the values and the pointers in the parent accordingly.
6. Now check the parent to see whether due to the adjustments it has fallen below half full.
Recall that the measure of fullness has to do with the number of pointers. In most of the small scale examples given, the sure sign of trouble is when a parent has only one child. A tree which doesn’t branch at each level is by definition not balanced.
7. If the parent is no longer half full, repeat the process described above, and merge at the parent level. This is the recursive part of the process.
8. Deletions can be roughly grouped into four categories with corresponding concerns.
   a. A deletion of a value that doesn’t appear in the index set and which doesn’t cause a merge: This requires no further action.
   b. A deletion of a value that appears in the index set and which doesn’t cause a merge: Promote another value into its spot in the index set.
   c. A deletion which causes a redistribution of values between nodes: This will affect the immediate parent; this may also be a value that appeared higher in the index set, requiring the promotion of a replacement.
   d. A deletion which causes the merging of two nodes: Work back up the tree, recursively merging as necessary; also promote a value if necessary to replace the deleted one in the index set.
9. If the merging process trickles all of the way back up to the root and the children of the current root are merged into one node, then the current root is replaced with this new node. This illustrates how balance is maintained when deleting, because the length of all branches of the tree is decreased at the same time when the root is replaced in this way.