# Indexing and Hashing

## Contents

•   Basic Concepts
•   Ordered Indices
•   B+-Tree Index Files
•   B-Tree Index Files

## B+-Tree Node Structure
•   Typical node:

    P1 | K1 | P2 | … | Pn-1 | Kn-1 | Pn

•   The Ki are the search-key values.
•   The Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
•   The search keys in a node are ordered: K1 < K2 < … < Kn-1.
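As a minimal sketch of how the sorted keys guide a search (Python; the `Node` class and its field names are hypothetical, not from the slides), picking the pointer to follow in a non-leaf node is just a binary search over the node's keys:

```python
from bisect import bisect_right

class Node:
    """Hypothetical B+-tree node: n-1 sorted keys and n pointers."""
    def __init__(self, keys, pointers, is_leaf):
        self.keys = keys          # [K1, ..., Kn-1], with K1 < K2 < ... < Kn-1
        self.pointers = pointers  # [P1, ..., Pn]
        self.is_leaf = is_leaf    # at leaves, pointers lead to records/buckets

def child_for(node, search_key):
    """Choose which pointer to follow in a non-leaf node: keys below K1 go
    to P1, keys in [Ki, Ki+1) go to Pi+1, and keys >= Kn-1 go to Pn."""
    return node.pointers[bisect_right(node.keys, search_key)]
```

Repeating `child_for` from the root down to a leaf is the whole descent; one convention (equal keys go right) is assumed here.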
## Example of a B+-Tree

•   Leaf nodes must have between 2 and 4 values (⌈(n-1)/2⌉ and n-1, with n = 5).
•   Non-leaf nodes other than the root must have between 3 and 5 children (⌈n/2⌉ and n, with n = 5).
•   The root must have at least 2 children.
## Queries on B+-Trees

•   The difference (between the height of a B+-tree and that of a binary tree over the same keys) is significant, since every node access may need a disk I/O, costing around 20 milliseconds.
## B+-Tree File Organization

•   The index-file degradation problem is solved by using B+-tree indices. The data-file degradation problem is solved by using B+-tree file organization.
•   The leaf nodes in a B+-tree file organization store records instead of pointers.
•   Since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less than the number of pointers in a non-leaf node.
•   Leaf nodes are still required to be half full.
•   Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+-tree index.

## B-Tree Index Files
•   Similar to a B+-tree, but a B-tree allows search-key values to appear only once; this eliminates redundant storage of search keys.
•   Search keys in non-leaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a non-leaf node must be included.
•   Generalized B-tree leaf node.
•   Generalized B-tree non-leaf node: the pointers Bi are the bucket or file-record pointers.
## B-Tree Index Files (Cont.)
•   May use fewer tree nodes than a corresponding B+-tree.
•   Sometimes it is possible to find a search-key value before reaching a leaf node.
•   Only a small fraction of all search-key values are found early, however.
•   Non-leaf nodes are larger, so fan-out is reduced; thus B-trees typically have greater depth than the corresponding B+-tree.
•   Insertion and deletion are more complicated than in B+-trees.
•   Implementation is harder than for B+-trees.
•   Typically, the advantages of B-trees do not outweigh the disadvantages.
## Index Sequential

Index (value → block number): Dumpling → 1, Harty → 2, Texaci → 3.

Data file:

•   Block 1: Becker, Dumpling
•   Block 2: Getta, Harty
•   Block 3: Mobile, Sunoci, Texaci
## Indexed Sequential: Two Levels

[Figure: a two-level indexed sequential structure. A top-level index maps the highest key value in each second-level index block (e.g. 150, 385, 536, 678, 805) to that block's number; each second-level block in turn maps key values (e.g. 150, 251, 455, 480, 536, 605, 678, 705, 710, 785, 791, 805) to data blocks.]
## Indexed Random

•   Key values of the physical records are not necessarily in logical sequence.
•   The index may be stored and accessed with the Indexed Sequential Access Method.
•   The index has an entry for every database record, in ascending order, so the index keys are in logical sequence; the database records themselves are not necessarily in ascending sequence.
•   The access method may be used for both storage and retrieval.
## Indexed Random: Example

Index (value → block number): Becker → 1, Dumpling → 3, Getta → 2, Harty → 1.

Data file:

•   Block 1: Becker, Harty
•   Block 2: Getta
•   Block 3: Dumpling
## B-Tree Example

[Figure: a B-tree whose root holds the keys F, P, and Z; its children hold (B, D, F), (H, L, P), and (R, S, Z), which in turn lead to data records such as Aces, Boilers, Cars, Devils, Flyers, Hawkeyes, Hoosiers, Minors, Panthers, and Seminoles.]
## Inverted

•   Key values of the physical records are not necessarily in logical sequence.
•   This access method is better suited to retrieval than to storage.
•   An index may be built for every field to be inverted.
•   Access efficiency depends on the number of database records, the number of index levels, and the storage allocated for the index.
## Inverted: Example

[Figure: an inverted index on course number over student records (Becker cs201, Dumpling ch145, Getta ch145, Harty cs623, Mobile cs623). The index maps each course value to the records containing it, e.g. CH 145 → 101, 103, 104; CS 201 → 102; CS 623 → 105, 106.]
## Direct

•   Key values of the physical records are not necessarily in logical sequence.
•   There is a one-to-one correspondence between a record key and the physical address.
•   May be used for both storage and retrieval.
•   Access efficiency is always 1.
•   Storage efficiency depends on the density of keys.
•   No duplicate keys are permitted.
## So Far

•   So far, when we do runtime analysis, we give each operation one time unit.
•   In effect, we have been assuming that all operations are close enough in running time to be treated as the same.
•   This, of course, is not realistic: things like hard drives and networking are much slower than anything we compute inside the machine.
## Why the Processor and Main Memory Are Good

•   A processor can do about 2.5 billion instructions per second on a higher-end home PC these days.
•   Data stored in main memory is accessible at a speed that matches the processor's.
•   Imagine that we are storing a tree of 100 million elements (say, the number of bank transactions in a given month or year).
## Continued

•   Even if it takes 20 CPU instructions (an overestimate) to traverse a single node of a binary search tree (accessing and processing its data), we can still access 125 million nodes per second.
•   That is all of the elements of a completely linear tree, 1.25 times per second.
•   Imagine that it takes 32 bytes to represent a key into the tree (what we order on) and 1 KB to store the data; then we need roughly 100,000,000,000 bytes of data, or about 100 GB of RAM, to run this procedure.
Continued
• Of course, 400 MB RAM is not out of
question, but that leaves no other RAM to
run anything else (like the OS perhaps)
among other things
• So, processor/main memory is fast, but
not very practical, storing 100 GB of data
on a hard drive is nothing these days,
however
## Hard Drives

•   While processor speeds go up rapidly, hard drives mostly grow in capacity, not speed.
•   Most drives today run at 7,200 RPM.
•   To get data, we may have to wait an average of half a rotation, about 4.1 ms.
•   So we can do about 250 accesses per second.
•   Remember the processor? That was 125 million accesses per second.
## Rough Comparisons

•   Based on our rough numbers, a piece of data in main memory can be accessed about 500,000 times faster than data on a hard drive.
•   But the reality is that main memory is very expensive, while hard drives are cheap and hold lots of data.
•   So we want to use hard drives, but we need a way to improve the parts of the runtime that are slow for them.
## Past BSTs

•   With our past BSTs, even the good balanced trees, we would have to do, at best, an average of O(log n) node accesses to find an element.
•   log2 100,000,000 ≈ 26.
•   So, in a balanced BST, we would do about 26 node accesses in the worst case.
•   That is a nearly immeasurable amount of time in main memory, but it is about 1/10 of a second from a hard drive.
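The arithmetic above is easy to check (Python; the ~250 disk accesses per second comes from the hard-drive estimate earlier):

```python
import math

n = 100_000_000                       # elements in the balanced BST
accesses = math.ceil(math.log2(n))    # worst-case node accesses: log2(1e8) ≈ 26.6
disk_seconds = accesses / 250         # at ~250 disk accesses per second

print(accesses)       # 27
print(disk_seconds)   # ≈ 0.1 s, the "1/10th of a second" above
```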
## Goal

•   Our goal is to build a tree in which the number of node accesses needed to find an element is greatly reduced.
•   If we can do a lot of heavy, fast calculation in exchange for very few disk operations, we will have improved the runtime greatly.
•   It is okay to do a lot of calculation: in the time one disk access takes (about 4 ms), the processor can execute on the order of 10 million instructions.
## Height

•   The biggest problem is tree height.
•   Think about what happens in a BST: access a node, decide which way to go, repeat in the chosen direction.
•   We have to do this for every node on the path, which can cause problems.
•   We want a tree with a smaller height; at each level we access the disk only once, so fewer levels means fewer accesses.
•   To reduce height, there is a relatively simple solution: increase the branching factor.
## The Good Ol' N-ary Tree

•   We almost always talk about binary trees, for a good reason: in main memory, who cares what the height is? We have plenty of speed, and O(log n) is good enough.
•   Hidden in that is a detail: log n is really log2 n.
•   In a trinary tree the runtime is also O(log n), but (as with constants) we left out the base, and it really runs in log3 n time.
•   This may not make a huge difference in main memory, but for disk access it is about 10 fewer accesses, or roughly 0.04 s saved per lookup (a huge improvement).
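Plugging the numbers into the change-of-base formula shows how the branching factor shrinks the height (Python; 2 and 3 are the bases above, and 200 is a disk-realistic fan-out used as an illustration):

```python
import math

n = 100_000_000
# Worst-case node (disk) accesses for each branching factor m: ceil(log_m n).
heights = {m: math.ceil(math.log(n, m)) for m in (2, 3, 200)}
print(heights)   # {2: 27, 3: 17, 200: 4}

# Binary vs. trinary: 10 fewer accesses; at ~4 ms each, ~0.04 s saved.
print((heights[2] - heights[3]) * 0.004)
```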
## The B-Tree

•   A B-tree is an n-ary tree with all the data collected at the leaf level.
•   An n-ary tree works like a BST; we just do a slightly more complicated calculation to figure out which path to take.
•   We also need to guarantee that the n-ary tree stays balanced, or it may degrade toward a regular BST. Good balancing is the first property of a B-tree.
## B-Tree Properties

•   Like a red-black tree, our n-ary tree has certain rules to follow:
    1. All data is stored at the leaf level.
    2. Non-leaf nodes store up to M-1 keys to guide the search to the right path.
    3. The root is a leaf or has 2 to M children.
    4. All non-leaf nodes (except the root) have between M/2 and M children.
    5. All leaves are at the same depth and are between half full and full, except when the root is a leaf.
## What It Looks Like

[Figure: a 5-ary B-tree. The root holds the keys 15, 30, and 50; its leaves hold (1, 10, 12), (16, 21), (30, 41, 42, 43), and (50, 51).]

Note that even though there are 11 elements in the tree, a lookup involves only 2 node accesses, versus about log2 11 ≈ 3.5 for a binary tree.
## Better?

•   This is much better: we have decreased the number of nodes accessed and decreased the height, making everything shallower.
•   4-ary isn't really the way to go here; in practice the branching factor ends up much bigger, say 200-ary, but either way it is still better than binary.
•   Note: we have to assume that all the information in one node is stored contiguously on disk, so the hard drive doesn't have to reposition itself (slow!) partway through reading a node.
## B-Tree

•   Now we're ready to get into B-trees proper.
•   We need to know how to add to and remove from the B-tree; as always, we'll focus on insertion first.
## Summary of Why

•   Writing to the hard disk is an expensive operation.
•   In a large system, we have to read from and write to the disk often, and we want to minimize the number of expensive operations for a good runtime.
•   We need a structure that takes the cost of disk access into account and accesses the disk as little as possible.
## Summary of How

•   When the hard disk is accessed, a fixed-size block is transferred at a time.
•   We'll make each node of the tree one of these blocks.
•   For efficiency's sake, we want to maximize the amount of data in these nodes/blocks.
•   With this setup, we can define quite a few algorithms that deal with large-scale blocks of data, such as the one governing the structure of a B-tree.
## The Rules

1. All data is stored at the leaf level.
2. Non-leaf nodes store keys that guide the search along the path to the leaf holding the element.
3. The root is a leaf or has 2 to M children.
4. All non-leaf nodes have between M/2 and M children (at least half full).
5. All leaves are at the same level.
6. All leaves are always at least half full, unless the root is itself a leaf.
## Starting Out

•   So what do we really have in an empty B-tree?
•   Essentially, one empty storage container, which is the size of a block on the hard disk (say, b bytes) and can store b/n elements of size n.
•   So the root starts out as a leaf, which just stores data.
## Inserting the First Elements

•   The root/leaf has a set amount of space that we can keep adding elements to.
•   We may want to keep it sorted for easier searching.
•   In fact, is there any reason not to keep a leaf sorted?
•   No, because the time it takes to order the elements is minuscule compared to how long it takes to write the result to the hard drive.
•   So, while we have space in the leaf, we add each new element in sorted order.
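Keeping a leaf sorted on insert can be sketched with Python's `bisect` module (a plain list stands in for the block; `CAPACITY` is a made-up block size):

```python
from bisect import insort

CAPACITY = 4          # hypothetical number of key-object pairs per block
leaf = []             # the root, currently a leaf

for key in (31, 7, 19, 2):
    if len(leaf) < CAPACITY:
        insort(leaf, key)   # insert in sorted position; cheap next to a disk write

print(leaf)   # [2, 7, 19, 31]
```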
## Notes: Element?

•   This is more of a databases topic, but what is normally stored in a B-tree is a reference to another object or piece of data, which may be, for instance, a location somewhere else on the disk.
•   We also need to store the key that is being searched on.
•   We are keeping, then, a {key, object} pair, where the key is simply a search item (like an ID number) and the object is just a pointer to something else.
•   We'll learn lots more about {key, object} pairs when we start talking about hashtables.
## Uh Oh! Full!

•   It works fine to keep adding elements, but as we said, each block is a set amount of space. What happens when we fill it up with key-object pairs?
•   Recall one of the rules of a B-tree: all leaves are always at least half full.
•   Knowing this, we can take that original, full leaf node and split it into two nodes.
## Notes: Split?

•   Every block on the hard disk is the same size, and each element (key, object) occupies the same space.
•   If we split the elements down the middle, we create two half-full nodes.
## Notes: Full?

•   Just when does a node become full? More importantly, when do we split a leaf node?
•   The popular, common way: whenever you try to add an element to a completely full node, you split the node and then add the element again.
•   Another way I like is to check whether a split is needed after an addition; that way, you don't have to keep track of the element still waiting to be inserted.
## The Split

•   When a node gets full, we split it into two nodes.
•   But what happens then? This is a tree; we need connections to our leaf nodes.
•   In the first case of a split, when the root is the leaf and it splits, we end up with two leaves, which gives us reason to make a new non-leaf node that connects the two leaf nodes together.
## Notes: The Non-Leaf Node

•   A leaf node stores key-object pairs in sorted order. What would the non-leaf node store?
•   One of our rules: non-leaf nodes store keys that guide the search along the path to the leaf holding the element.
•   So non-leaf nodes store keys; they also need to store links to other nodes.
## After the Split

•   After the split of the leaf node at the root level, we need to make a new non-leaf node and make it the root.
•   But what do we do about the keys?
•   We can copy up either the max of the left half or the min of the right half of the two new half-full leaves; then, given a key we are searching for, we can tell which subtree to traverse.
## Recap

•   If a leaf node fills up, split it into two leaf nodes.
•   If the leaf node is the root, create a new non-leaf node as the root.
•   Copy a key up into the parent node.
•   Reference the two new subtrees on either side of the key copied up into the parent node.
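The recap above can be sketched as follows (plain Python lists stand in for disk blocks, and the dict layout of the new root is hypothetical; here the copied-up key is the min of the right half):

```python
def split_leaf(leaf):
    """Split a full leaf in half; return (left, right, key_to_copy_up)."""
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    return left, right, right[0]

# First split ever: the root was a leaf, so a new non-leaf root is created
# referencing the two new subtrees on either side of the copied-up key.
left, right, key = split_leaf([2, 7, 19, 31])
root = {"keys": [key], "children": [left, right]}
print(root)   # {'keys': [19], 'children': [[2, 7], [19, 31]]}
```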
## Notes: Splitting the Root

•   Any time you split the root, you make a brand-new empty node.
•   This is the only time you ever create a node with fewer than half its elements (normally we move half of another node into the new node further down the tree).
•   Keep in mind that the rules have special allowances for the root.
•   You'll also have to fix the root reference.
## What Else Can Happen?

•   What if a leaf node fills up and it is not the root?
•   Well, this means it already has a parent, so split the node into two and copy the key up into the right spot in the parent node.
•   This may involve inserting a link in the middle of the non-leaf node's elements.
•   The non-leaf node can also fill up at this point.
## What Else Continued?

•   What if a non-leaf node fills up?
•   We'll need to split the node in two, but how can we possibly do this? Wouldn't we have to overlap pointers?
•   Not necessarily.
•   What we'll do here, instead of copying up a key, is move one up to the parent non-leaf node.
## More Full Non-Leaf Nodes

•   We'll move up the middle key.
•   This means that the pointer to the left of the old middle becomes the rightmost pointer of the left child, and the pointer to the right of the old middle becomes the leftmost pointer of the right child.
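A sketch of that non-leaf split (tuples of key lists and child lists are a made-up representation; note the middle key moves up rather than being copied, unlike a leaf split):

```python
def split_internal(keys, children):
    """Split a full non-leaf node: the middle key MOVES up, the pointer to
    its left becomes the rightmost pointer of the left half, and the pointer
    to its right becomes the leftmost pointer of the right half."""
    mid = len(keys) // 2
    up_key = keys[mid]
    left = (keys[:mid], children[:mid + 1])
    right = (keys[mid + 1:], children[mid + 1:])
    return left, up_key, right
```

For example, splitting keys [10, 20, 30] with children [A, B, C, D] moves 20 up, leaving ([10], [A, B]) and ([30], [C, D]).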
## Deletion

•   A very common deletion algorithm in a B-tree is lazy deletion.
•   This means we just go to the spot, delete the element, and don't check whether the rules still hold.
•   Of course, this can ruin some of the balance of the tree and make it branch out a little differently than expected, and we may now have nodes that are less than half full.
•   The only case we must handle is removing all the elements of a node, which forces us to change the parent node.
## Notes: Deleting a Key in a Non-Leaf Node

•   Even when we delete an element at the leaf level, we don't need to worry about removing its key further up in the tree (if it appears there).
•   The reason: keys are used only to decide which direction to follow to reach an element. We could even have chosen keys that are not actual elements of the tree; it is just easier to reuse element keys.
## Deletion the Real Way

•   We go down to the leaf level and delete the element we are looking for.
•   Then, if the node is now less than half full, we have choices to make.
•   If an adjacent node has enough elements to keep both nodes at least half full, we can copy some across.
•   If there are not enough elements, we merge the two nodes together.
## Case 1: Copy Across

•   If we copy elements across from another node, we need to fix the key in the parent node, since it is probably off by a few elements now.
•   Thus, when we are done with the copy across, we return the new key to the parent node so it can adjust its search keys.
## Case 2: Merge

•   In the merge case, we simply move all the elements of two nodes into one.
•   The biggest problem is that the parent now has a key with no valid pointer anymore.
•   So we now have to delete a key in the parent node.
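Both deletion cases can be sketched together (leaf nodes as sorted Python lists; `min_fill` is the half-full threshold, a made-up parameter; the right sibling is assumed):

```python
def fix_underflow(node, sibling, min_fill):
    """After a deletion leaves `node` under half full, either redistribute
    elements from an adjacent leaf (Case 1) or merge the two leaves (Case 2)."""
    total = len(node) + len(sibling)
    if total >= 2 * min_fill:
        # Case 1: copy across, then hand the new separator key to the parent.
        combined = sorted(node + sibling)
        mid = total // 2
        return "copy", combined[:mid], combined[mid:], combined[mid]
    # Case 2: merge; the parent must now delete the separating key.
    return "merge", sorted(node + sibling)
```

With `min_fill = 2`, `fix_underflow([5], [10, 20, 30], 2)` redistributes into [5, 10] and [20, 30] with 20 as the new separator, while `fix_underflow([5], [10, 20], 2)` merges into [5, 10, 20].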
## Deleting a Key in a Non-Leaf Node

•   If you delete a key in a non-leaf node, that node may in turn drop below half full, which breaks the rules.
•   In this case, we can again either redistribute (copy keys across) or merge two nodes together.
## Notes: Doing a Good Deletion

•   If you want to use this better deletion algorithm, keep in mind one thing: a copy across or a merge has to access additional page(s) from the hard disk, which is time-consuming.
•   A copy across makes more sense than a merge (it avoids deleting a node altogether), so we may want to look at the nodes on both sides of the underfull node.
## Notes: Implementing the Leaf Level

•   A common way the leaf level is implemented is to have each leaf link to the next leaf in the tree.
•   This makes more sense from a databases standpoint than from a pure data-structures view.
## B+ Trees?

•   We've actually been discussing B+ trees.
•   B-trees are slightly different: data is allowed at the non-leaf level (among a couple of other small differences).
•   However, B+ trees are so common in practice that they are often just called B-trees in many places (including the Weiss book!).
•   There are other variations, like the B* tree, which keeps the nodes always at least 2/3 full.
