STR: A Simple and Efficient
Algorithm for R-Tree Packing
Scott T. Leutenegger, Jeffrey M. Edgington, Mario A. Lopez
Mathematics and Computer Science Department
University of Denver
Outline
1. Introduction
2. Overview of R-tree and Packing Algorithms
3. Experimental Methodology
4. Results
5. Conclusions
Introduction or Overview
R-tree
• The R-tree is a common indexing technique for
spatial and multi-dimensional databases
• The R-tree is a dynamic structure: its contents
can be modified without reconstructing the
entire tree
• An R-tree stores the bounding boxes of arbitrary
geometric objects (points, polygons, or more
complex objects) and can be used to determine
which objects intersect a given query region
• Each geometric object is represented by its
minimum bounding rectangle (MBR); the set of
stored objects can change over time through
insertions and deletions
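As a minimal sketch of the MBR idea, assuming 2-D objects given as lists of (x, y) points and rectangles stored as (xmin, ymin, xmax, ymax) tuples (both conventions are choices made for this illustration, not something the slides specify):

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), an assumed layout


def mbr(points: Iterable[Point]) -> Rect:
    """Smallest axis-aligned rectangle enclosing all the given points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a: Rect, b: Rect) -> bool:
    """True if two closed rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# A triangle is indexed by its bounding box, not by its exact shape.
triangle = [(1.0, 1.0), (4.0, 2.0), (2.0, 5.0)]
box = mbr(triangle)
print(box)                                        # (1.0, 1.0, 4.0, 5.0)
print(intersects(box, (3.0, 4.0, 6.0, 6.0)))      # True
print(intersects(box, (5.0, 5.0, 6.0, 6.0)))      # False
```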
• Each node of the R-tree stores a maximum of n entries
• Each entry consists of a rectangle R and a pointer P
Figure 1: A sample R-tree
• Every path down through the tree corresponds to a
sequence of nested rectangles
• To perform a query Q, retrieval is accomplished by
a simple recursive procedure
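The recursive retrieval can be sketched as follows. This is an illustrative in-memory version, not the authors' implementation: the `Node` class, the `is_leaf` flag, and the tuple layout (xmin, ymin, xmax, ymax) are assumptions made for the example, and a disk-based R-tree would follow page numbers rather than object references.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), an assumed layout


def intersects(a: Rect, b: Rect) -> bool:
    """True if two closed rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    """Hypothetical R-tree node: a list of (rectangle R, pointer P) entries."""
    is_leaf: bool
    entries: List[Tuple[Rect, Any]] = field(default_factory=list)


def search(node: Node, query: Rect) -> list:
    """Return the pointers of all leaf entries whose rectangle intersects the query."""
    hits = []
    for rect, pointer in node.entries:
        if not intersects(rect, query):
            continue                      # prune this entry and everything below it
        if node.is_leaf:
            hits.append(pointer)          # pointer identifies a data object
        else:
            hits.extend(search(pointer, query))  # pointer is a child node
    return hits


leaf_a = Node(True, [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")])
leaf_b = Node(True, [((10, 10, 11, 11), "c")])
root = Node(False, [((0, 0, 3, 3), leaf_a), ((10, 10, 11, 11), leaf_b)])

print(search(root, (0.5, 0.5, 2.5, 2.5)))  # ['a', 'b']
print(search(root, (5, 5, 6, 6)))          # []
```

Only subtrees whose rectangle intersects the query are visited, which is why the number of nodes touched depends on how the rectangles were grouped.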
Packing Algorithms
• Preprocessing (packing algorithms to build an
R-tree) is particularly reasonable when the
data is fairly static or available a priori
• These algorithms cluster rectangles in an attempt
to minimize the number of nodes visited
while processing a query (improved query
time and optimal space utilization)
*Assume*
• Exactly one node fits per disk page
• The data file consists of r rectangles
• Each R-tree node can hold n rectangles
General Algorithm
1. Preprocess: order the r rectangles into
⌈r/n⌉ consecutive groups of n rectangles
(the last group may hold fewer than n)
2. Create the leaf-level nodes, one per group
3. Output the (MBR, page-number) pair
for each leaf-level node into a temporary
file
4. Recursively pack these MBRs into nodes at
the next higher level, until the root node
is created
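The general algorithm can be sketched in a few lines, with the ordering step passed in as a function. This is a simplified in-memory illustration under stated assumptions: `pack`, `mbr_of`, and `nearest_x` are names invented here, a Python list stands in for the disk pages and the temporary file, and rectangles are (xmin, ymin, xmax, ymax) tuples.

```python
from typing import Callable, List, Sequence, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax), an assumed layout
Entry = Tuple[Rect, int]                   # (MBR, object id or page number)


def mbr_of(rects: Sequence[Rect]) -> Rect:
    """Bounding rectangle of a group of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def nearest_x(entries: List[Entry]) -> List[Entry]:
    """Example ordering: ascending x-coordinate of each rectangle's center."""
    return sorted(entries, key=lambda e: (e[0][0] + e[0][2]) / 2)


def pack(entries: List[Entry], n: int,
         order: Callable[[List[Entry]], List[Entry]]) -> Tuple[List[List[Entry]], int]:
    """Pack entries bottom-up; return (pages, root page number)."""
    if not entries or n < 2:
        raise ValueError("need at least one entry and a node capacity of 2 or more")
    pages: List[List[Entry]] = []          # pages[i] is the node stored on page i
    level = list(entries)
    while True:
        ordered = order(level)             # step 1: order the rectangles
        next_level: List[Entry] = []
        for i in range(0, len(ordered), n):
            group = ordered[i:i + n]       # step 2: one node per run of n entries
            pages.append(group)
            # step 3: remember (MBR, page-number) for the level above
            next_level.append((mbr_of([rect for rect, _ in group]), len(pages) - 1))
        if len(next_level) == 1:           # a single node at this level is the root
            return pages, len(pages) - 1
        level = next_level                 # step 4: pack the MBRs one level up


xs = [7, 2, 9, 0, 4, 1, 8, 3, 6, 5]
data = [((x, 0, x + 1, 1), x) for x in xs]   # ten unit squares, ids equal to x
pages, root = pack(data, 3, nearest_x)
print(len(pages), root)                      # 7 6
```

With 10 rectangles and n = 3, the sketch builds 4 leaves, 2 internal nodes, and 1 root. Swapping `nearest_x` for another ordering function changes only how the groups are formed.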
Nearest-X Algorithm (NX)
• The rectangles are sorted by ascending x-coordinate
• Assume the x-coordinate used is that of the
rectangle’s center
• NX.txt
Hilbert Sort (HS)
*Hilbert Fractal Space Filling Curve*
• The basic Hilbert curve is defined on a 2×2 grid
• To derive a curve of order i, each vertex of
the basic curve is replaced by the curve of
order i-1, which may be appropriately rotated
and/or reflected
• The rectangles are sorted according to the
Hilbert value of their centers
• As a result, the rectangles placed on the same
node are close to each other in the linear
ordering
• The Hilbert value of a point is the length of the
Hilbert curve from the origin to the point
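For integer grid coordinates, the Hilbert value can be computed with the widely used bit-by-bit conversion shown below. This is a sketch under assumptions: the function name `hilbert_index` is invented here, coordinates are assumed to be integers in [0, 2^order), and the slides do not say which formulation the authors used.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve covering a 2**order by 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0            # which half in x at this level
        ry = 1 if y & s else 0            # which half in y at this level
        d += s * s * ((3 * rx) ^ ry)      # cells skipped in earlier quadrants
        if ry == 0:                       # rotate/reflect so the sub-curve lines up
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


# Order-1 curve on the basic 2x2 grid: (0,0) -> (0,1) -> (1,1) -> (1,0)
print([hilbert_index(1, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]])  # [0, 1, 2, 3]

# Sorting rectangle centers (here: hypothetical integer centers) by Hilbert value
centers = [(3, 0), (0, 0), (1, 1), (0, 3), (2, 2)]
print(sorted(centers, key=lambda c: hilbert_index(2, c[0], c[1])))
```

On a grid, the returned index counts the cells visited before reaching (x, y), which plays the role of the curve length from the origin.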
• HS.txt
• The main difference between the Hilbert R-
tree and the General R-tree is that non-leaf
nodes also contain information about the
LHVs (Largest Hilbert Value of the children):
(MBR, page-number, LHV)
*Extension to arbitrary floating point values*
• Conceptual representation:
2^sizeof(Exponent) + sizeof(Mantissa) bits
• The first bits of the x- and y-coordinates of
a point determine which quadrant contains it;
successive bits determine which successively
smaller subquadrants contain the point
Sort-Tile-Recursive (STR)
• S: a k-dimensional data set of r hyper-rectangles
• A hyper-rectangle is defined by k intervals of the
form [A_i, B_i] and is the locus of points whose i-th
coordinate falls inside the i-th interval, for all
1 ≤ i ≤ k
• k = 1: B-tree
• k = 2
• The basic idea is to “tile” the data space
using √(r/n) vertical slices so that each slice
contains enough rectangles to pack roughly
√(r/n) nodes
• The number of leaf-level nodes is ⌈r/n⌉
• Sort the rectangles by x-coordinate; a slice
consists of a run of √(r/n)·n = √(r·n)
consecutive rectangles from the sorted list
• Then sort the rectangles of each slice by
y-coordinate and pack them into nodes by
grouping them into runs of length n
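The two-dimensional tiling step can be sketched as below. It is an illustration rather than the authors' code: the function names are invented, rectangles are assumed to be (xmin, ymin, xmax, ymax) tuples sorted by their centers, and the slice count is rounded up to an integer so that the sketch works when r/n is not a perfect square.

```python
import math
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), an assumed layout


def center(rect: Rect, axis: int) -> float:
    """Center coordinate along axis 0 (x) or 1 (y)."""
    return (rect[axis] + rect[axis + 2]) / 2


def str_leaf_groups(rects: List[Rect], n: int) -> List[List[Rect]]:
    """Group rectangles into leaf nodes of at most n entries using 2-D STR tiling."""
    r = len(rects)
    if r == 0:
        return []
    num_leaves = math.ceil(r / n)                   # leaf-level nodes
    num_slices = math.ceil(math.sqrt(num_leaves))   # vertical slices
    slice_size = num_slices * n                     # rectangles per slice
    by_x = sorted(rects, key=lambda rc: center(rc, 0))
    groups: List[List[Rect]] = []
    for i in range(0, r, slice_size):
        # within a slice, order by y and cut into runs of n
        vertical_slice = sorted(by_x[i:i + slice_size], key=lambda rc: center(rc, 1))
        for j in range(0, len(vertical_slice), n):
            groups.append(vertical_slice[j:j + n])
    return groups


# 36 points on a 6x6 grid, stored as degenerate rectangles, with n = 4
points = [(x, y, x, y) for x in range(6) for y in range(6)]
leaves = str_leaf_groups(points, 4)
print(len(leaves))   # 9 leaves: 3 slices of 12 points, each cut into 3 runs of 4
```

In this example every leaf covers a compact 2×2 block of grid points, which is the tiling effect the slides describe.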
• k > 2
• Sort according to the first coordinate
• Divide into (r/n)^(1/k) slabs, where a slab
consists of a run of n·(r/n)^((k-1)/k) = n^(1/k)·r^(1-1/k)
consecutive hyper-rectangles
• Process each slab recursively using the
remaining k-1 coordinates
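A worked instance of the slab arithmetic may help. The values r = 800,000, n = 100, and k = 3 are hypothetical, chosen only so that every root comes out as an integer:

```latex
\begin{align*}
\frac{r}{n} &= \frac{800{,}000}{100} = 8000 \quad \text{(leaf-level nodes)} \\
\left(\frac{r}{n}\right)^{1/k} &= 8000^{1/3} = 20 \quad \text{(slabs)} \\
n \left(\frac{r}{n}\right)^{(k-1)/k} &= 100 \cdot 8000^{2/3} = 100 \cdot 400 = 40{,}000 \quad \text{(hyper-rectangles per slab)} \\
20 \cdot 40{,}000 &= 800{,}000 = r
\end{align*}
```

Each slab of 40,000 hyper-rectangles is then a two-dimensional problem with 400 leaves: 20 slices of 2,000 rectangles, each slice packed into 20 nodes, for 20 · 20 · 20 = 8000 leaves in total.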
Visual results
Three Packing Algorithms
Differ only in how the rectangles are
ordered at each level
Experimental Methodology
• Goal: provide a solid experimental comparison
of the algorithms through actual R-tree
implementations over a wide range of data
using both “real world” and synthetic data
sets
• Primary comparison metric: the number of
disk accesses required to satisfy a query of a
given size
• Assume an LRU buffer management routine
to simplify the parameter space of our
experiments
• Use a raw disk partition to easily vary the
actual buffer size without reconfiguring the
OS or hardware
LRU: discards Least Recently Used items first
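The comparison metric, disk accesses under an LRU buffer, can be illustrated with a small simulator. This is a sketch, not the experimental harness from the study: the `LRUBuffer` class and its fields are invented for the example, and page contents are ignored since only hits and misses matter for the count.

```python
from collections import OrderedDict


class LRUBuffer:
    """Counts disk accesses for a stream of page requests under LRU replacement."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("buffer must hold at least one page")
        self.capacity = capacity
        self.pages: "OrderedDict[int, bool]" = OrderedDict()  # oldest entry first
        self.disk_accesses = 0

    def access(self, page_number: int) -> None:
        if page_number in self.pages:
            self.pages.move_to_end(page_number)   # hit: mark as most recently used
            return
        self.disk_accesses += 1                   # miss: the page is read from disk
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)        # evict the least recently used page
        self.pages[page_number] = True


buf = LRUBuffer(capacity=2)
for page in [1, 2, 1, 3, 2, 1]:
    buf.access(page)
print(buf.disk_accesses)   # 5: only the second request for page 1 is a hit
```

Feeding the page numbers visited by each query into such a buffer gives the number of disk accesses for a given buffer size.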
• Data set size
Increasing it affects R-tree performance in two ways:
① Increases the depth of the R-tree
② Decreases the percentage of the data set that
will fit in the buffer
• Different types of data set
1. GIS: 53,145 line segments chosen from the
Long Beach data of the TIGER system
2. VLSI: a CIF data set of 453,994 rectangles
whose distribution is highly skewed, both in
location and in size
3. CFD: a data set with 52,510 nodes of varying
density
4. Synthetic: Uniformly distributed data sets
containing between 10,000 and 300,000 squares
Results
• All results are obtained from R-trees with
100 rectangles per node
• The results for the NX algorithm are not
included in the figures since it is not
competitive
Results
• 2-D Synthetic Data
– For a buffer of size 10, HS requires 31-42%
more disk accesses than STR for point data, and
26-32% more disk accesses for region data of
density 5
– For a buffer of size 250, HS requires 33-41%
more disk accesses than STR for point data, and
26-32% more disk accesses for region data of
density 5
Results
• 2-D Synthetic Data
– For point data, HS requires 6-22% more disk
accesses than STR
– For region data of density 5, HS requires 6-16%
more disk accesses than STR
• NOTE: as the query region size increases, the
difference between STR and HS becomes
smaller
• In the extreme case, all R-tree packing schemes
exhibit the same performance, since all leaves
need to be examined
Results
• GIS Tiger Data
– The HS algorithm requires 20-50% more disk
accesses than STR. The relative difference
increases as the buffer size decreases
– The STR algorithm produces significantly smaller
areas than the other two
Results
• VLSI
– The HS algorithm performs slightly better than
STR for point queries, by 3-11%
Results
• Computational Fluid Dynamics (CFD)
– For point queries, the STR algorithm requires
significantly fewer disk accesses than HS,
especially for small buffer sizes
Conclusions
• No single algorithm is best for all data sets
• The NX algorithm is not competitive
• STR outperforms HS for mildly skewed or
uniformly distributed data
• Choosing a packing algorithm is a toss-up
between STR and HS for highly skewed data,
particularly for region queries
• The importance of choosing a packing
algorithm is diminished as either the
query size or the buffer size increases