Posted on: 3/9/2011. Public Domain.
STR: A Simple and Efficient Algorithm for R-Tree Packing
Scott T. Leutenegger, Jeffrey M. Edgington, Mario A. Lopez
Mathematics and Computer Science Department, University of Denver

Outline
1. Introduction
2. Overview of R-tree and Packing Algorithms
3. Experimental Methodology
4. Results
5. Conclusions

Introduction or Overview

R-tree
• The R-tree is a common indexing technique for spatial and multi-dimensional databases
• The R-tree is a dynamic structure: its contents can be modified without reconstructing the entire tree
• An R-tree can be used to determine which objects intersect a given query region by storing the bounding boxes of arbitrary geometric objects, such as points, polygons, or more complex objects
• An arbitrary geometric object is represented by its minimum bounding rectangle (MBR); the indexed data can change over time through insertions and deletions
• Each node of the R-tree stores a maximum of n entries
• Each entry consists of a rectangle R and a pointer P
• Every path down through the tree corresponds to a sequence of nested rectangles
• To perform a query Q, retrieval is accomplished by a simple recursive procedure
Figure 1: A sample R-tree

Packing Algorithms
• Preprocessing (using a packing algorithm to build the R-tree) is particularly reasonable when the data is fairly static or available a priori
• These algorithms cluster rectangles in an attempt to minimize the number of nodes visited while processing a query (improved query time and optimal space utilization)

Assume:
• Exactly one node fits per disk page
• The data file consists of r rectangles
• Each R-tree node can hold n rectangles

General Algorithm
1. Preprocess: order the r rectangles in ⌈r/n⌉ consecutive groups of n rectangles (the last group may hold fewer than n)
2. Create the leaf level nodes
3. Load and output the (MBR, page-number) for each leaf level node into a temporary file
4. Recursively pack these MBRs into nodes at the next higher level
5. Repeat until the root node is created

Nearest-X Algorithm (NX)
• The rectangles are sorted by ascending x-coordinate
• Assume: the x-coordinate of the rectangle's center is used
• NX.txt

Hilbert Sort (HS)
Hilbert Fractal Space Filling Curve
• The basic Hilbert curve is defined on a 2×2 grid
• To derive a curve of order i, each vertex of the basic curve is replaced by the curve of order i-1, which may be appropriately rotated and/or reflected

Hilbert Sort (HS)
• The rectangles are sorted according to the Hilbert value of their centers
• The final result is that the set of rectangles on the same node will be close to each other in the linear ordering
The Hilbert value of a point is the length of the Hilbert curve from the origin to the point
• HS.txt
• The main difference between the Hilbert R-tree and the general R-tree is that non-leaf nodes also contain information about the LHV (largest Hilbert value of the children): (MBR, page-number, LHV)

Hilbert Sort (HS)
Extension to arbitrary floating point values
• Conceptual representation: 2^sizeof(Exponent) + sizeof(Mantissa) bits
• The first bit of the x- and y-coordinates of a point determines which quadrant contains it
• Successive bits determine which successively smaller subquadrants contain the point

Sort-Tile-Recursive (STR)
• S: a k-dimensional data set of r hyper-rectangles
A hyper-rectangle is defined by k intervals of the form [Ai, Bi] and is the locus of points whose i-th coordinate falls inside the i-th interval, for all 1 ≤ i ≤ k
• k = 1: already handled by a regular B-tree

Sort-Tile-Recursive (STR)
• k = 2
• The basic idea is to "tile" the data space using √(r/n) vertical slices so that each slice contains enough rectangles to pack roughly √(r/n) nodes
• The number of leaf level nodes: r/n (rounded up)
• Sort the rectangles by x-coordinate; a slice consists of a run of √(r/n) · n = √(r·n) consecutive rectangles from the sorted list
• Then sort the rectangles of each slice by y-coordinate and pack them into nodes by grouping them into runs of length n

Sort-Tile-Recursive (STR)
• k > 2
• Sort according to the first coordinate
• Divide into (r/n)^(1/k) slabs, where a slab consists of a run of n · (r/n)^((k-1)/k) = n^(1/k) · r^(1-1/k) consecutive hyper-rectangles
• Process each slab recursively using the remaining k-1 coordinates

Visual results (figures)

Three Packing Algorithms
• They differ only in how the rectangles are ordered at each level

Experimental Methodology
• Goal: provide a solid experimental comparison of the algorithms through actual R-tree implementations over a wide range of data, using both "real world" and synthetic data sets
• Primary comparison metric: the number of disk accesses required to satisfy a query of a given size
• Assume an LRU buffer management routine to simplify the parameter space of our experiments
• Use a raw disk partition to easily vary the actual buffer size without reconfiguring the OS or hardware
LRU: discards the least recently used items first
• Data set size affects R-tree performance in two ways:
① Increases the depth of the R-tree
② Decreases the percentage of the data set that will fit in the buffer
• Different types of data set
1. GIS: 53,145 line segments chosen from the Long Beach data of the TIGER system
2. VLSI: a CIF data set of 453,994 rectangles whose distribution is highly skewed, both in location and in size
3. CFD: a data set with 52,510 nodes of varying density
4. Synthetic: uniformly distributed data sets containing between 10,000 and 300,000 squares

Results
• All results are obtained from R-trees with 100 rectangles per node
• The results for the NX algorithm are not included in the figures since it is not competitive

Results: 2-D Synthetic Data
– For a buffer of size 10, HS requires 31-42% more disk accesses than STR for point data, and 26-32% more disk accesses for region data of density 5
– For a buffer of size 250, HS requires 33-41% more disk accesses than STR for point data, and 26-32% more disk accesses for region data of density 5

Results: 2-D Synthetic Data
– For point data, HS requires 6-22% more disk accesses than STR
– For region data of density 5, HS requires 6-16% more disk accesses than STR

Results: 2-D Synthetic Data
• NOTE: As the query region size increases, the difference between STR and HS becomes smaller
• EXTREME: When all leaves need to be examined, all R-tree packing schemes exhibit the same performance

Results: GIS Tiger Data
– The HS algorithm requires 20-50% more disk accesses than STR. The relative difference increases as the buffer size decreases
– The STR algorithm produces significantly smaller areas than the other two algorithms

Results: VLSI
– The HS algorithm performs slightly better than STR for point queries, by 3%-11%

Results: Computational Fluid Dynamics
– For point queries, the STR algorithm requires significantly fewer disk accesses than HS, especially for small buffer sizes

Conclusions
• None of the algorithms is best for all data sets
• The NX algorithm is not competitive
• STR outperforms HS for mildly skewed or uniformly distributed data
• Choosing a packing algorithm is a toss-up between STR and HS for highly skewed data, particularly for region queries
• The importance of choosing a packing algorithm diminishes as either the query size or the buffer size increases
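
Sketch: 2-D STR packing
The 2-D STR procedure from the overview (sort by x, cut into vertical slices, sort each slice by y, pack runs of n) combined with the bottom-up General Algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names str_pack_level, mbr, and str_build_levels are hypothetical, rectangles are assumed to be (xmin, ymin, xmax, ymax) tuples ordered by their center coordinates, slice counts are rounded up, and nodes are kept in memory rather than written to disk pages.

```python
import math


def mbr(rects):
    """Minimum bounding rectangle of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def str_pack_level(rects, n):
    """Group rectangles into nodes of at most n entries using 2-D STR ordering.

    Assumption: rectangles are sorted by the coordinates of their centers.
    """
    if not rects:
        return []
    num_nodes = math.ceil(len(rects) / n)           # about r/n nodes
    num_slices = math.ceil(math.sqrt(num_nodes))    # about sqrt(r/n) slices
    slice_size = num_slices * n                     # about sqrt(r*n) rectangles

    def center_x(rect):
        return (rect[0] + rect[2]) / 2.0

    def center_y(rect):
        return (rect[1] + rect[3]) / 2.0

    by_x = sorted(rects, key=center_x)
    nodes = []
    for i in range(0, len(by_x), slice_size):
        # Within one vertical slice, order by y and cut into runs of length n.
        vertical_slice = sorted(by_x[i:i + slice_size], key=center_y)
        for j in range(0, len(vertical_slice), n):
            nodes.append(vertical_slice[j:j + n])
    return nodes


def str_build_levels(rects, n):
    """Pack bottom-up: levels[0] holds the leaves, levels[-1] holds the root."""
    if n < 2:
        raise ValueError("node capacity n must be at least 2")
    levels = []
    entries = list(rects)
    while True:
        nodes = str_pack_level(entries, n)
        levels.append(nodes)
        if len(nodes) <= 1:
            return levels
        # The MBRs of this level become the entries of the next higher level.
        entries = [mbr(node) for node in nodes]
```

For example, calling str_build_levels on the 16 points of a 4×4 grid with n = 4 yields four leaves that tile the grid into 2×2 blocks, plus a single root node above them. Swapping the ordering step for an x-only sort or a Hilbert-value sort would give the NX and HS variants, since the three algorithms differ only in how the rectangles are ordered at each level.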