STR: A Simple and Efficient Algorithm for R-Tree Packing
Scott T. Leutenegger, Jeffrey M. Edgington, Mario A. Lopez
Mathematics and Computer Science Department
University of Denver
                  Outline
1. Introduction

2. Overview of R-tree and Packing Algorithms

3. Experimental Methodology

4. Results

5. Conclusions
         Introduction or Overview
                  R-tree

• The R-tree is a common indexing technique for
  spatial and multi-dimensional databases

• The R-tree is a dynamic structure: its contents
  can be modified without reconstructing the
  entire tree
          Introduction or Overview
                   R-tree
• An R-tree can be used to determine which
  objects intersect a given query region by
  storing the bounding boxes of arbitrary
  geometric objects, such as points, polygons,
  or more complex objects

• An arbitrary geometric object is represented
  by its minimum bounding rectangle (MBR); the
  indexed set can change over time through
  insertions and deletions
           Introduction or Overview
                      R-tree

• Each node of the R-tree stores a
  maximum of n entries

• Each entry consists of a rectangle R and a
  pointer P

Figure 1: A sample R-tree
           Introduction or Overview
                      R-tree
• Every path down through the tree
  corresponds to a sequence of nested
  rectangles

• To perform a query Q, the retrieval is
  accomplished by using a simple recursive
  procedure

Figure 1: A sample R-tree
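
The recursive retrieval mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Node` class, the `(xmin, ymin, xmax, ymax)` tuple layout, and the function names are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumed layout for a rectangle: (xmin, ymin, xmax, ymax).

def intersects(a, b):
    """True if rectangles a and b overlap (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    is_leaf: bool
    # Each entry is (rectangle R, pointer P). In a leaf, P identifies a
    # data object; in an internal node, P is a child Node.
    entries: list = field(default_factory=list)

def search(node, query):
    """Return the identifiers of all objects whose MBR intersects query."""
    hits = []
    for rect, ptr in node.entries:
        if not intersects(rect, query):
            continue  # nothing below this entry can match
        if node.is_leaf:
            hits.append(ptr)
        else:
            hits.extend(search(ptr, query))
    return hits
```

Every node visited by `search` corresponds to one page read, which is why the packing algorithms discussed next try to keep the number of visited nodes small.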
          Introduction or Overview
         Packing Algorithms

• Preprocessing (packing algorithms to build an
  R-tree) is particularly reasonable when the
  data is fairly static or available a priori

• These algorithms cluster rectangles in an
  attempt to minimize the number of nodes
  visited while processing a query (improved
  query time and optimal space utilization)
          Introduction or Overview
          Packing Algorithms
*Assume*

• Exactly one node fits per disk page

• The data file consists of r rectangles

• Each R-tree node can hold n rectangles
         Introduction or Overview
          General Algorithm
1. Preprocess: order the r rectangles into
   ⌈r/n⌉ consecutive groups of n rectangles
   (the last group may hold fewer)
2. Create the leaf level nodes
3. Load and output the (MBR, page-number)
   for each leaf level node into a temporary
   file
4. Recursively pack these MBRs into nodes at
   the next higher level, until the root node
   is created
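
The general algorithm can be sketched in a few lines. This is an in-memory illustration under stated assumptions: `pack_tree`, `pack_level`, and `mbr` are hypothetical names, rectangles are `(xmin, ymin, xmax, ymax)` tuples, the "page number" is simply the node's index within its level, and `order` stands for whichever ordering rule (NX, HS, or STR) is plugged in.

```python
def mbr(rects):
    """Minimum bounding rectangle of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def pack_level(entries, n, order):
    """Order the entries, then cut them into consecutive groups of n."""
    ordered = order(entries)
    return [ordered[i:i + n] for i in range(0, len(ordered), n)]

def pack_tree(rects, n, order):
    """Pack bottom-up; returns the list of levels, leaf level first.

    Each level is a list of nodes and each node is a list of
    (rectangle, pointer) entries. Assumes n >= 2.
    """
    entries = [(r, i) for i, r in enumerate(rects)]
    levels = []
    while True:
        nodes = pack_level(entries, n, order)
        levels.append(nodes)
        if len(nodes) <= 1:          # the root has been created
            return levels
        # One (MBR, page-number) entry per node feeds the next level up.
        entries = [(mbr([r for r, _ in node]), page)
                   for page, node in enumerate(nodes)]
```

For example, ten rectangles with n = 3 give 4 leaf nodes, then 2 internal nodes, then a single root.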
           Introduction or Overview
    Nearest-X Algorithm (NX)

• The rectangles are sorted by ascending
  x-coordinate

• Assumption: the x-coordinate of the
  rectangle’s center is used as the sort key

• NX.txt
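
As a small sketch of the NX ordering, assuming rectangles are `(xmin, ymin, xmax, ymax)` tuples and using hypothetical function names:

```python
def center_x(rect):
    """x-coordinate of the rectangle's center."""
    xmin, _, xmax, _ = rect
    return (xmin + xmax) / 2.0

def nx_leaves(rects, n):
    """Sort by center x, then cut into consecutive groups of n."""
    ordered = sorted(rects, key=center_x)
    return [ordered[i:i + n] for i in range(0, len(ordered), n)]
```

Because the y-coordinate is ignored, each leaf tends to cover a tall, thin strip of the data space.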
          Introduction or Overview
          Hilbert Sort (HS)
*Hilbert Fractal Space Filling Curve*

• The basic Hilbert Curve on a 2×2 grid

• To derive a curve of order i, each vertex of
  the basic curve is replaced by the curve of
  order i-1, which may be appropriately rotated
  and/or reflected
          Introduction or Overview
           Hilbert Sort (HS)
• The rectangles are sorted according to the
  Hilbert Value of their centers

• The final result is that the set of rectangles
  on the same node will be close to each other
  in the linear ordering

  The Hilbert Value of a point is the length of the
   Hilbert Curve from the origin to the point
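
A common way to compute this value on an integer grid is the iterative bit-manipulation routine below. It is a sketch under assumptions: the grid side `n` must be a power of two, coordinates are integers in `[0, n)`, and the function name is hypothetical. Mapping real rectangle centers onto such a grid is left out.

```python
def hilbert_value(n, x, y):
    """Distance along the Hilbert curve from (0, 0) to cell (x, y).

    n is the side length of the grid and must be a power of two.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect so the sub-quadrant is in canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d
```

On the basic 2×2 grid this visits (0,0), (0,1), (1,1), (1,0) in that order, and on larger grids consecutive values always belong to neighbouring cells, which is what keeps rectangles in the same node close together.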
           Introduction or Overview
           Hilbert Sort (HS)
• HS.txt

• The main difference between the Hilbert R-
  tree and the General R-tree is that non-leaf
  nodes also contain information about the
  LHVs (Largest Hilbert Value of the children):

  (MBR, page-number, LHV)
           Introduction or Overview
           Hilbert Sort (HS)
*Extension to arbitrary floating point
 values*
• Conceptual representation: each coordinate is
  treated as a number of
  2^sizeof(Exponent) + sizeof(Mantissa) bits


• The first bit of the x- and y-coordinates of
  a point determines which quadrant contains it.
  Successive bits determine which successively
  smaller subquadrants contain the point
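
The bit-by-bit quadrant descent can be illustrated as below. The sketch assumes the coordinates have already been converted to unsigned integers of `bits` bits; `quadrant_path` is a hypothetical helper, not part of the original algorithm description.

```python
def quadrant_path(x, y, bits):
    """List of (x_bit, y_bit) pairs, most significant bit first.

    The first pair selects one of the four quadrants, the second pair
    selects a subquadrant inside it, and so on.
    """
    path = []
    for i in range(bits - 1, -1, -1):
        path.append(((x >> i) & 1, (y >> i) & 1))
    return path
```

For x = 5 (binary 101) and y = 3 (binary 011) with 3 bits, the path is (1,0), (0,1), (1,1): right-lower quadrant first, then its left-upper subquadrant, then the right-upper cell within that.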
          Introduction or Overview
    Sort-Tile-Recursive (STR)
• S: a k-dimensional data set of r hyper-
  rectangles

  A hyper-rectangle is defined by k intervals of the
   form [Ai,Bi] and is the locus of points whose i-th
   coordinate falls inside the i-th interval, for all
   1≤i≤k


• k = 1: handled by a regular B-tree
          Introduction or Overview
    Sort-Tile-Recursive (STR)
• k = 2

• The basic idea is to “tile” the data space
  using √(r/n) vertical slices so that each slice
  contains enough rectangles to pack roughly
  √(r/n) nodes

• The number of leaf level nodes: ⌈r/n⌉
          Introduction or Overview
    Sort-Tile-Recursive (STR)
• k = 2

• Sort the rectangles by x-coordinate. A slice
  consists of a run of √(r/n)·n = √(r·n)
  consecutive rectangles from the sorted list

• Then sort the rectangles of each slice by
  y-coordinate and pack them into nodes by
  grouping them into runs of length n
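
The two-dimensional case can be sketched as follows. This is a minimal in-memory illustration of the leaf level only, assuming rectangles are `(xmin, ymin, xmax, ymax)` tuples, that centers are used as sort keys, and that slice counts are rounded up; the function names are hypothetical.

```python
import math

def center(rect, axis):
    """Center of the rectangle along axis 0 (x) or 1 (y)."""
    return (rect[axis] + rect[axis + 2]) / 2.0

def str_leaves(rects, n):
    """Group rectangles into leaf nodes of at most n entries (2-D STR)."""
    r = len(rects)
    if r == 0:
        return []
    p = -(-r // n)                       # ceil(r / n) leaf nodes
    s = math.ceil(math.sqrt(p))          # number of vertical slices
    by_x = sorted(rects, key=lambda q: center(q, 0))
    slice_len = s * n                    # rectangles per slice
    leaves = []
    for i in range(0, r, slice_len):
        # Within a slice, order by y and cut into runs of length n.
        vslice = sorted(by_x[i:i + slice_len], key=lambda q: center(q, 1))
        for j in range(0, len(vslice), n):
            leaves.append(vslice[j:j + n])
    return leaves
```

With 16 points on a 4×4 grid and n = 4, this produces 2 slices of 8 points and 4 leaves, each covering a compact 2×2 tile.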
           Introduction or Overview
    Sort-Tile-Recursive (STR)
• k > 2

• Sort according to the first coordinate

• Divide into (r/n)^(1/k) slabs, where a slab
  consists of a run of
  n·(r/n)^((k-1)/k) = n^(1/k)·r^(1-1/k)
  consecutive hyper-rectangles

• Process each slab recursively on the
  remaining k-1 coordinates
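
To make the slab arithmetic concrete, the sketch below computes the number of slabs and the run length for given r, n, and k. It assumes both quantities are rounded up to integers and uses integer arithmetic to avoid floating-point root errors; `ceil_root` and `slab_plan` are hypothetical helper names.

```python
def ceil_root(value, k):
    """Smallest integer m with m**k >= value (exact ceiling of the k-th root)."""
    m = 0
    while m ** k < value:
        m += 1
    return m

def slab_plan(r, n, k):
    """Return (number of slabs, hyper-rectangles per slab) for one STR step."""
    p = -(-r // n)                         # ceil(r / n) leaf pages
    slabs = ceil_root(p, k)                # ceil(p^(1/k))
    run = n * ceil_root(p ** (k - 1), k)   # n * ceil(p^((k-1)/k))
    return slabs, run
```

For r = 800, n = 100, and k = 3 this gives 2 slabs of 400 hyper-rectangles each. For k = 2 with r = 1600 and n = 100 it gives 4 slices of 400, matching the √(r·n) run length of the two-dimensional case.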
Introduction or Overview
   Visual results
   Introduction or Overview
   Three Packing Algorithms


The three algorithms differ only in how the
rectangles are ordered at each level
    Experimental Methodology
• Goal: provide a solid experimental comparison
  of the algorithms through actual R-tree
  implementations over a wide range of data
  using both “real world” and synthetic data
  sets

• Primary comparison metric: the number of
  disk accesses required to satisfy a query of a
  given size
    Experimental Methodology
• Assume an LRU buffer management routine
  to simplify the parameter space of our
  experiments

• Use a raw disk partition to easily vary the
  actual buffer size without reconfiguring the
  OS or hardware

  LRU: discards Least Recently Used items first
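
The way disk accesses are counted under an LRU buffer can be sketched as below. This is an illustrative model, not the experimental harness used for the results: the `LRUBuffer` class and its attribute names are assumptions, and a "disk access" is counted whenever a requested page is not already buffered.

```python
from collections import OrderedDict

class LRUBuffer:
    """Fixed-capacity page buffer that evicts the least recently used page."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # oldest page first, newest last
        self.disk_accesses = 0

    def access(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)     # hit: mark as most recent
            return
        self.disk_accesses += 1              # miss: page read from disk
        self.pages[page] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict least recently used
```

With a buffer of 2 pages and the request sequence 1, 2, 1, 3, 2, the third request is a hit and the other four are misses, so four disk accesses are counted.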
   Experimental Methodology
• Data set size
  Affects R-tree performance in two ways

  ① Increases the depth of the R-tree

  ② Decreases the percentage of the data set that
    will fit in the buffer
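
The effect of data set size on depth can be checked with a short calculation. The sketch assumes a fully packed tree (every node except possibly the last at each level holds n entries); `packed_height` is a hypothetical helper name.

```python
def packed_height(r, n):
    """Number of levels in a fully packed tree over r rectangles.

    n is the number of entries per node and must be at least 2.
    """
    levels = 0
    count = r
    while True:
        count = -(-count // n)   # ceil(count / n) nodes at this level
        levels += 1
        if count <= 1:
            return levels
```

With 100 rectangles per node, 10,000 rectangles need 2 levels (100 leaves plus a root), while 300,000 rectangles need 3 levels (3,000 leaves, 30 internal nodes, and a root).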
    Experimental Methodology
• Different types of data set

  1. GIS: 53,145 line segments chosen from the
     Long Beach data of the TIGER system

  2. VLSI: a CIF data set of 453,994 rectangles
     whose distribution is highly skewed, both in
     location and in size
    Experimental Methodology
• Different types of data set

  3. CFD: a data set with 52,510 nodes of varying
     density

  4. Synthetic: Uniformly distributed data sets
     containing between 10,000 and 300,000 squares
                  Results
• All results are obtained from R-trees with
  100 rectangles per node

• The results for the NX algorithm are not
  included in the figures since it is not
  competitive
                   Results
• 2-D Synthetic Data

  – For a buffer of size 10, HS requires 31-42%
    more disk accesses than STR for point data, and
    26-32% more disk accesses for region data of
    density 5

  – For a buffer of size 250, HS requires 33-41%
    more disk accesses than STR for point data, and
    26-32% more disk accesses for region data of
    density 5
                   Results
• 2-D Synthetic Data

  – For point data, HS requires 6-22% more disk
    accesses than STR

  – For region data of density 5, HS requires 6-16%
    more disk accesses than STR
                 Results
• 2-D Synthetic Data
• NOTE: as the query region size increases, the
  difference between STR and HS becomes
  smaller
• EXTREME: once the query region is large enough
  that all leaves need to be examined, all R-tree
  packing schemes exhibit the same performance
                   Results
• GIS Tiger Data

  – The HS algorithm requires 20-50% more disk
    accesses than STR. The relative difference
    increases as the buffer size decreases

  – The STR algorithm produces nodes with
    significantly smaller MBR areas than the
    other two algorithms
                   Results
• VLSI

  – The HS algorithm performs slightly better than
    STR for point queries, by 3-11%
                    Results
• Computational Fluid Dynamics

  – For point queries, the STR algorithm requires
    significantly fewer disk accesses than HS,
    especially for small buffer sizes
              Conclusions
• No single algorithm is best for all data sets
• The NX algorithm is not competitive
• STR outperforms HS for mildly skewed or
  uniformly distributed data
• Choosing a packing algorithm is a toss-up
  between STR and HS for highly skewed data,
  particularly for region queries
              Conclusions

• The importance of choosing a packing
  algorithm diminishes as either the
  query size or the buffer size increases

				