					Index Structures For ISAT
    Work In Progress
Biswanath Panda, Mirek Riedewald,
  Paul Chew & Johannes Gehrke
        Where Do We Fit In?
• ISAT uses a tabulation method to find
  approximate function values
• Currently a simple binary tree
• Try existing and new index structures
  – A lot of work on indexing high-dimensional data
  – Very little work with real applications
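
As a rough illustration of the tabulation idea above, the sketch below stores past evaluations and answers nearby queries from a linear approximation, counting retrieves, grows, and adds. It is a simplification under stated assumptions, not ISATAB's code: regions are spheres instead of ellipsoids, the table is scanned linearly instead of through a binary tree, the function is scalar valued, and the names `SimpleTable`, `Record`, `tol`, and `r0` are hypothetical.

```python
import math


class Record:
    """One tabulated point: location, value, gradient, and a trusted radius."""

    def __init__(self, x0, f0, grad, radius):
        self.x0, self.f0, self.grad, self.radius = x0, f0, grad, radius


class SimpleTable:
    """Toy tabulation with spherical regions (an assumption; ISAT uses ellipsoids)."""

    def __init__(self, f, grad, tol, r0):
        self.f, self.grad, self.tol, self.r0 = f, grad, tol, r0
        self.records = []
        self.counts = {"retrieve": 0, "grow": 0, "add": 0}

    def _linear(self, rec, x):
        # First-order approximation around the tabulated point.
        return rec.f0 + sum(g * (xi - ci) for g, xi, ci in zip(rec.grad, x, rec.x0))

    def query(self, x):
        # Retrieve: x falls inside an existing region.
        for rec in self.records:
            if math.dist(x, rec.x0) <= rec.radius:
                self.counts["retrieve"] += 1
                return self._linear(rec, x)
        # Otherwise evaluate the function directly.
        fx = self.f(x)
        if self.records:
            nearest = min(self.records, key=lambda r: math.dist(x, r.x0))
            # Grow: the nearest record's approximation is still accurate at x.
            if abs(fx - self._linear(nearest, x)) <= self.tol:
                nearest.radius = math.dist(x, nearest.x0)
                self.counts["grow"] += 1
                return fx
        # Add: tabulate a new record with the initial radius r0.
        self.records.append(Record(list(x), fx, self.grad(x), self.r0))
        self.counts["add"] += 1
        return fx
```

Replacing the linear scan in `query` with an index lookup is the part the rest of the talk is concerned with.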
             Outline Of Talk
•   New API for ISATAB
•   Description of indexes
•   Experimental results
•   Ongoing work
•   Discussion
         API Design For ISAT
• Why did we need it?
• Old API
  – ISATAB logic and index logic separate
  – Index logic and ISATAB operations on the index still

                         snepQuery (containment search)
                         snepQueryList (proximity search)
         New API Design
• General Index Template
  – ISATAB API (used by ISATAB)
     • snepQuery (containment search)
     • snepQueryList (proximity search)
     • nepUpdate
  – General Index API
     • containmentIterator
     • insertDataItem
     • deleteDataItem
• Specific Index
  – Implements index logic
• Packaging Object
  – Transforms ISATAB Objects into index
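
To make the layering concrete, here is a minimal sketch of how a general index interface and one specific index might fit together. Only the method names `containmentIterator`, `insertDataItem`, `deleteDataItem`, and `snepQuery` come from the slide; their signatures, the `GeneralIndex` and `ListIndex` classes, and the `Ball` data item are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
import math


class GeneralIndex(ABC):
    """Hypothetical interface every specific index would implement."""

    @abstractmethod
    def containmentIterator(self, point):
        """Yield stored items whose region contains the point."""

    @abstractmethod
    def insertDataItem(self, item):
        """Add an item to the index."""

    @abstractmethod
    def deleteDataItem(self, item):
        """Remove an item from the index."""


class Ball:
    """Stand-in data item (assumption): a sphere instead of an ellipsoid."""

    def __init__(self, center, radius):
        self.center, self.radius = center, radius

    def contains(self, point):
        return math.dist(point, self.center) <= self.radius


class ListIndex(GeneralIndex):
    """Simplest specific index: an unordered list scanned linearly."""

    def __init__(self):
        self._items = []

    def containmentIterator(self, point):
        return (item for item in self._items if item.contains(point))

    def insertDataItem(self, item):
        self._items.append(item)

    def deleteDataItem(self, item):
        self._items.remove(item)


def snepQuery(index, point):
    """Containment search written only against the general interface."""
    return next(index.containmentIterator(point), None)
```

Because `snepQuery` touches only the general interface, any other index that follows the same API could be swapped in without changing it.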
         Benefits And Losses
• Advantages
  – General Index Template does not need to be
  – Can use any existing index that follows API
• Disadvantages
  – Different layers cannot talk easily
  – Extra overheads of multiple function calls
        Indexes For ISATAB
• Initial results showed usefulness of caching
  – LRU List
• Two broad categories of indexes
  – Point Indexes
     • Current binary tree
     • Rtree of points
  – Ellipsoid Indexes
     • Rtree of rectangles
     • LRU list
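
The LRU list mentioned above can be sketched as a move-to-front list of ellipsoids. This is an assumption-laden illustration: the ellipsoid is taken to be {x : (x - c)^T A (x - c) <= 1} with A symmetric positive definite, and the `Ellipsoid` and `LRUList` classes, the `capacity` parameter, and the eviction policy are hypothetical rather than taken from ISATAB.

```python
class Ellipsoid:
    """Region {x : (x - c)^T A (x - c) <= 1}; A is a list of rows (assumed form)."""

    def __init__(self, center, A):
        self.center, self.A = center, A

    def contains(self, x):
        d = [xi - ci for xi, ci in zip(x, self.center)]
        n = len(d)
        q = sum(d[i] * self.A[i][j] * d[j] for i in range(n) for j in range(n))
        return q <= 1.0


class LRUList:
    """Keeps the most recently hit ellipsoids at the front of a list."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # most recently used first

    def lookup(self, x):
        for pos, ell in enumerate(self.items):
            if ell.contains(x):
                # Move the hit to the front so repeated queries find it quickly.
                self.items.insert(0, self.items.pop(pos))
                return ell
        return None

    def insert(self, ell):
        self.items.insert(0, ell)
        if len(self.items) > self.capacity:
            self.items.pop()  # evict the least recently used entry
```

A lookup costs one quadratic-form evaluation per scanned entry, so the list pays off only when successive queries tend to hit the same few ellipsoids.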
What Is An Rtree?
           Rtree Properties
• Balanced tree
• Each node must have a minimum
• Overlapping bounding boxes deteriorates
• Delete operation: removes underfull nodes and
  reinserts their entries.
                 Rtree For ISAT
• Point Rtree
   – Indexes centers of ellipsoids
   – Find nearest neighbors both for queries and growing
   – No delete operation
• Bounding Box Rtree
   –   Take the bounding box of the ellipsoids
   –   Check for containment in bounding box for queries
   –   Find nearest neighbor to bounding box for a grow
   –   Delete operation in grow
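
For the bounding box Rtree, the box of an ellipsoid can be computed directly. The sketch below assumes the ellipsoid is given as a center c and a matrix M that maps the unit ball onto it, {c + M u : ||u|| <= 1}; under that assumption the half-width along axis i is the Euclidean norm of row i of M. The function names are hypothetical.

```python
import math


def ellipsoid_bounding_box(center, M):
    """Axis-aligned bounding box of {center + M u : ||u|| <= 1}.

    M is a list of rows. The extent along axis i is the norm of row i.
    """
    half = [math.sqrt(sum(m * m for m in row)) for row in M]
    lo = [c - h for c, h in zip(center, half)]
    hi = [c + h for c, h in zip(center, half)]
    return lo, hi


def box_contains(lo, hi, x):
    """Cheap containment test used before the exact ellipsoid test."""
    return all(l <= xi <= h for l, xi, h in zip(lo, x, hi))
```

A point inside the box is only a candidate, since the box is larger than the ellipsoid, so a query would still need the exact ellipsoid test on each candidate.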
Measurements   List       pointRtree   pointRtree + List   Rtree      Original

Queries        5.98E+06   5.98E+06     5.98E+06            5.98E+06   5.92E+06
Retrieves      5.94E+06   5.96E+06     5.96E+06            5.97E+06   5.90E+06
Adds           12827      2902         2804                2594       2267
Grows          43681      31451        22620               21232      12269
Total Time     1684       6599         748.81              1108       757.84
Search         60.857     5855         97                  390        132
Add            170        39           39                  35         8.8
Grow           7.07       30           38.31               44         47
Evaluations    1427       646          556                 621        529

• Methane simulation with 32 species
• Caching is good for searching, not for growing
• Fast Scan does seem to do well
  – Point Rtree + list and Original ISAT do the best
  – Original ISAT does only 50% primary retrieves
• Rtree does well but is expensive
• What do we not understand?
  – Our code always does more grows
              Another Example
• Methane simulation with 55 species
• Large number of grows: on the order of 10^5 grows in
  2x10^6 queries
• PointRtree + list = 65% hits
• Original ISATAB = 90% hits (44% secondary
• Rtree = 86% hits (could only reach 10^5 queries)
• Simple caching not going to work
• Grow Cost
   – Grow for Rtree more expensive than point Rtree
   – Searching for growable ellipsoids and growing them
     dominates simulation
      Summary Of Experiments
• Different simulations have very different
• Concept of growing still not clear
   – Definitely needed
   – When, what and how much to grow?
• Simple caching definitely not good
• Tradeoff discovered
   – Indexing ellipsoids helps search but growing may
     dominate costs
   – Indexing points helps updates but search suffers.
                   Ongoing Work
• Hybrid index structures
   – Transition from update friendly to search friendly
   – Dynamically change index parameters
       • Study statistics as the simulation proceeds
• Rtree with ellipsoidal bounding region
• Random Projections
   – Project ellipsoids on random lines
   – If a point lies outside any projection, then it lies outside
     the ellipsoid; a point within all projections is only a candidate
• Simple averaging of nearest neighbors
• Error Analysis with different index structures
• First paper to introduce problem to the database
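
The random-projection idea in the list above can be sketched as follows. The assumptions are the same generator form as before, an ellipsoid {c + M u : ||u|| <= 1}, whose projection onto a unit direction d is the interval d·c ± ||M^T d||. Passing every interval test is a necessary condition for containment, not a sufficient one, so the sketch treats it as a filter. All function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def random_unit_vector(dim, rng):
    """Direction drawn uniformly on the sphere by normalizing Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(v, v))
    return [t / norm for t in v]


def project_ellipsoid(center, M, d):
    """Interval covered by {center + M u : ||u|| <= 1} along unit direction d."""
    # M^T d, computed column by column (M is a list of rows).
    mtd = [sum(M[i][j] * d[i] for i in range(len(d))) for j in range(len(M[0]))]
    half = math.sqrt(dot(mtd, mtd))
    mid = dot(center, d)
    return mid - half, mid + half


def may_contain(center, M, directions, x):
    """False means x is certainly outside; True means x is still a candidate."""
    for d in directions:
        lo, hi = project_ellipsoid(center, M, d)
        if not (lo <= dot(x, d) <= hi):
            return False
    return True
```

In practice the intervals would be precomputed once per ellipsoid and stored, so that each query costs one dot product per direction.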