Progressive Computation of Constrained Subspace Skyline Queries




 Evangelos Dellis (1), Akrivi Vlachou (1), Ilya Vladimirskiy (1),
 Bernhard Seeger (1), Yannis Theodoridis (2)
 (1) Department of Computer Science, University of Marburg, Germany
 (2) Department of Computer Science, University of Piraeus, Greece




Overview
 Introduction
 Motivation - Related Work
 Basic STA
 Improved Pruning
 Indexing using Low-dimensional R-trees
 Experimental Evaluation
 Conclusions – Future Work



Finding A Hotel Close to the Beach




 Which one is better?
   i or h? (i, because its price and distance
    dominate those of h)
   i or k?
    Skyline Queries




 Retrieve points not dominated by any other point:

        A point p dominates another point q if p is as good as or better than q in all
        dimensions and strictly better in at least one dimension.
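
The dominance test can be sketched in a few lines. This is only an illustration, assuming that smaller values are better in every dimension; the hotel labels and numbers below are made up and are not taken from the slide's figure.

```python
def dominates(p, q):
    """True if p is no worse than q everywhere and strictly better somewhere.

    Assumption: smaller is better in every dimension.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Naive quadratic skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to the beach in km).
hotels = {"h": (120, 3.0), "i": (90, 2.0), "k": (60, 4.5), "m": (150, 0.5)}
result = skyline(list(hotels.values()))
# h is dropped because i is both cheaper and closer; i, k and m remain.
```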
  Skyline of Manhattan




 Which buildings can we see?
   Higher or nearer
    (a building dominates another building if it is higher, closer to the
    river, and has the same x position)
 SQL Extension
 SQL syntax:




 Examples:




a) Find a hotel that is cheap and close to the beach.
b) Find salespersons who were very successful in 1999 and have a low salary.

      Motivation
     Constrained Skyline (car database):
          A user may only be interested in records within the price range from 3 thousand to 7
           thousand euros and with mileage reading between 20K and 100K.




     The traditional skyline (dashed line) fails to return interesting points.

    Motivation (continued)
   Subspace Skyline:
        A car database could contain many other attributes of the cars:
              horsepower, age, fuel consumption, etc…

        A customer who is sensitive to price and mileage (a 2-dimensional
         subspace) would like to pose a skyline query on those attributes only, rather than on the
         whole data space.




   While the dimensionality of the corresponding data space might be rather high, skyline
    queries generally refer to a low dimensional subspace.

   Constrained subspace skyline queries generalize all meaningful
    skyline queries over a given dataset.


    Related Work
   SKYCUBE [VLDB 2005, SIGMOD 2006]:
       The Skyline Cube (SKYCUBE) consists of the skylines in all possible (2^d - 1)
        subspaces.


       Drawback: the points of the full-space skyline and their duplicates cannot be
        pre-computed, since the result depends on the constraints given with each query,
        whereas the SKYCUBE is static.


   SUBSKY [ICDE 2006]:
       Transforms the multi-dimensional data into one-dimensional values, which permits
        indexing the dataset with a B+-tree.


       Drawbacks:
        1.   It is unable to answer constrained subspace skyline queries, as all points have to be
             transformed in a pre-processing step.
        2.   It does not deliver the skyline points progressively.
    Related Work (Continued)
   BBS [SIGMOD 2003, TODS 2005]:

        All points are indexed in an R-tree.
        mindist(MBR) = the L1 distance between the MBR's
         lower-left corner and the origin (NN).
        A heap of index entries and objects is kept,
         ordered by mindist.

   BBS is still the most efficient method for
    (constrained subspace) skyline retrieval!
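
The heap-driven traversal described above can be sketched as follows. This is a simplified illustration rather than the published BBS implementation: a flat list of leaf nodes stands in for the R-tree, coordinates are assumed to be non-negative with smaller values preferred, and the name `bbs_skyline` is hypothetical.

```python
import heapq
from itertools import count


def dominates(p, q):
    """p is no worse than q in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def bbs_skyline(leaves):
    """Yield skyline points progressively, in ascending mindist order.

    leaves: list of leaf nodes, each a non-empty list of coordinate tuples.
    """
    tie = count()  # unique tie-breaker so the heap never compares payloads
    heap = []
    for leaf in leaves:
        corner = tuple(min(c) for c in zip(*leaf))  # lower-left corner of the MBR
        heapq.heappush(heap, (sum(corner), next(tie), corner, leaf))
    skyline = []
    while heap:
        _, _, corner, leaf = heapq.heappop(heap)
        if any(dominates(s, corner) for s in skyline):
            continue  # the whole entry lies in an already dominated region
        if leaf is None:  # the entry is a data point, so it is a skyline point
            skyline.append(corner)
            yield corner
        else:  # the entry is a leaf node: expand it
            for p in leaf:
                if not any(dominates(s, p) for s in skyline):
                    heapq.heappush(heap, (sum(p), next(tie), p, None))


leaves = [[(1, 9), (2, 8)], [(3, 3), (4, 4)], [(9, 1), (8, 8)]]
points_in_order = list(bbs_skyline(leaves))
```

Because entries leave the heap in ascending mindist order, a popped point can only be dominated by a point reported earlier, which is what makes the output progressive.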




    Related Work (Continued)
Shortcomings of BBS:
   Maintaining a single high-dimensional index to support constrained skyline queries of
    arbitrary dimensionality is not suitable:
       It has been shown that the performance of such high-dimensional indexes
        deteriorates with an increasing number of dimensions (Curse of Dimensionality).
       The performance of low-dimensional constrained skyline queries degrades when the
        dimensionality of the indexed space is high while that of the query space is low
        (Random Grouping Effect).


   Only low-dimensional indexes, e.g. R-trees, seem to perform well in practice and for
    that reason have found their place in commercial database management systems
    (DBMS).



    Our Approach
   We vertically partition the data space into several low-dimensional subspaces and
    index each of these subspaces using an R-tree.
   A constrained skyline query is then split into several sub-queries, each of which
    is processed by the corresponding index using incremental NN search.




   TA-INDEX [DAWAK 2005]: An algorithm for vertically partitioned nearest neighbor
    queries.
    Contributions
   We present a threshold-based skyline algorithm (called STA), which exploits multiple
    indexes.
   We propose different pruning strategies to identify dominated regions and to discard
    irrelevant sub-trees of the indexes.
   A workload-adaptive strategy for determining the number of indexes and the
    assignment of dimensions to the indexes is presented.




    Problem Definition
Constrained Subspace Skyline Queries:
    For a point p∈Dc in the dimension set S΄:
        the dominance region contains points which are dominated by p.
        the anti-dominance region refers to the set of points dominating p.




    A point p∈D is said to dominate another point q∈D on subspace S΄ if:
    1.   on every dimension di∈S΄, pi ≤ qi; and
    2.   on at least one dimension dj∈S΄, pj < qj.
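
A brute-force version of this definition is easy to write down. The sketch below is illustrative only: the function names, the `constraints` mapping and the car records are assumptions, and the constraint ranges mirror the price and mileage example from the motivation slide.

```python
def dominates(p, q, dims):
    """p dominates q on the subspace dims (smaller values are better)."""
    return all(p[d] <= q[d] for d in dims) and any(p[d] < q[d] for d in dims)


def constrained_subspace_skyline(points, dims, constraints):
    """Skyline on dims among the points that satisfy all range constraints.

    constraints: {dimension id: (low, high)}, bounds inclusive.
    """
    inside = [p for p in points
              if all(lo <= p[d] <= hi for d, (lo, hi) in constraints.items())]
    return [p for p in inside if not any(dominates(q, p, dims) for q in inside)]


# Hypothetical cars: (price in thousand euros, mileage in thousand km, age in years).
cars = [(2, 150, 12), (4, 90, 6), (5, 40, 8), (6, 60, 3), (8, 10, 1)]
result = constrained_subspace_skyline(
    cars, dims=(0, 1), constraints={0: (3, 7), 1: (20, 100)})
# Only cars inside the constrained region compete; age is ignored.
```

Note that the dominance test runs only among the points inside the constrained region, so a point outside the region can never remove a point inside it.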
    One-point Pruning
    Observation: A point p is a skyline point in S΄ if and only if there exists no point q that
     belongs to the anti-dominance area of p for all dimension sets Si΄ (1 ≤ i ≤ n).


    Pruning with the Nearest Neighbor: we need to prune objects that are not part of the skyline.
     The nearest neighbor is a good pruning point for two reasons:
    1.   Because it is a member of the skyline, no point dominates it.
    2.   Among all the skyline points, it is one whose dominance region has a large volume, and
         hence it is expected to prune a large percentage of the data points.
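
As a small illustration of one-point pruning, the sketch below picks the point nearest to the origin under the Manhattan distance and discards everything it dominates. The function name and the sample points are assumptions made for this example.

```python
def dominates(p, q):
    """p is no worse than q in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def prune_with_nearest_neighbor(points):
    """Return the L1-nearest neighbor of the origin and the points it cannot prune."""
    # Any dominating point would have a strictly smaller coordinate sum,
    # so the nearest neighbor is guaranteed to be a skyline point.
    nn = min(points, key=sum)
    survivors = [p for p in points if not dominates(nn, p)]
    return nn, survivors


points = [(2, 2), (1, 6), (6, 1), (3, 3), (4, 5), (5, 2), (2, 7)]
nn, survivors = prune_with_nearest_neighbor(points)
# A single pruning point removes four of the seven points in this example.
```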




    STA: A Threshold-based Skyline Algorithm

    Our algorithm works in two steps:
        Filter step:
             All retrieved points are organized in a priority queue (heap) based on their Manhattan
              distance according to the dimension set S΄.
             We use the Manhattan distance of the last reported point of Si΄ as a threshold to speed
              up the filtering phase.

        Refinement step: (domination test)
             The refinement step begins when the first constrained nearest neighbor based on S΄ is
              returned by the filter step. This point is guaranteed to be a skyline point.
             In the next iteration, where another candidate is found, the refinement step needs to
              determine whether this candidate is a skyline point or not.


    The dominance test is performed in a way similar to traditional window queries using
     a main-memory R-tree whose dimensionality is equal to the query dimensionality.
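
The two steps can be put together in a minimal sketch. It rests on several assumptions that are not stated on the slide: sorted lists stand in for incremental constrained NN search on the R-trees, the indexes are accessed round-robin, the threshold is taken as the sum of the last partial distances reported by each index, and a linear scan replaces the main-memory R-tree in the dominance test. The name `sta_skyline` and its parameters are hypothetical.

```python
import heapq
from math import inf


def dominates(p, q, dims):
    """p dominates q on dims (smaller values are better)."""
    return all(p[d] <= q[d] for d in dims) and any(p[d] < q[d] for d in dims)


def sta_skyline(points, index_dims, lo, hi):
    """Return the ids of the constrained subspace skyline points.

    index_dims: one tuple of dimension ids per index; together they form S'.
    lo, hi: mappings from dimension id to the bounds of the constrained region.
    """
    query_dims = [d for dims in index_dims for d in dims]

    def partial(p, dims):  # Manhattan distance to the region's lower corner
        return sum(p[d] - lo[d] for d in dims)

    def inside(p, dims):
        return all(lo[d] <= p[d] <= hi[d] for d in dims)

    def full(p):  # summed per index so that it is comparable with the threshold
        return sum(partial(p, dims) for dims in index_dims)

    # Stand-in for incremental constrained NN search on each low-dimensional index.
    streams = [sorted((partial(p, dims), pid)
                      for pid, p in enumerate(points) if inside(p, dims))
               for dims in index_dims]
    pos = [0] * len(streams)
    last = [0] * len(streams)  # last reported partial distance per index
    seen, heap, skyline, turn = set(), [], [], 0

    while True:
        threshold = sum(last)
        # Refinement: no unseen point can be closer than the threshold, so
        # candidates at or below it only need a test against the skyline so far.
        while heap and heap[0][0] <= threshold:
            _, pid = heapq.heappop(heap)
            if not any(dominates(points[s], points[pid], query_dims) for s in skyline):
                skyline.append(pid)
        if threshold == inf:
            return skyline
        # Filter: fetch the next neighbor from one index (round-robin).
        i = turn % len(streams)
        turn += 1
        if pos[i] == len(streams[i]):
            last[i] = inf  # index exhausted: every qualifying point has been seen
            continue
        last[i], pid = streams[i][pos[i]]
        pos[i] += 1
        if pid not in seen:
            seen.add(pid)
            if inside(points[pid], query_dims):
                heapq.heappush(heap, (full(points[pid]), pid))


# Hypothetical cars: (price, mileage, age); one index per queried attribute.
cars = [(4, 90, 6), (5, 40, 8), (6, 60, 3), (2, 150, 12), (8, 10, 1)]
ids = sta_skyline(cars, index_dims=[(0,), (1,)], lo={0: 3, 1: 20}, hi={0: 7, 1: 100})
```

In this run the first skyline point is reported after four index accesses, before either index has been scanned completely, which is the progressive behaviour the threshold is meant to provide.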
    Index Scheduling
   Round-robin strategy:
       Inefficient.
   We are interested in more advanced strategies that increase the threshold
    quickly.
       We choose the index whose next access increases the partial distance the most, as
        this is most beneficial for our threshold.


   Strategies for index scheduling for nearest neighbor search on a vertically partitioned
    data set have been studied in [DAWAK 2005].
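
One possible reading of the greedy choice is sketched below. It assumes that the partial distance of each index's next neighbor can be inspected before the access is made, for example from the head of the incremental NN queue; that assumption, the data layout and the name `pick_index` are not taken from the slides.

```python
def pick_index(streams, pos, last):
    """Choose the index whose next access raises its partial distance the most.

    streams: per index, a list of (partial distance, point id) in ascending order.
    pos:     per index, the position of the next unread entry.
    last:    per index, the partial distance reported by the previous access.
    Returns the index number, or None if every index is exhausted.
    """
    best, best_gain = None, -1
    for i, stream in enumerate(streams):
        if pos[i] < len(stream):
            gain = stream[pos[i]][0] - last[i]  # increase of the threshold
            if gain > best_gain:
                best, best_gain = i, gain
    return best


streams = [[(1, 0), (2, 1)], [(5, 2), (6, 3)]]
first_choice = pick_index(streams, pos=[0, 0], last=[0, 0])
second_choice = pick_index(streams, pos=[0, 1], last=[0, 5])
```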




    Improved Pruning
   Motivating example:
       Non-uniform distributions → points form clusters
       Need: Pruning using multiple points




   Simultaneous pruning:
       we are not able to prune simultaneously in both subspaces using the same point.

    Multiple-point Pruning
   Observation: when points lying in the dominance region of a point are not discarded
    in at least one subspace, we are able, under certain conditions, to discard points
    in all remaining subspaces while guaranteeing no false dismissals.
       We use the points that are retrieved as local constrained nearest neighbors from one
        index for pruning in all other indexes.




   Example: a 4-dimensional data space is divided into two 2-dimensional subspaces.
    When the point p1 is retrieved from subspace S1, the dominance area of the
    point p1 in subspace S2 is used for pruning.
    Avoiding False Hits
   Unfortunately, by following this strategy some skyline points are falsely discarded.
       Case 1: Let the point q in the projection S2 collapse onto the point q1. The point p is not
        a skyline point in S, since it is dominated by q in all dimension sets of S.
       Case 2: On the other hand, if the point q in the projection S2 collapses onto the point
        q2, then point p may be discarded falsely, since it is a potential skyline point.




   Solution: To discard points from the dominance area of p in S2, the point p and a
    point qi must be dominated by the projection of the same point in S2 and S1
    respectively. This condition must hold for each point qi which belongs to the discarded
    area of S1.
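
The risk of a false dismissal can be illustrated with two concrete points. The example is hypothetical and does not reproduce the two cases of the missing figure; it only shows that dominance in one projection does not imply dominance in the full query space.

```python
def dominates(p, q, dims):
    """p dominates q on dims (smaller values are better)."""
    return all(p[d] <= q[d] for d in dims) and any(p[d] < q[d] for d in dims)


S1, S2 = (0, 1), (2, 3)  # a 4-d space split into two 2-d subspaces
p1 = (1, 1, 2, 2)        # point retrieved as a nearest neighbor from the S1 index
q = (0, 3, 3, 3)         # candidate for pruning in the S2 index

dominated_in_s2 = dominates(p1, q, S2)       # q lies in p1's dominance area in S2
dominated_in_s1 = dominates(p1, q, S1)       # but q is better than p1 on dimension 0
dominated_in_s = dominates(p1, q, S1 + S2)   # so p1 does not dominate q in S
```

Discarding q from the S2 index merely because it lies in the dominance area of p1 there would therefore drop a point that p1 does not dominate, which is why an additional condition is needed.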
    Random Grouping Effect
   Random Grouping Effect: since not all dimensions are used as split axes when a leaf
    node is created, an index that receives a query requiring projection performs like
    a random low-dimensional index,
       i.e. an index that groups the points into leaf nodes in a mostly random manner.


   Example: consider a 10-d data space and assume that we are interested in retrieving
    the skyline of any 2-d subspace.
       If only two dimensions are used for splitting, then the probability that the two queried
        dimensions are the ones that were used for splitting is very small.
       Thus, the query performance is similar to that of a 2-d index in which the
        data points were grouped together randomly.



    Number of Indexes
   If every leaf node is split at least once in each dimension, we need a total
    of at least 2^d leaf nodes.
        Well-performing index: every leaf node is split once in each dimension (L ≥ 2^d).
        (This defines a maximum dimensionality for a low-dimensional index.)



   Example: 32-d Color dataset, 68,040 records.
        Our formula suggests 2 indexes.




   In this way we index high-dimensional datasets more effectively, avoiding the
    performance degradation caused by the random grouping effect.
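
The bound L ≥ 2^d can be turned into a rough calculation. The sketch below is an assumption about how the bound might be applied, since the slide's own formula is not reproduced in this text: it takes the largest d with 2^d ≤ L as the maximum index dimensionality and divides the full dimensionality by it. The function names are hypothetical.

```python
from math import ceil


def max_index_dims(num_leaves):
    """Largest d such that 2**d <= num_leaves (never less than 1)."""
    return max(1, num_leaves.bit_length() - 1)


def suggested_num_indexes(total_dims, num_leaves):
    """Number of indexes needed if no index exceeds max_index_dims dimensions."""
    return ceil(total_dims / max_index_dims(num_leaves))


# Assumption: the record count 68,040 is plugged in as a stand-in for L.
# It overestimates the real number of leaf nodes, so this is only a loose bound.
d_max = max_index_dims(68040)              # 16, since 2**16 = 65536 <= 68040 < 2**17
n_indexes = suggested_num_indexes(32, 68040)
```

Under that loose assumption the calculation yields 2 indexes for the 32-d Color dataset, the same number quoted in the example above.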
    Dimension Assignment Algorithm
   Number of Distinct Values: a quality measure of a subspace Si
       Few distinct values mean that the projections of many points coincide on the same
        low-dimensional point, so that a point may be dominated by some duplicate point in
        the query-dimensional space.



   DAA: a greedy algorithm to distribute the attributes over the n indexes.
       restrict the random grouping effect
       maximize the number of distinct values
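
The quality measure itself is simple to compute. The sketch below only counts distinct projections; it does not reproduce the greedy DAA, whose details are not given here, and the sample points are made up.

```python
def distinct_values(points, dims):
    """Number of distinct projections of the points onto the dimension set dims."""
    return len({tuple(p[d] for d in dims) for p in points})


points = [(1, 5, 0.3), (1, 5, 0.7), (2, 5, 0.3), (2, 6, 0.9)]
on_dim0 = distinct_values(points, (0,))        # projections 1 and 2
on_dims01 = distinct_values(points, (0, 1))    # (1, 5), (2, 5), (2, 6)
on_dim2 = distinct_values(points, (2,))        # 0.3, 0.7, 0.9
```

A dimension set with more distinct projections separates the points better, which is the property the assignment tries to maximize.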




    Workload-adaptive Extension
   User preferences are correlated:
       use multiple indexes, which are built on the most preferred subspaces
   Simple, but very powerful extension:
       associate some probability with each subspace
        (the frequency with which it is queried)
       weight the cost estimation of each dimension set by its probability.


   This extension allows us to examine the performance of our algorithm under a
    workload, which is closer to real applications, instead of picking random subspaces.




Experimental Evaluation
Datasets:
    Three data sets from real-world applications:
         NBA dataset contains 17,000 13-dimensional points, where each point corresponds to
          the statistics of a player in 13 categories.
         Color moments dataset contains 9-dimensional features of 68,040 photo images
          extracted from the Corel Draw database.
         Color histogram consists of 32-dimensional features, representing the histogram of an
          image.
    Additionally, we generated 10-dimensional uniform datasets with a cardinality of
     10,000, 50,000 and 100,000 data points.


Implementation Details:
   We compare our algorithm against the current state-of-the-art method BBS.
        We set the page size for each R-tree to 4K and each dimension was represented by a
         real number.
   Measurement: the number of disk I/Os (page accesses).
    Examination of Constrained Subspace Skylines

    Effect of Constrained Region:
        Varying constrained region from 50% to 100% of each axis.
        We examine subspaces with dimensionality of dsub=3.
        Uniform dataset: full space dimensionality of 10-d and a cardinality of 50,000 points.




    Observation: the performance of our algorithm is not affected significantly by the
     size of the constrained region.
    Examination of Constrained Subspace Skylines

    Effect of Subspace Dimensionality
        We vary the query subspace dimensionality from 2 to 4.
        We keep the constrained region constant (60% of the values of each
         requested axis).

    [Figure panels: a) 10-d Uniform Dataset, 50k; b) 9-d Color Dataset, 68k]

    These results demonstrate that the STA algorithm leads to substantially fewer page
     accesses than BBS.
    Scalability with the Dataset Cardinality
    We use uniform datasets (dimensionality of 10-d):
        Vary the cardinality between 10,000 and 100,000 points.
        We set the constrained region to cover 60% of each axis.
        In addition we request the skyline of 3-dimensional subspaces.




   The proposed method scales better with cardinality than BBS.
    Scalability with Full-space Dimensionality
    Varying the Full-space Dimensionality:
        We set the constrained region to cover 60% of each axis. In addition we request the
         skyline of 3-dimensional subspaces.
        Uniform dataset with varied dimensionality of 10, 20 and 30-d.
        Real datasets with varied dimensionality of 9, 13 and 32-d




    [Figure panels: a) Uniform Datasets; b) Real Datasets]

   In both cases our algorithm consistently outperforms BBS in this experiment.
        Adaptation to the query Workload
        Query-workload using the “80-20” law:
           20% of the attributes contribute to 80% of the queries
           32-dimensional Color histogram dataset, which consists of 68,040 records




         [Figure panels: a) I/O cost; b) CPU cost]
         Scalability using the “80-20” law:
             Subspace skyline with dsub = 3
             Constrained Region: 60% of each axis



    Conclusions – Future Work
   We addressed the problem of Constrained Subspace Skyline Queries and
    presented a threshold-based skyline algorithm, which exploits multiple indexes.
   We proposed different pruning strategies to identify dominated regions and to discard
    irrelevant sub-trees of the indexes.
   A workload-adaptive strategy for determining the number of indexes and the
    assignment of dimensions to the indexes is presented.
   An extensive performance evaluation shows the superiority of our proposed technique
    over related work.



    Future Work may include:
        Examination of STA using external queues
        Development of a Cost Model for Constrained Subspace Skyline Queries


References
   SKYCUBE [VLDB 2005, SIGMOD 2006]:
        Yuan, Y., Lin, X., Liu, Q., Wang, W., Yu, J., Zhang, Q.: Efficient Computation of the Skyline
         Cube. Very Large Data Bases Conference (VLDB), Trondheim, Norway, August 30 -
         September 2, 2005.
        Pei, J., Jin, W., Ester, M., Tao, Y.: Catching the Best Views of Skyline: A Semantic Approach
         Based on Decisive Subspaces. Very Large Data Bases Conference (VLDB), Trondheim,
         Norway, August 30 - September 2, 2005.
        Xia, T., Zhang, D.: Refreshing the Sky: The Compressed Skycube with Efficient Support for
         Frequent Updates. To appear in Proceedings of the 2006 ACM SIGMOD International
         Conference on Management of Data (SIGMOD), Chicago, IL, USA, 2006.
   SUBSKY [ICDE 2006]:
        Tao, Y., Xiao, X., Pei, J.: SUBSKY: Efficient Computation of Skylines in Subspaces. IEEE
         International Conference on Data Engineering (ICDE), Atlanta, Georgia, USA, April 3-7,
         2006.
   BBS [SIGMOD 2003, TODS 2005]:
        Papadias, D., Tao, Y., Fu, G., Seeger, B.: An Optimal and Progressive Algorithm for Skyline
         Queries. ACM Conference on the Management of Data (SIGMOD), San Diego, CA, June 9-12,
         2003.
        Papadias, D., Tao, Y., Fu, G., Seeger, B.: Progressive Skyline Computation in Database
         Systems. ACM Transactions on Database Systems, 30(1): 41-82, 2005.
   TA-INDEX [DAWAK 2005]:
        Dellis, E., Seeger, B., Vlachou, A.: Nearest Neighbor Search on Vertically Partitioned
         High-Dimensional Data. In Proceedings of the 7th International Conference on Data
         Warehousing and Knowledge Discovery (DaWaK), Copenhagen, Denmark, 2005.


Thank You



       Questions?



