Implementation Of Skyline Query Algorithms

Marios Kokkodis, University of California Riverside, mak@cs.ucr.edu
Pamela Bhattacharya, University of California Riverside, pamelab@cs.ucr.edu

ABSTRACT
In this work we extend the publicly available Spatial Index Library to support skyline queries. More specifically, we implement three basic algorithms (Block Nested Loop, Divide and Conquer, and Branch and Bound) as well as the naive brute force approach. Finally, we test these implementations on various datasets and compare their efficiency in terms of disk I/Os and run time.

1. INTRODUCTION
Skyline points are defined as those points of a dataset that are not dominated by any other point. The dominance criteria are defined so that a point dominates another if it is as good as or better in all dimensions, and strictly better in at least one. The skyline points themselves are mutually incomparable: they are all "equally good," and none of them dominates another.

Computing the skyline points of a dataset is essential for applications that involve multi-criteria decision making. As an example, we adopt the one given by Börzsönyi et al. in [1]. Tourists often search for "cheap hotels near the beach." Such a query should return all hotels for which no other hotel is both cheaper and closer to the beach, i.e. all hotels that are not dominated in the two attributes (cost, distance to the beach). The result is the skyline of the queried dataset. This example, taken from [1], is presented in Figure 1.

Figure 1: Skyline of "Cheap hotels near the beach"

The first skyline query algorithm was proposed by Kung et al. in 1975 [4]. It was based on the "divide and conquer" strategy and had high complexity. In the later eighties and nineties several algorithms were proposed, but they did not gain much popularity because they could not handle datasets that do not fit in main memory. Börzsönyi et al. [1] in 2001 were the first to introduce the notion of skyline queries for significantly large databases. They provided three algorithms: Block Nested Loop, Basic Divide and Conquer, and M-way Divide and Conquer. In the same year, Tan et al. [6] presented two other novel approaches for computing the skyline points: Bitmap and Index. Furthermore, in 2002 Kossmann et al. [3] presented the NN algorithm for evaluating 2-dimensional skylines. Finally, Papadias et al. in 2005 presented the Branch and Bound algorithm [5]. Recently, several optimized and novel approaches for computing skyline queries have been proposed, but they are out of the scope of this work.

In this work, we extend the Spatial Index Library by Marios Hadjieleftheriou [2] to support skyline queries. More specifically, we implement the following three skyline algorithms:

• Block Nested Loop (BNL) - [1]
• Divide and Conquer (DCA) - [1]
• Branch and Bound (BBS) - [5]

Furthermore, we run these algorithms on various datasets and compare their efficiency in terms of disk I/Os and run time.

2. ALGORITHMS DESCRIPTION

2.1 Computing "Dominance"
The notion of dominance is crucial in the definition of skyline points. Given two data points, we compute whether one dominates the other, or whether they are incomparable, by using the dominating factors specified in the query. In the classical example provided by Börzsönyi et al. in [1], the dominating factors are whether a hotel is "cheap" and "near the beach." The formal definition of dominance is as follows [1]:

Tuple p = (p1, ..., pk, pk+1, ..., pl, pl+1, ..., pm, pm+1, ..., pn)
Tuple q = (q1, ..., qk, qk+1, ..., ql, ql+1, ..., qm, qm+1, ..., qn)

Suppose we have the following skyline query:

Skyline of d1 MIN, ..., dk MIN, dk+1 MAX, ..., dl MAX, dl+1 DIFF, ..., dm DIFF

where MIN (respectively MAX, DIFF) indicates that we are looking for a minimum (respectively maximum, different) value in that specific dimension. Then tuple p dominates tuple q if the following three conditions hold:

pi ≤ qi for all i = 1, ..., k
pi ≥ qi for all i = (k+1), ..., l
pi = qi for all i = (l+1), ..., m

(with p strictly better than q in at least one of the first l dimensions; otherwise the two tuples are identical on the queried dimensions and neither dominates the other).

2.2 Block Nested Loop (BNL)
A somewhat naive way of computing the skyline points of a given dataset was proposed by Börzsönyi et al. in 2001 [1]. Before the skyline computation begins, a window is allocated in main memory to keep the candidate skyline points. If there is not enough memory for all the candidates, a temporary file is created which stores all the points that are incomparable with the candidate skyline points in the window. Each tuple p of the input dataset is compared with all the tuples in the window. Such a comparison has three possible outcomes:

• Eliminate a tuple from the input dataset: A tuple from the dataset is eliminated when it is dominated by at least one tuple in the window. The eliminated tuple is not considered for comparison in the rest of the algorithm.

• Eliminate tuples from the window: When a tuple from the dataset dominates one or more tuples in the window, the former substitutes all the latter in the window. The tuples thus removed from the window are treated like tuples eliminated from the input dataset, and hence are not considered again for comparison.

• Incomparable tuples: When a tuple is incomparable with all the tuples in the window, it is inserted into the window if there is enough space. Otherwise it is copied to a temporary file and is considered for comparison (only against the remaining tuples in the temporary file) in the next iteration.

For the very first iteration, both the window and the temporary file are empty, and hence the first tuple is inserted in the window. The iterations continue until all the tuples in the dataset have been compared with the tuples in the window.

Complexity: The best case complexity is O(n), which occurs when all the candidate skyline points fit in memory at every instance. In the worst case the complexity is O(n^2), the same as the naive nested loop algorithm (brute force, see §3.4).

2.3 Basic Divide-and-Conquer Algorithm (BDC)
As stated in [1], Kung et al. in 1975 proposed the first "divide-and-conquer" algorithm for computing skylines. Börzsönyi et al. [1] used the same approach and modified it in a way that is efficient for large databases. The algorithm is stated as follows:

• Compute the Median: In this step we compute the median mp of the input dataset for some "dominating" dimension dp. For example, if we query for all "cheap" hotels in Los Angeles, our "dominating" dimension for computing the median is the "price" of each input hotel. We then divide the input dataset into two parts, P1 and P2. P1 contains all tuples whose value of attribute dp is "better" (in our example: less, i.e. cheaper) than mp; P2 contains the remaining tuples.

• Compute the Skyline recursively: In a recursive fashion, the partitions P1 and P2 computed in the earlier step are further partitioned, and this continues until each Pi (i = 1, 2) contains one (or very few) tuples, at which point computing its skyline is trivial. The result consists of the skyline points of each Pi and is kept in Si respectively.

• Compute the Overall Skyline: From the way we partitioned the data, none of the tuples in S1 can be dominated by a tuple in S2, since every tuple in S1 is better in dimension dp than every tuple in S2. Hence, the only thing we need to do in the merging phase is to eliminate all the tuples in S2 that are dominated by some tuple in S1. The skyline is then S1 ∪ merge(S1, S2). (In §3.2 we present the merging phase in detail; it is the most important part of this algorithm.)

Figure 2: Merging Phase of Divide and Conquer

Complexity: Both the best case and the worst case complexity are O(n(log n)^(d-2)) + O(n log n), where n is the number of input tuples and d is the number of dimensions of the skyline. Hence, we expect this algorithm to outperform BNL in BNL's worst cases and to be worse in the good ones.

2.4 Branch-and-Bound Skyline Algorithm (BBS)
In 2005, Papadias et al. [5] proposed the Branch-and-Bound Skyline algorithm. This algorithm is based on nearest-neighbor search over a multidimensional index (R-trees) to find the skyline points. The algorithm proceeds as follows:

• Index: use an R-tree to index the objects, and set the list of skyline points (S) to null.

• Access Root: Insert all the entries of the root into a heap, sorted in ascending order of their minimum distance from a minimum point. For positive values this point can be the origin (i.e. x = [0...0]^T), while for negative values it can be the point that consists of the minimum value in each dimension. The minimum distance (mindist) of an R-tree node to a point is the Euclidean distance from the point to the lower left corner of the node's Minimum Bounding Rectangle (MBR). The mindist of two points is their Euclidean distance.

• Compute Skyline Points: The procedure examines the elements of the heap. If the top element is an intermediate node, and it is not dominated by some point in S, it is expanded and all of its children that are not dominated by some point in S are inserted into the heap. If the top element of the heap is a data point, it is inserted into S. Summarizing, the following steps are taken:

– Expand the entry e in the heap which has the lowest mindist,
– Compare e with all the entries in S, and if e is "dominated" by any point in S, discard e,
– If e is not "dominated" and e is an intermediate node, insert each child of e that is not dominated into the heap,
– If e is a data point, insert it into the skyline list S.

Complexity: The main memory requirement of BBS is of the same order as the size of the skyline list, since both the heap and the main-memory R-tree are of this order. Furthermore, the number of node accesses of BBS is at most s·h, where s is the number of skyline points and h is the height of the R-tree.

Figure 3: Skyline Points of the Island dataset

Figure 4: Skyline Points of the Long Beach dataset

3. IMPLEMENTATION DETAILS
All three algorithms take the same R-tree files as input. For both the Block Nested Loop and the Divide and Conquer algorithms the index was not necessary (we could have implemented these two without the R-tree index, but we used the same framework in order to be consistent when comparing them with Branch and Bound). However, since it was available, it helped us improve the I/Os of the Block Nested Loop algorithm. For Divide and Conquer, we just use the R-tree to access the data points and store them in a list. Finally, the R-tree index is essential for Branch and Bound, since it is a key part of the algorithm. In the next paragraphs we present some details about the implementation of the three algorithms.

3.1 Block Nested Loop
For this algorithm, we use a skyline list (the window), which at every instance keeps the candidate skyline points. At each step, this list is compared with the current tree node (index or leaf), and only if the node is not dominated by some candidate skyline point do we visit its children. The same happens at the leaf level: if an entry in a leaf is dominated by some point in the list, we do not visit the actual data point. Otherwise, we visit the actual point and include it in the candidate skyline list. In the end, when all the necessary nodes of the tree have been traversed, the candidate skyline list contains exactly the skyline points.

When we visit actual points that are not dominated by some candidate skyline point, we first check whether there is enough memory to add them to the list. If not, we store the data on disk (as the algorithm indicates [1]). However, for our datasets (see §4) this was not necessary, since we had enough memory to keep all the candidate skyline points in memory.
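The in-memory core of the window comparison described in §2.2 and §3.1 can be sketched as follows. This is a minimal Python illustration rather than the library's actual C++ code; it assumes MIN semantics in every dimension and that the window always fits in memory, so the temporary-file path is omitted.

```python
def dominates(p, q):
    """True if p dominates q under MIN semantics in every dimension:
    p is <= q everywhere and strictly < somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def bnl_skyline(points):
    """Block Nested Loop over an in-memory list of tuples."""
    window = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue  # p is eliminated: some candidate dominates it
        # p replaces every candidate that it dominates
        window = [w for w in window if not dominates(p, w)]
        window.append(p)  # p is incomparable with the remaining candidates
    return window
```

For example, bnl_skyline([(1, 9), (9, 1), (5, 5), (6, 6)]) keeps (1, 9), (9, 1) and (5, 5), while (6, 6) is eliminated because (5, 5) dominates it.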
The implementation can be found in the src/rtree/RTree.cc file of the Spatial Index Library [2], under the name bnl.

3.2 Divide and Conquer
In order to implement this algorithm we need some necessary functions, which we describe in this section.

Function traverseTree(): This method traverses the tree and adds all the points to a list, which is kept in memory if there is enough space; otherwise the points are stored on disk.

Function findMedian(): It finds the median, mp, for a given dimension, dp.

Function partition(): It splits the data into two partitions, P1 and P2, according to the median mp found earlier by findMedian(). In the case where the first partition is empty, we find the median of another dimension, d′, and partition the data again. This way, we guarantee that no point in P1 is dominated by any point in P2.

Function skylineBasic(data): If there is only one point in the data list, that point is returned. Otherwise, we find the median of the data list and partition it into P1 and P2 (by calling partition()). Then we call skylineBasic(P1) and skylineBasic(P2) and keep the results in S1 and S2.
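The recursion of skylineBasic() can be sketched as follows — a simplified Python illustration, not the library's C++ code. It assumes MIN semantics in every dimension and, for brevity, uses a naive filtering merge instead of the optimized recursive merge of [1] described in §3.2; the function names mirror the description above but are otherwise invented for the example.

```python
import statistics

def dominates(p, q):
    """MIN semantics: p dominates q iff p <= q everywhere and < somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def merge(s1, s2):
    """Naive merge: keep the points of s2 not dominated by any point of s1.
    (The library implements the optimized recursive merge of [1] instead.)"""
    return [q for q in s2 if not any(dominates(p, q) for p in s1)]

def skyline_basic(points):
    """Basic divide and conquer over an in-memory list of tuples."""
    if len(points) <= 1:
        return list(points)
    # partition(): try each dimension until the "better" half is non-empty
    for dp in range(len(points[0])):
        mp = statistics.median(p[dp] for p in points)
        p1 = [p for p in points if p[dp] < mp]
        p2 = [p for p in points if p[dp] >= mp]
        if p1 and p2:
            s1 = skyline_basic(p1)
            s2 = skyline_basic(p2)
            return s1 + merge(s1, s2)
    # no dimension splits the set (heavy ties): fall back to pairwise filtering
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Note that the fallback branch is where the repeated repartitioning on other dimensions bottoms out when many points share the same values; duplicates of a skyline point are all kept.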
So far, we know that no point in S1 is dominated by any point in S2. This is due to the way we partition the data (see §3.2): P1 consists of the points whose dp value is less than mp, so it is the only partition that can be empty, and this happens only when the median is also the minimum in dimension dp (all the points fall in the second partition). Hence, we need to merge S1 with S2 in a way that finds the points in S2 that are not dominated by any point in S1. In the end, the union of S1 and the result of the merge gives the skyline points.

Function merge(): This is the most important phase of the algorithm. We implement it as described in [1], so as to eliminate unnecessary comparisons. First, we check whether either of the two lists S1, S2 has only one element — the trivial cases. Then, we check whether the points are 2-dimensional (here we do not mean the dimension of the initial dataset, but the number of dimensions that have not yet been partitioned). If so, we find the minimum min of S1 in a dimension d′ ≠ dp, and return all the points in S2 whose d′ value is less than min (those are the points that are incomparable with all the points in S1). If the dataset is of higher dimension (> 2), we partition both sets in some other dimension dg (with median mg) and get four sets: S1,1 and S1,2 for S1, and S2,1 and S2,2 for S2 (notice again that no point in S1,1 is dominated by any point in S1,2, etc.). Then we merge recursively: S1,1 with S2,1 (obtaining the points of S2,1 that are not dominated by some point of S1,1), S1,2 with S2,2, and finally S1,1 with the result of the last merge (that is, the points of S2,2 that are not dominated by some point of S1,2 are merged "diagonally" with S1,1). Schematically, those comparisons are presented in Figure 2 (the figure is taken from [1]). As a result, we return the union of the first and the third merge:

merge(S1,1, S2,1) ∪ merge(S1,1, merge(S1,2, S2,2))

All these methods are implemented under the same names in the src/rtree/RTree.cc file of the Spatial Index Library [2].

Figure 5: Skyline Points of the Sierpinski dataset

Figure 6: Skyline Points of the US Cities dataset

3.3 Branch and Bound
As already mentioned, the R-tree index is necessary for this algorithm. After the generation of the R-tree, if the dataset is "positive" (it includes only positive points), the origin is considered the minimum point. The minimum distances of the root entries are computed, and the heap is created by inserting them in ascending order. At each instance, we remove the top entry e of the heap and check whether e is dominated by some point in our skyline list, S. If it is, we discard it; otherwise we check whether e is an intermediate entry. If it is, we add all of its children that are not dominated by some point in S into the heap. If e is a data point, we add it into S.

If the dataset has negative points, we traverse the tree to find the minimum coordinate in every dimension, create a point that consists of all these coordinates, and compute the minimum distances to this point. Obviously, in this case the number of disk I/Os increases, since we have to go through all the data to find the minimum values. However, if we have this minimum point beforehand, we can achieve the same I/Os as if the dataset had only positive points.

The implementation of the algorithm can be found in the src/rtree/RTree.cc file of the Spatial Index Library [2], under the name bbsQuery.

3.4 Brute Force
For validation and comparison purposes, we also implemented a brute force approach for finding the skyline points. This method consists of one nested loop that compares each point with all the others; if a point is not dominated, it is added to the skyline list. As soon as a point is found to be dominated, we exit the inner loop and continue with the next point. The implementation can be found in the src/rtree/RTree.cc file of the Spatial Index Library [2], under the name bruteForce.
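The heap-based traversal of §3.3 can be illustrated with a small self-contained sketch. This is a toy Python stand-in, not the library's bbsQuery: the "tree" is a hand-built list of entries rather than a real R-tree, the entry layout is invented for the example, and a positive dataset with MIN semantics is assumed, so the origin serves as the minimum point.

```python
import heapq
import itertools
import math

def dominates(p, q):
    """MIN semantics: p dominates q iff p <= q everywhere and < somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def mindist(corner):
    """Euclidean distance from the origin (the minimum point for a
    positive dataset) to a data point or an MBR's lower-left corner."""
    return math.sqrt(sum(c * c for c in corner))

def bbs(root_entries):
    """root_entries: list of ('point', coords) or ('node', lower_left, children),
    where children have the same shape — a toy stand-in for R-tree entries."""
    counter = itertools.count()  # tie-breaker so entries are never compared
    heap = []
    for e in root_entries:
        heapq.heappush(heap, (mindist(e[1]), next(counter), e))
    skyline = []
    while heap:
        _, _, e = heapq.heappop(heap)
        kind, corner = e[0], e[1]
        if any(dominates(s, corner) for s in skyline):
            continue                 # e is dominated: discard it
        if kind == 'node':           # expand: push the non-dominated children
            for child in e[2]:
                if not any(dominates(s, child[1]) for s in skyline):
                    heapq.heappush(heap, (mindist(child[1]), next(counter), child))
        else:
            skyline.append(corner)   # a data point reached the top: keep it
    return skyline
```

Because entries are popped in ascending mindist order, a data point reaching the top of the heap is guaranteed not to be dominated by any point discovered later, which is why it can be added to the skyline immediately.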
4. DATA
We use five datasets to evaluate the three algorithms we implemented by extending the Spatial Index Library [2]. Those datasets are:

• Islands: 63,383 2-dimensional points of an island,

• Beaches: 36,298 2-dimensional coordinates of road intersections in Long Beach County, CA (includes negative points),

• Sierpinski: 531,441 2-dimensional points representing a Sierpinski triangle fractal,

• US Cities: 25,375 2-dimensional points representing the coordinates of US cities (includes negative points),

• NBA: 17,265 5-dimensional points representing statistics of NBA players.

A tuple in an n-dimensional dataset has the form: id, v1, ..., vn

5. EXPERIMENTAL ANALYSIS
In this section we analyze the results of applying all three algorithms to all available datasets. The procedure follows two steps:

• Create an R-tree index for each dataset

• Load the stored index and find the skyline points by applying the algorithms

All comparisons take place after the creation of the R-tree index.

5.1 Correctness
We first check the correctness of our implementations. We do this in two ways:

• For the 2-dimensional datasets, we plot the skyline points together with the dataset points in the same figure. By observing those figures, we can tell whether the resulting skyline points are correct or not.

• For all the datasets, we find the skyline points using the brute force method described in §3.4 and compare them with the ones that our implementations produce.

After applying the algorithms, we obtained the same skyline points for each dataset, which also coincide with the ones produced by the brute force approach. This gives us strong confidence in the correctness of our implementations.

Table 3 presents the number of skyline points found for each dataset. Furthermore, for all the 2-dimensional datasets we plot the skyline points against the dataset points in Figures 3, 4, 5, and 6, where the correctness of the results is also visible.

Dataset     Dataset Points  Skyline Points
Island      63383           467
L. Beach    36298           26
Sierpinski  531441          1
US cities   25375           40
NBA         17265           495

Table 3: Resulting Skyline Points

5.2 Disk Inputs - Outputs
As mentioned in §3.3, our implementation of BBS incurs a different number of disk I/Os depending on whether the dataset includes negative points. Hence, we split the datasets into the "positive" and the "negative" ones and compare the algorithms accordingly.

Positive set {Island, Sierpinski, NBA}: Tables 1, 4 and 6 present the I/Os of each algorithm for the Island, Sierpinski and NBA datasets respectively. Branch and Bound clearly has fewer I/Os than any other algorithm, due to the fact that BBS visits only the skyline points. The way we implement Block Nested Loop (exploiting the R-tree index) results in fewer I/Os than we would expect without the index (in that case, all the data would have to be visited in order to be compared with the skyline points). Divide and Conquer has to visit all the points, and so does the brute force approach (BF); hence they are equivalent in I/O cost (however, Divide and Conquer is much faster than BF in most cases, as we show in §5.3). It is important to state again that all the points of every dataset fit in memory, which is why Divide and Conquer has the same I/Os as BF.

Algorithm  Index I/Os  Leaves I/Os  Data I/Os  Total I/Os
BF         15          924          63383      64322
DCA        15          924          63383      64322
BNL        15          924          5191       6130
BBS        15          671          467        1153

Table 1: Island (2d) I/Os - total points: 63383, skyline points: 467

Algorithm  Index I/Os  Leaves I/Os  Data I/Os  Total I/Os
BF         108         7714         531441     539263
DCA        108         7714         531441     539263
BNL        108         7714         1          7823
BBS        50          66           1          117

Table 4: Sierpinski (2d) I/Os - total points: 531441, skyline points: 1

Negative Set {Long Beach, US Cities}: Tables 2 and 5 present the I/Os for these two datasets. Since these datasets contain negative points, BBS first traverses the tree to find the minimum in every dimension (as described in §3.3). Hence, it needs more I/Os than any other algorithm. This can easily be improved by providing the minimum point as an input to the algorithm; in that case, the disk I/Os would again be reduced to the number of skyline points, as before. Negative points do not affect the rest of the algorithms: Block Nested Loop is better than Divide and Conquer, which has the same I/Os as BF.

Algorithm  Index I/Os  Leaves I/Os  Data I/Os  Total I/Os
BF         8           526          36298      36832
DCA        8           526          36298      36832
BNL        8           526          263        797
BBS        16          601          36324      36941

Table 2: Long Beach (2d) I/Os (negative points) - total points: 36298, skyline points: 26

Algorithm  Index I/Os  Leaves I/Os  Data I/Os  Total I/Os
BF         7           374          25375      25756
DCA        7           374          25375      25756
BNL        7           374          315        696
BBS        15          511          25415      25940

Table 5: US cities (2d) I/Os - total points: 25375, skyline points: 40
5.3 Run - Time Cost
We run each algorithm three times on each dataset and compute the average running time needed to find the skyline points. Table 7 presents the results of our tests.

Algorithm  Index I/Os  Leaves I/Os  Data I/Os  Total I/Os
BF         5           253          17265      17523
DCA        5           253          17265      17523
BNL        5           253          9225       9483
BBS        5           253          495        753

Table 6: NBA (5d) I/Os - total points: 17265, skyline points: 495

Dataset     BF     DCA    BNL    BBS
Island      223    1.067  0.79   0.27
L. Beach    0.335  0.565  0.084  0.17
Sierpinski  2.3    9.534  0.93   0.045
US cities   0.625  0.415  0.065  0.13
NBA         3.7    0.58   0.57   0.21

Table 7: Algorithm run time (in sec) for each dataset

Positive set {Island, Sierpinski, NBA}: As expected, BBS has the smallest running times on these datasets (rows 1, 3, 5 in Table 7). Block Nested Loop (BNL) comes second. For Divide and Conquer (DCA), we observe that it needs more time than the brute force approach on the Sierpinski dataset. This has to do with the nature of the dataset: as Figure 5 shows, it has only one skyline point. When this point is near the beginning of the list, the inner loop of brute force rarely runs to completion after the skyline point has been found, so brute force needs just a few seconds. On the other hand, DCA needs a relatively long time. This is because many points of this dataset (see Figure 5) have the same values in different dimensions, which leads to empty partitions P1 and hence to repartitioning P2 in other dimensions (see §3.2). As a result, BF is faster than DCA for this dataset. Another thing to notice here is the huge time that BF needs in order to find the skyline of the Island dataset: 223 seconds. This happens because the skyline here has 467 points (a relatively large number), and hence the inner loop breaks less often.

Negative Set {Long Beach, US Cities}: For these cases, BNL is the fastest. BBS needs some time to traverse all the data in the R-tree in order to find the minimum point (described in §3.3); this extra time ranks BBS second among all the algorithms. What we said before about DCA and BF holds here too, since they are not affected by the "negative" points.

Different trivial cases of Divide and Conquer: So far, we have assumed that skylineBasic() recurses until the data size is equal to one. In the following experiment we increase this threshold (parameter p) from 1 to 1000 and compute the skyline of each such partition by brute force. We run the algorithm again on all the datasets. The results are presented in Figure 7 (we exclude the Sierpinski dataset from the figure since its run time is huge compared to the others; however, it shows similar behavior). We notice that the best choices for our datasets lie between p = 50 and p = 100; with such values, DCA runs faster than before.

Figure 7: Run time of DCA vs the parameter p

6. CONCLUSION
In this project, we used the Spatial Index Library framework [2] to implement three classical skyline algorithms: Block Nested Loop, Divide and Conquer, and Branch and Bound. We tested each of these algorithms on five multidimensional datasets, and we compared them in terms of disk I/Os and running time. We conclude that once the R-tree index is generated, BBS runs faster and with fewer disk I/Os than any other algorithm. For "negative" datasets, BNL exploits the R-tree index characteristics and performs better, since BBS needs to make an initial pass to find the minimum point of the set. Finally, we have seen that DCA generally runs better than the brute force approach, but still worse than the two other approaches.

7. REFERENCES
[1] Stephan Börzsönyi, Donald Kossmann, and Konrad Stocker. The skyline operator. In Proceedings of the International Conference on Data Engineering (ICDE), page 421, 2001.
[2] Marios Hadjieleftheriou, Erik Hoel, and Vassilis J. Tsotras. SaIL: A spatial index library for efficient application integration. GeoInformatica, 9(4):367-389, 2005.
[3] Donald Kossmann, Frank Ramsak, and Steffen Rost. Shooting stars in the sky: an online algorithm for skyline queries. In VLDB '02: Proceedings of the 28th International Conference on Very Large Data Bases, pages 275-286. VLDB Endowment, 2002.
[4] H. T. Kung, F. Luccio, and F. P. Preparata. On finding the maxima of a set of vectors. Journal of the ACM, 22(4):469-476, 1975.
[5] Dimitris Papadias, Yufei Tao, Greg Fu, and Bernhard Seeger. Progressive skyline computation in database systems. ACM Transactions on Database Systems, 30(1):41-82, 2005.
[6] Kian-Lee Tan, Pin-Kwang Eng, and Beng Chin Ooi. Efficient progressive skyline computation. In VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases, pages 301-310, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.