Maintaining Sliding Window Skylines on Data Streams
Yufei Tao & Dimitris Papadias

Skyline
• a dominates b: a is no worse than b in every dimension (and, in the standard definition, strictly better in at least one)
• Skyline: the set of points that are not dominated by any other point

Manhattan skyline
• f() is a function that is monotonically increasing in every dimension
• The point at which f attains its minimum is guaranteed to be in the skyline

Data stream
• Consider only the tuples that arrived in a sliding window covering the W most recent timestamps
• W: the window length
• A tuple r's lifespan is (r.tarr, r.texp), where r.texp = r.tarr + W
• Example with W = 5: (+a, 1), (+b, 3), (-a, 6), (+c, 6), (+d, 7), (-b, 8), (+e, 9), (-c, 9), (-d, 9), (+f, 11), where (+r, t) and (-r, t) denote r entering and leaving the skyline at time t

Properties
• All points dominated by an incoming tuple r can be discarded (Lemma 1)
• An arriving tuple r cannot be discarded even if it is dominated by some existing tuple r'
• A tuple r can appear in the skyline for at most a single continuous time interval

The architecture of our system

TABLE 1: Frequently Used Symbols

The Lazy Method
• Preprocessing module (L-PM)
• Maintenance module (L-MM)

Several implementation issues
• The structures organizing the data in DBsky and DBrest
• The algorithms for performing d-sided emptiness tests and range search
• DBrest needs to store obsolete data and tuples that will never appear in the skyline

The Eager Method
Eager aims at
• Minimizing the memory consumption by keeping only those tuples that may become part of the skyline
• Reducing the cost of the maintenance module (E-MM)

[Figure: example with W = 15, showing points a to k split between DBsky and DBrest; skyline influence times h.tsky = 26, i.tsky = 20, j.tsky = 32]

Event list EL
• Each event e has the form <e.ptr, e.t, e.tag>: a pointer to the tuple involved, the event time, and the event type
• e.tag = 'EX', e.t = r.texp for a skyline tuple r (expire)
• e.tag = 'SK', e.t = r.tsky otherwise (skyline)
• EL is a main-memory B-tree indexed by event times

Lemma 2
• Eager correctly produces the skyline output stream
• Proof. We aim at establishing two facts:
• 1) all skyline changes are captured by Eager
• 2) every skyline change produced by Eager is correct
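
The Eager bookkeeping described above (lifespans, Lemma 1 pruning, and SK/EX events) can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not stated in the slides: smaller values are better in every dimension; a tuple's skyline influence time tsky is taken to be the latest expiry time among the tuples that dominate it on arrival; a binary heap stands in for the main-memory B-tree; and dominance checks are plain linear scans rather than indexed d-sided queries. The names `EagerSketch`, `arrive`, `advance`, and `demo`, as well as the sample points, are hypothetical.

```python
import heapq


def dominates(a, b):
    """a dominates b: no worse in every dimension, strictly better in one.
    Assumption: smaller coordinate values are better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


class EagerSketch:
    """Toy sliding-window skyline with an event list (not the authors' code)."""

    def __init__(self, window):
        self.window = window   # W, the window length
        self.db = {}           # name -> (point, texp); names assumed unique
        self.skyline = set()   # names of current skyline tuples
        self.events = []       # min-heap of (time, tag, name)

    def advance(self, now):
        """Fire every event with time <= now; return the skyline changes."""
        out = []
        while self.events and self.events[0][0] <= now:
            t, tag, name = heapq.heappop(self.events)
            if name not in self.db:
                continue  # stale event: the tuple was discarded (Lemma 1)
            texp = self.db[name][1]
            if tag == "SK" and texp > t:
                # The tuple's last dominator has expired: it joins the skyline.
                self.skyline.add(name)
                out.append(("+" + name, t))
                heapq.heappush(self.events, (texp, "EX", name))
            else:
                # EX event (or an SK that coincides with expiry): drop the tuple.
                if name in self.skyline:
                    self.skyline.remove(name)
                    out.append(("-" + name, t))
                del self.db[name]
        return out

    def arrive(self, name, point, now):
        """Handle the arrival of a tuple at time `now`; return skyline changes."""
        out = self.advance(now)
        # Lemma 1: everything dominated by the newcomer can be discarded.
        for other in [n for n, (p, _) in self.db.items() if dominates(point, p)]:
            if other in self.skyline:
                self.skyline.remove(other)
                out.append(("-" + other, now))
            del self.db[other]
        texp = now + self.window
        # Assumed definition of tsky: latest expiry among current dominators.
        tsky = max((e for p, e in self.db.values() if dominates(p, point)),
                   default=now)
        self.db[name] = (point, texp)
        if tsky <= now:
            self.skyline.add(name)
            out.append(("+" + name, now))
            heapq.heappush(self.events, (texp, "EX", name))
        else:
            heapq.heappush(self.events, (tsky, "SK", name))
        return out


def demo():
    """Feed five made-up 2-D tuples through a window of length 5."""
    sky = EagerSketch(window=5)
    stream = []
    arrivals = [("a", (2, 2), 1), ("b", (1, 5), 3), ("c", (3, 3), 4),
                ("d", (4, 1), 7), ("e", (1, 1), 9)]
    for name, point, t in arrivals:
        stream += sky.arrive(name, point, t)
    return stream
```

In this sketch, tuple c arrives at time 4 while dominated by a, so it is kept with an SK event at a's expiry time 6 instead of being discarded, which mirrors the second property above. Each tuple contributes at most one SK and one EX event, and events at the same timestamp fire in the order EX before SK because the heap compares tags alphabetically; that ordering is a convenience of this sketch, not something the slides specify.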
Analytical Study
• $s_{sky}$: the largest number of points in DBsky at any timestamp
• $a$: the highest tuple arrival rate per timestamp
• $s_{sky} \le W \cdot a$ (actually much smaller)

Analysis of Lazy
• Lemma 3. For Lazy, every tuple r is inserted into (removed from) each of DBsky and DBrest at most once.
• Cost components: $C_{empty}$ (incurred $N$ times in total), $C^{lazy}_{range}[i]$ ($n_{arrsky}$ in total), $C_{subsky}[i]$ ($n_{skyexp}$ in total), $C_{updsky}$, $C_{updrest}$
• Theorem 1. For Lazy, the amortized time of processing a tuple is

$$O\left(C_{empty} + \frac{1}{N}\sum_{i=1}^{n_{arrsky}} C^{lazy}_{range}[i] + \frac{1}{N}\sum_{i=1}^{n_{skyexp}} C_{subsky}[i] + C_{updsky} + C_{updrest}\right)$$

LL-Lazy
• Organize the data of DBsky and DBrest using linked lists
• $s_{rest}$: the maximum number of tuples in DBrest when a skyline point expires
• Corollary 1. For LL-Lazy, the amortized time of processing a tuple is

$$O\left(s_{sky} + \frac{n_{skyexp}}{N} \cdot s_{rest} \cdot \left(s_{sky} + \log^{\alpha} s_{rest}\right)\right), \qquad \alpha = \begin{cases} 1 & \text{for } d = 2 \text{ or } 3 \\ d - 2 & \text{for } d \ge 4 \end{cases}$$

• $s_{rest}$ is no more than $W \cdot a$, so the bound becomes

$$O\left(s_{sky} + \frac{n_{skyexp}}{N} \cdot (W \cdot a) \cdot \left(s_{sky} + \log^{\alpha}(W \cdot a)\right)\right)$$

• $W \cdot a$ is a constant independent of $N$
• $N \gg (W \cdot a) \cdot \log^{\alpha}(W \cdot a)$
• Need $n_{skyexp} \ll N$ to be efficient
• The cost is then $O(s_{sky})$

I-Lazy
• Creating appropriate indexes on DBsky
• Emptiness tests and range queries
• Query costs: $Q_{range}(n) + k$, $Q_{empty}(n)$, $Q_{count}(n)$
• Update costs: $U_{range}(n)$, $U_{empty}(n)$, $U_{count}(n)$
• Corollary 2. For I-Lazy, the amortized time of processing a tuple is

$$O\left(Q(s_{sky}) + U(s_{sky}) + \frac{n_{skyexp}}{N} \cdot s_{rest} \cdot \left(Q_{count}(s_{sky}) + \log^{\alpha}(s_{rest})\right)\right)$$

where $Q(s_{sky}) = \max\{Q_{empty}(s_{sky}), Q_{range}(s_{sky})\}$ and $U(s_{sky}) = \max\{U_{empty}(s_{sky}), U_{range}(s_{sky}), U_{count}(s_{sky})\}$
• In the bound of Theorem 1, $C^{lazy}_{range}[i] = Q_{range}(s_{sky}) + k_i$ with $\sum_i k_i \le N$, so

$$\frac{1}{N}\sum_{i=1}^{n_{arrsky}} C^{lazy}_{range}[i] \le \frac{1}{N}\left(n_{arrsky} \cdot Q_{range}(s_{sky}) + N\right) = O\left(Q_{range}(s_{sky})\right)$$

Selecting different structures for d-sided emptiness tests, range search, and count queries
• priority search tree
• O-tree
• R-tree

Analysis of Eager
• Lemma 4. Eager inserts (deletes) every tuple r into (from) DB once. Furthermore, Eager adds to (removes from) the event list EL at most one SK and one EX event of r.
• Eager stores at most $W \cdot a$ tuples
• Theorem 3. For Eager, the amortized time of processing a tuple is

$$O\left(C_{max} + C^{eager}_{range} + C_{upddb} + \log(W \cdot a)\right)$$

• LL-Eager: the $C_{max}$ and $C^{eager}_{range}$ terms give $O(W \cdot a)$
• I-Eager: similar to I-Lazy, it enhances the performance with auxiliary structures on DB that facilitate d-sided max and range retrievals

Comparing
• Lazy: efficient when $s_{sky}$ or $n_{skyexp}$ is small
• Eager: upper-bounds the processing time by $O(W \cdot a)$ in all cases (the time is even shorter if indexes are adopted)

R-trees
• The R-tree does not have provable performance guarantees
• It performs reasonably well for real-world data
• It supports range search, obtaining a subskyline, and emptiness tests

Experiments
• LL-Lazy, I-Lazy, LL-Eager, I-Eager
• Each node of an R-tree and a B-tree occupies 512 bytes
• d dimensions, each in the range [0, 1]
• Independent and anti-correlated data

Amortized Performance
• A low arrival rate of 10 tuples/second
• d between 2 and 4
• W from 200 to 3.2k seconds
• Each stream contains 500 windows
• The total number of tuples ranges from 1 to 16 million

Average Skyline Size per Timestamp
[Figures: average skyline size per timestamp versus W (d = 3), and versus d (W = 800)]
• Independent data sets have much smaller skylines than anti-correlated ones

Processing Time (d = 3)
[Figure panels: Independent; Anti-correlated]
• The indexed versions and the Lazy versions perform better

Processing cost at individual operations (independent, W = 800, d = 3)
[Figure panels: (a) LL-Lazy. (b) I-Lazy. (c) LL-Eager. (d) I-Eager.]
• Spikes in (a) and (b): timestamps when a skyline point expires and Lazy invokes the expensive L-MM
• I-Lazy's spikes are shorter, because it calculates a subskyline using an R-tree
• Eager, (c) and (d), shows less fluctuation
• The indexed versions are always better

Processing cost at individual operations (anti-correlated)
[Figure panels: (a) LL-Lazy. (b) I-Lazy. (c) LL-Eager. (d) I-Eager.]
• The spikes of Lazy disappear, since it needs to scan a large number of tuples in DBsky for every arrival (due to the frequent skyline changes)
Amortized cost versus d (W = 800)
[Figure panels: (a) Independent. (b) Anti-correlated.]
• Performance deteriorates as d grows
• LL versions: because of the increase in skyline sizes
• Indexed versions: because the effectiveness of R-trees drops with the dimensionality

Space Consumption
Per-timestamp space overhead versus W (d = 3)
[Figure panels: (a) Independent. (b) Anti-correlated.]

(a) Independent
• Lazy stores a larger number of tuples than W * a
• A tuple r is evicted when r is in DBsky and is dominated by a subsequent tuple. This seldom happens, because Independent has a small skyline and skyline changes are infrequent.
• Eager keeps only the tuples that participate in the skyline
• The indexed versions are only slightly larger than the non-indexed versions

(b) Anti-correlated
• Although Lazy retains more tuples than Eager, it actually consumes less space
• Most data in anti-correlated sets will appear in the skyline
• Eager needs to maintain an event list, whose size is comparable to that of the data tuples

Performance under Variable Arrival Rate
• Examine the performance of the algorithms on realistic streams
• d = 3
• Tuples arrive in a "spiky" manner every 30 seconds
• Time difference between two consecutive arrivals:
• seconds 1-29: Gaussian distribution N(0.1, 0.1)
• second 30: N(5*10^(-5), 10^(-3))
• W = 15 s, 300 minutes

Space overhead with time (independent, LL)
[Figure panels: (a) Buffer size (LL-Lazy). (b) Buffer size (LL-Eager).]
• The buffer of LL-Lazy is small (at most 6) at all times, since it can usually finish handling the previous tuple before the next one arrives
• The buffer of LL-Eager is large, because its amortized cost is higher than the expected arrival interval (5*10^(-5))
[Figure panels: (c) Memory consumption (LL-Lazy). (d) Memory consumption (LL-Eager).]
• Memory consumption remains high during the first 15 seconds (W = 15)
• It decreases after the 15th second
• LL-Eager keeps only the tuples that may become part of the skyline

Space overhead with time (anti-correlated)
(a) Buffer size (LL-Lazy). (b) Buffer size (LL-Eager).
(c) Memory consumption (LL-Lazy). (d) Memory consumption (LL-Eager).
• LL-Lazy also needs to buffer a large number of tuples at the end of each period, since its PM becomes more expensive due to the increased skyline size
• LL-Eager requires a larger amount of memory for EL

Indexed versions: independent and anti-correlated
• The buffer sizes of both algorithms are smaller
• Especially for Independent

Summary
• Lazy requires less CPU computation than Eager
• Eager, however, achieves balanced performance in the sense that it incurs a small processing cost for every tuple
• Eager requires much less space for independent data
• The indexed versions of both frameworks have extremely low amortized overhead; hence, they can support very fast streams

Future work
• Other forms of skyline retrieval
• For example, a “top-k skyline” extracts only the k skyline tuples maximizing a user's preference function, which needs less memory
• Incorporate the skyline operator into an integrated system
• Skyline maintenance under models other than the sliding window
• Extremely fast streams, which are meaningful in practice

Thank You