Maintaining Sliding Window Skylines on Data Streams


                 Yufei Tao
                     &
             Dimitris Papadias
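Skyline
• a dominates b: a is no worse than b on every dimension
  (and strictly better on at least one)
• Skyline: the set of points that are not dominated
  by any other point
Manhattan skyline
• f() is a function that is monotonically increasing on all dimensions
• The point that minimizes f always belongs to the skyline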
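The two definitions above can be illustrated with a brute-force sketch. It assumes that smaller values are better on every dimension; the sample points and the choice of `sum` as the monotone function f are made up for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: smaller is better on every dimension.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Brute-force skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical 2-dimensional data.
points = [(1, 9), (3, 4), (4, 6), (6, 2), (7, 7)]
sky = skyline(points)

# sum() increases monotonically on every dimension, so its minimizer
# must be a skyline point.
best = min(points, key=sum)
print(sky, best)
```

The quadratic scan is only meant to make the definition concrete; it is not one of the algorithms discussed in the talk.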
• Data stream
• Consider only the tuples that arrived
  in a sliding window covering the W most
  recent timestamps
• W: the window length
• A tuple r’s lifespan (r.tarr, r.texp)
  r.texp = r.tarr + W
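• Example with W = 5: the skyline output stream, where (+r, t)
  means r enters the skyline at time t and (-r, t) means r leaves it
• (+a, 1), (+b, 3), (-a, 6), (+c, 6),
  (+d, 7), (-b, 8), (+e, 9), (-c, 9),
  (-d, 9), (+f, 11)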
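As a reference point for the window semantics, here is a naive sketch that recomputes the skyline at every timestamp and reports the differences. It assumes integer timestamps, that a tuple is alive for r.tarr <= t < r.texp, and that smaller is better; the tuples a, b, c below are hypothetical and are not the data behind the example above.

```python
from typing import List, Set, Tuple

Point = Tuple[float, ...]
Arrival = Tuple[str, int, Point]  # (name, tarr, point)


def dominates(a: Point, b: Point) -> bool:
    # Assumption: smaller is better on every dimension.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_stream(arrivals: List[Arrival], W: int) -> List[Tuple[str, int]]:
    """Recompute the window skyline at each timestamp and emit the changes."""
    events: List[Tuple[str, int]] = []
    prev: Set[str] = set()
    start = min(t for _, t, _ in arrivals)
    end = max(t for _, t, _ in arrivals) + W  # last expiry time
    for now in range(start, end + 1):
        # A tuple is alive while tarr <= now < texp, with texp = tarr + W.
        alive = [(n, p) for n, t, p in arrivals if t <= now < t + W]
        sky = {n for n, p in alive if not any(dominates(q, p) for _, q in alive)}
        events += [("-" + n, now) for n in sorted(prev - sky)]
        events += [("+" + n, now) for n in sorted(sky - prev)]
        prev = sky
    return events


# Hypothetical stream: b is dominated by a on arrival but outlives it.
stream = [("a", 1, (2, 2)), ("b", 3, (3, 3)), ("c", 4, (1, 4))]
changes = skyline_stream(stream, W=5)
print(changes)
```

In this run b enters the skyline only at time 6, when a expires, which is why a dominated arrival cannot simply be thrown away.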
           Properties
• All points dominated by an incoming
  tuple r can be discarded
  (Lemma 1)
• An arriving tuple r cannot be
  discarded even if it is dominated
  by some existing tuple r’, because
  r’ expires before r and r may then
  enter the skyline
• A tuple r can appear in the skyline
  for at most a single continuous
  time interval
The architecture of our system
TABLE 1: Frequently Used Symbols
The Lazy Method
 Preprocessing module (L-PM)
  Maintenance module (L-MM)
Several implementation issues
• The structures organizing the data
  in DBsky and DBrest
• The algorithms for performing d-sided
  emptiness tests and range search
• DBrest needs to store obsolete data
  and tuples that will never appear
  in the skyline.
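A minimal sketch of the two Lazy modules may help here. It keeps DBsky and DBrest in plain dictionaries with linear scans, assumes integer timestamps and smaller-is-better dominance, and requires `maintain` to be called at every timestamp before that timestamp's arrivals. The class and method names are hypothetical, and the maintenance step scans all of DBrest rather than only the region dominated by the expired point, so this is a simplification rather than the method as presented.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Entry = Tuple[Point, int]  # (point, texp)


def dominates(a: Point, b: Point) -> bool:
    # Assumption: smaller is better on every dimension.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


class LazySkyline:
    def __init__(self, W: int) -> None:
        self.W = W
        self.db_sky: Dict[str, Entry] = {}
        self.db_rest: Dict[str, Entry] = {}
        self.out: List[Tuple[str, int]] = []

    def preprocess(self, name: str, point: Point, t: int) -> None:
        """L-PM: handle the arrival of a tuple at time t."""
        # Emptiness test: does some skyline point dominate the new tuple?
        if any(dominates(p, point) for p, _ in self.db_sky.values()):
            self.db_rest[name] = (point, t + self.W)
            return
        # Range search: skyline points dominated by the new tuple are discarded.
        for n in [n for n, (p, _) in self.db_sky.items() if dominates(point, p)]:
            del self.db_sky[n]
            self.out.append(("-" + n, t))
        self.db_sky[name] = (point, t + self.W)
        self.out.append(("+" + name, t))

    def maintain(self, now: int) -> None:
        """L-MM: runs only when a skyline point expires."""
        expired = sorted(n for n, (_, texp) in self.db_sky.items() if texp <= now)
        if not expired:
            return
        for n in expired:
            del self.db_sky[n]
            self.out.append(("-" + n, now))
        # Obsolete tuples in DBrest are purged only now.
        self.db_rest = {n: e for n, e in self.db_rest.items() if e[1] > now}
        # Candidates: DBrest tuples not dominated by the surviving skyline.
        cand = {n: e for n, e in self.db_rest.items()
                if not any(dominates(p, e[0]) for p, _ in self.db_sky.values())}
        # The subskyline of the candidates is promoted to DBsky.
        for n in sorted(cand):
            if not any(dominates(q, cand[n][0]) for q, _ in cand.values()):
                self.db_sky[n] = self.db_rest.pop(n)
                self.out.append(("+" + n, now))


# Hypothetical stream with W = 5.
lazy = LazySkyline(W=5)
arrivals = {1: [("a", (2, 2))], 3: [("b", (3, 3))], 4: [("c", (1, 4))]}
for now in range(1, 10):
    lazy.maintain(now)
    for name, point in arrivals.get(now, []):
        lazy.preprocess(name, point, now)
lazy_out = lazy.out
print(lazy_out)
```

Note how b sits in DBrest until a expires at time 6; nothing is done for it at arrival beyond the emptiness test, which is the sense in which the method is lazy.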
The Eager Method
        Eager aims at


• Minimizing the memory consumption
  by keeping only those tuples that
  may become part of the skyline
• Reducing the cost of the
  maintenance module (E-MM).
Figure: example with W = 15, showing DBsky, DBrest, and the skyline influence times
• h.tsky = 26
• i.tsky = 20
• j.tsky = 32
Event list EL
• Each event is a triple e = < e.ptr, e.t, e.tag >
• e.ptr: pointer to the tuple involved
• e.t: the event time
• e.tag: the event type
• e.tag = ‘EX’ (expire), e.t = r.texp, for a skyline tuple r
• e.tag = ‘SK’ (skyline), e.t = r.tsky, otherwise
• EL is a main-memory B-tree indexed by event times
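The event-driven idea can be sketched as follows. This is an illustration under stated assumptions, not the presented implementation: a binary heap (`heapq`) stands in for the main-memory B-tree, stale events of discarded tuples are skipped when popped instead of being deleted, DB is a dictionary scanned linearly, timestamps are integers, and smaller is better. The class and method names are hypothetical.

```python
import heapq
from typing import Dict, List, Set, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    # Assumption: smaller is better on every dimension.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


class EagerSkyline:
    def __init__(self, W: int) -> None:
        self.W = W
        self.db: Dict[str, Tuple[Point, int]] = {}  # name -> (point, texp)
        self.sky: Set[str] = set()                  # names currently in the skyline
        self.el: List[Tuple[int, str, str]] = []    # heap of (e.t, e.tag, e.ptr)
        self.out: List[Tuple[str, int]] = []

    def advance(self, now: int) -> None:
        """Process every event with e.t <= now ('EX' sorts before 'SK')."""
        while self.el and self.el[0][0] <= now:
            t, tag, name = heapq.heappop(self.el)
            if name not in self.db:
                continue  # stale event of a tuple that was discarded earlier
            if tag == "EX":
                del self.db[name]
                self.sky.discard(name)
                self.out.append(("-" + name, t))
            else:  # 'SK': every dominator has expired, the tuple enters the skyline
                self.sky.add(name)
                self.out.append(("+" + name, t))
                heapq.heappush(self.el, (self.db[name][1], "EX", name))

    def arrive(self, name: str, point: Point, t: int) -> None:
        self.advance(t)  # events due at t are handled before the arrival
        # Tuples dominated by the newcomer can never re-enter the skyline.
        for n in [n for n, (p, _) in self.db.items() if dominates(point, p)]:
            del self.db[n]
            if n in self.sky:
                self.sky.remove(n)
                self.out.append(("-" + n, t))
        texp = t + self.W
        # Skyline influence time: latest expiry among the current dominators.
        tsky = max((e for p, e in self.db.values() if dominates(p, point)),
                   default=None)
        if tsky is None:
            self.db[name] = (point, texp)
            self.sky.add(name)
            self.out.append(("+" + name, t))
            heapq.heappush(self.el, (texp, "EX", name))
        elif tsky < texp:
            self.db[name] = (point, texp)
            heapq.heappush(self.el, (tsky, "SK", name))
        # Otherwise a dominator outlives the tuple and it is not stored.


# Hypothetical stream with W = 5.
eager = EagerSkyline(W=5)
for name, t, point in [("a", 1, (2, 2)), ("b", 3, (3, 3)), ("c", 4, (1, 4))]:
    eager.arrive(name, point, t)
eager.advance(9)
eager_out = eager.out
print(eager_out)
```

All the work for b is done at its arrival (computing b.tsky = 6 and scheduling an SK event), so nothing has to be searched when a expires.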
• Lemma 2. Eager correctly produces
  the skyline output stream.
• Proof. We aim at establishing two
  facts:
• 1) all skyline changes are
  captured by Eager
• 2) every skyline change produced
  by Eager is correct.
Analytical Study
• ssky: the largest number of points
  in DBsky at any timestamp
• a: the highest tuple arrival rate
  per timestamp
• ssky <= W * a
  (in practice, ssky is much smaller)
       Analysis of Lazy
• Lemma 3. For Lazy, every tuple r
  is inserted into (removed from)
  each of DBsky and DBrest at most
  once.
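Cost components of Lazy
• Cempty: an emptiness test, performed N times in total
• Crange^lazy[i]: the i-th range search, performed narrsky times in total
• Csubsky[i]: the i-th subskyline computation, performed nskyexp times in total
• Cupdsky, Cupdrest: updating DBsky and DBrest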
 • Theorem 1. For Lazy, the amortized time
   of processing a tuple is

$$ O\left( C_{empty} + \frac{1}{N}\sum_{i=1}^{n_{arrsky}} C_{range}^{lazy}[i] + \frac{1}{N}\sum_{i=1}^{n_{skyexp}} C_{subsky}[i] + C_{updsky} + C_{updrest} \right) $$
              LL-Lazy
• Organize the data of DBsky and DBrest
  using linked lists
• Srest: the maximum number of tuples
  in DBrest when a skyline point
  expires
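• Corollary 1. For LL-Lazy, the
  amortized time of processing a
  tuple is

$$ O\left( s_{sky} + \frac{n_{skyexp}}{N} \cdot s_{rest} \cdot \left( s_{sky} + \log^{\alpha} s_{rest} \right) \right), \qquad \alpha = \begin{cases} 1 & \text{for } d = 2 \text{ or } 3 \\ d - 2 & \text{for } d \ge 4 \end{cases} $$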
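 • srest is no more than W*a, so the bound becomes

$$ O\left( s_{sky} + \frac{n_{skyexp}}{N} \cdot (W \cdot a) \cdot \left( s_{sky} + \log^{\alpha}(W \cdot a) \right) \right) $$

 • W*a is a constant independent of N
 • N >> (W*a) * log^α(W*a)
 • Lazy needs nskyexp << N to be efficient; the amortized cost is then close to O(ssky)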
              I-Lazy
• creating appropriate indexes on
  DBsky
• emptiness tests and range queries
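• Query costs on an index over n points: Qrange(n) + k for a range
  search returning k points, Qempty(n) for an emptiness test,
  Qcount(n) for a count query
• Update costs of the corresponding indexes: Urange(n), Uempty(n),
  Ucount(n)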
    • Corollary 2. For I-Lazy, the amortized
      time of processing a tuple is

$$ O\left( Q(s_{sky}) + U(s_{sky}) + \frac{n_{skyexp}}{N} \cdot s_{rest} \cdot \left( Q_{count}(s_{sky}) + \log(s_{rest}) \right) \right) $$

      where Q(ssky) = max{Qempty(ssky), Qrange(ssky)} and
      U(ssky) = max{Uempty(ssky), Urange(ssky), Ucount(ssky)}


            1
               narrsky
                              1
                                nsky exp                          
 O Cempty 
                Crange[i]  N  Csubsky[i]  Cupdsky  Cupdrest 
             N i 1
                       lazy
                                                                  
                                i 1                             
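• The i-th range search costs Qrange(ssky) + ki, where ki is the
  number of points it retrieves, and sum(ki) <= N
• Since narrsky <= N, the range-search term of Theorem 1 is at most

$$ \frac{1}{N}\left( n_{arrsky} \cdot Q_{range}(s_{sky}) + N \right) = O\left( Q_{range}(s_{sky}) \right) $$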
• Selecting different structures for
  d-sided emptiness tests, range
  search, and count queries
• priority search tree
• O-tree
• R-tree
      Analysis of Eager
• Lemma 4. Eager inserts every tuple r
  into (deletes it from) DB at most once.
  Furthermore, Eager adds to
  (removes from) the event list EL
  at most one SK and one EX event of
  r.
• Eager stores at most W*a tuples
 • Theorem 3. For Eager, the amortized time of
   processing a tuple is

$$ O\left( C_{max} + C_{range}^{eager} + C_{upddb} + \log(W \cdot a) \right) $$
• LL-Eager: the amortized cost is O(W*a)
• I-Eager: similar to I-Lazy, enhances the
  performance with auxiliary
  structures on DB that facilitate
  d-sided max and range retrievals.
            Comparing
• Lazy: efficient when ssky or nskyexp
  is small
• Eager: upper bounds the processing
  time within O(W*a) in all cases
  (the time is even shorter if
  indexes are adopted).
             R-trees
• R-tree does not have provable
  performance guarantees
• it performs reasonably well for
  real-world data

• range search
• obtaining a subskyline
• emptiness test
              Experiments
• LL-Lazy, I-Lazy, LL-Eager, I-Eager
• each node of an R-tree and a B-tree occupies
  512 bytes.
• d dimensions, range [0, 1]
• independent and anti-correlated
   Amortized Performance
• a low arrival rate of 10
  tuples/second
• d (between 2 and 4)
• W (from 200 to 3.2k seconds)
• Each stream contains 500 windows
• the total number of tuples ranges
  from 1 to 16 million
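The stream sizes follow from the settings above, assuming that each of the 500 windows spans W seconds of the stream (the symbol N for the total number of tuples is borrowed from the analysis):

```latex
N = 10\ \tfrac{\text{tuples}}{\text{s}} \times W \times 500
\qquad\Rightarrow\qquad
N = \begin{cases}
10 \times 200 \times 500 = 10^{6} & \text{for } W = 200 \\
10 \times 3200 \times 500 = 1.6 \times 10^{7} & \text{for } W = 3200
\end{cases}
```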
Figure: Average Skyline Size per Timestamp versus W (d = 3)
Figure: Average Skyline Size per Timestamp versus d (W = 800)
• Independent data sets have much smaller
  skylines than anti-correlated ones.
   Processing Time (d = 3)
Figure panels: Independent, Anti-correlated
• The indexed versions and the Lazy versions are faster
Figure: Processing cost of individual operations (independent, W = 800, d = 3). (a) LL-Lazy. (b) I-Lazy.
• Spikes: timestamps when a skyline point expires
  and Lazy invokes the expensive L-MM
• I-Lazy’s spikes are shorter, because it
  calculates a subskyline using an R-tree
Figure: (c) LL-Eager. (d) I-Eager.
• Less fluctuation
• The indexed version is always better
Figure: Anti-correlated. (a) LL-Lazy. (b) I-Lazy. (c) LL-Eager. (d) I-Eager.
 • The spikes of Lazy disappear, since it needs to
   scan a large number of tuples in DBsky for
   every arrival (due to the frequent skyline
   changes).
Figure: Amortized cost versus d (W = 800). (a) Independent. (b) Anti-correlated.
• Performance deteriorates as d increases
• LL versions: because of the increase of skyline sizes
• Indexed versions: because the effectiveness of R-trees drops with the dimensionality
            Space Consumption
Figure: Per-timestamp space overhead versus W (d = 3). (a) Independent. (b) Anti-correlated.
• Lazy stores a larger number of tuples than W * a
• A tuple r is evicted when it is in DBsky and is dominated by a subsequent
  tuple. This seldom happens because independent data has a small
  skyline and skyline changes are infrequent.

• Eager keeps only the tuples that may participate in the skyline
• The indexed versions are only slightly larger than the non-indexed versions.
Figure: (b) Anti-correlated
• Although Lazy retains more tuples than Eager, it
  actually consumes less space.
• Most tuples in anti-correlated data appear in the
  skyline
• Eager needs to maintain an event list, whose size
  is comparable to that of the data tuples
Performance under Variable Arrival Rate
• Examine the performance of the algorithms on realistic
  streams
• d = 3
• Tuples arrive in a “spiky” manner every
  30 seconds
• The time difference between two consecutive
  arrivals depends on the second within each 30-second period
• Seconds 1-29: Gaussian distribution N(0.1, 0.1)
• Second 30: N(5*10^(-5), 10^(-3))
• W = 15 s; each stream lasts 300 minutes
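A stream of this shape could be generated along the following lines. The sketch makes several assumptions that the slides do not state: the second parameter of N(., .) is treated as a standard deviation, negative gaps are clamped to zero, the distribution is chosen by the position of the previous arrival within its 30-second period, and the function name and seed are hypothetical.

```python
import random
from typing import List


def arrival_times(duration: float, seed: int = 0) -> List[float]:
    """Arrival timestamps (in seconds) for a stream with periodic bursts."""
    rng = random.Random(seed)
    t = 0.0
    times: List[float] = []
    while True:
        # The last second of every 30-second period is the burst.
        in_burst = (t % 30.0) >= 29.0
        mean, std = (5e-5, 1e-3) if in_burst else (0.1, 0.1)
        # Assumption: a negative sample means "no gap", not time travel.
        t += max(0.0, rng.gauss(mean, std))
        if t >= duration:
            return times
        times.append(t)


times = arrival_times(60.0, seed=1)
calm = sum(1 for x in times if 0.0 <= x < 1.0)      # an ordinary second
burst = sum(1 for x in times if 29.0 <= x < 30.0)   # the burst second
print(calm, burst)
```

With these parameters an ordinary second carries on the order of ten tuples while the burst second carries orders of magnitude more, which is what stresses the input buffer.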
Figure: Space overhead with time (independent, LL). (a) Buffer size (LL-Lazy). (b) Buffer size (LL-Eager).
• The buffer of LL-Lazy is small (at most 6) at all times, since
  it can usually finish handling the previous
  tuple before the next one arrives
• The buffer of LL-Eager is large because its amortized cost is
  higher than the expected arrival interval (5*10^(-5))
Figure: (c) Memory consumption (LL-Lazy). (d) Memory consumption (LL-Eager).
• Memory consumption remains high during the first 15 seconds (W = 15)
• It decreases after the 15th second
• LL-Eager keeps only the tuples that may become
  part of the skyline
Figure: Anti-correlated. (a) Buffer size (LL-Lazy). (b) Buffer size (LL-Eager). (c) Memory consumption (LL-Lazy). (d) Memory consumption (LL-Eager).
 • LL-Lazy also needs to buffer a large number of tuples at
   the end of each period, since its preprocessing module (L-PM) becomes more expensive
   due to the increased skyline size
 • LL-Eager requires a larger amount of memory for EL
Figure: Indexed versions (independent and anti-correlated)
• The buffer sizes of both algorithms are
  smaller, especially for independent data
               Summary
• Lazy requires less CPU computation than
  Eager
• Eager, however, achieves balanced
  performance in the sense that it incurs
  small processing cost for every tuple
• Eager also requires much less space for
  independent data
• The indexed versions of both frameworks
  have extremely low amortized overhead;
  hence, they can support very fast
  streams
           Future work
• Other forms of skyline retrieval; for
  example, a “top-k skyline” extracts only
  the k skyline tuples maximizing a user’s
  preference function and would need less memory
• Incorporating the skyline operator into an
  integrated system
• Skyline maintenance in settings other than
  sliding windows
• Extremely fast streams, which are meaningful in
  practice
Thank You~

				