Algorithms for Distributed Functional Monitoring

Ke Yi
HKUST

Joint work with
Graham Cormode (AT&T Labs)
S. Muthukrishnan (Google Inc.)
The Story Begins with ...
The Model
 [Figure: Alice observes A(t) by time t and Bob observes B(t) by time t, each as a stream of items; Carole tries to compute f(A(t) ∪ B(t)) for all t]
 A(t), B(t): multisets
 All parties have infinite computing power
 Goal is to minimize communication
The Model
 Continuous Communication Model / Distributed Streaming Model
 [Figure: k sites, each observing its own stream of items]
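Combination of Two Models
 [Figure: a communication model (parties each holding a set of items) next to a streaming model (items arriving over time)]
 Communication model combined with streaming model: Continuous Communication Model / Distributed Streaming Model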
Other Models
 One-shot Model [Gibbons and Tirthapura, 2001]
 [Figure: two parties each observe a stream of items over time t; Carole tries to compute f(A ∪ B) in the end]
 All parties make one pass using small memory → small communication
Applied Motivation: Distributed Monitoring

 [Figure: a Network Operations Center (NOC) acts as the query site and evaluates the query Q(S1 ∪ S2 ∪ …) over bit streams arriving at remote sites S1, …, S6]

 Large-scale querying/monitoring: Inherently distributed!
     Streams physically distributed across remote sites
      E.g., stream of UDP packets through routers
 Challenge is “holistic” querying/monitoring
     Queries over the union of distributed streams Q(S1 ∪ S2 ∪ …)
     Streaming data is spread throughout the network
      Slide from the tutorial “Streaming in a connected world: Querying and tracking
      distributed data streams” at VLDB’06 and SIGMOD’07 [Cormode and Garofalakis]
Applied Motivation: Distributed Monitoring

 Traditional approach: “pull” based
    Query all nodes once in a while
    Expensive communication, most of which is wasted
    Inaccurate
 Current trend: moving towards a “push” based approach
    The remote sites alert the coordinator when something interesting happens
Theoretical Questions

Upper bounds: Worst-case communication bounds for a given f?
Lower bounds: Is there a gap in the communication complexity between the one-shot model and the continuous model?
The Frequency Moments
 Assume integer domain [n] = {1, …, n}
 Item i appears m_i times in the multiset A
 The p-th frequency moment: Fp = ∑_{i=1}^{n} (m_i)^p
 F1 is the cardinality of A
 F0 is the number of distinct items in A (define 0^0 = 0)
 F2 is
   Gini’s index of homogeneity in statistics
   the self-join size in databases
 Extensively studied since [Alon, Matias, and Szegedy, 1999]
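
The definition of the frequency moments can be checked on a small multiset. The following is a minimal sketch; the function name `frequency_moment` and the sample multiset are illustrative choices, not part of the talk.

```python
from collections import Counter

def frequency_moment(items, p):
    """Return F_p, the sum of m_i ** p over the items i that occur in `items`."""
    counts = Counter(items)
    # Only items that actually occur are stored, so every m_i >= 1 and the
    # convention 0^0 = 0 holds automatically (absent items contribute nothing).
    return sum(m ** p for m in counts.values())

A = [5, 4, 3, 1, 2, 4, 1]
print(frequency_moment(A, 0))  # F0: number of distinct items
print(frequency_moment(A, 1))  # F1: cardinality of A
print(frequency_moment(A, 2))  # F2: self-join size
```

For this multiset the counts are 1, 2, 1, 2, 1, so F0 = 5, F1 = 7, and F2 = 11.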
Approximate Monitoring
 Must trigger alarm when Fp > τ
 Must not trigger alarm when Fp < (1 − ε) τ
 [Figure: Fp plotted against time, with horizontal lines at τ and (1 − ε) τ; the alarm is raised while Fp lies between the two]
 Why approximate: Exact monitoring is expensive and unnecessary
 Why monitoring:
    Most applications only need monitoring
    Tracking can be simulated by monitoring with τ = 1+ε, (1+ε)^2, (1+ε)^3, …, so at most an O(1/ε) factor away
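
The geometric threshold sequence used to simulate tracking can be generated as follows. This is a rough sketch: the function name `tracking_thresholds` and the cap `max_value` (an assumed upper bound on the value being tracked) are hypothetical and do not come from the talk.

```python
def tracking_thresholds(max_value, eps):
    """Return (1+eps), (1+eps)^2, ... up to the first threshold >= max_value."""
    thresholds = []
    tau = 1.0 + eps
    while True:
        thresholds.append(tau)
        if tau >= max_value:
            return thresholds
        tau *= 1.0 + eps

taus = tracking_thresholds(max_value=1000, eps=0.1)
print(len(taus))  # number of monitoring instances needed to cover values up to 1000
```

With one monitoring instance per threshold, the largest threshold that has raised an alarm and the smallest that has not bracket the current value to within roughly a (1 ± ε) factor.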
Prior Work

Several papers in the database literature
  Mostly heuristic based
  Bad worst-case bounds, no lower bounds

Function   Prior bound                       New bound
F1         O((k/ε) log(τ/k)) [SIGMOD’06]     O(k log(1/ε))
F0         Õ(k^2/ε^3) [ICDE’06]              Õ(k/ε^2)
F2         Õ(k^2/ε^4) [VLDB’05]              Õ(k^2/ε + k^(3/2)/ε^3)

Õ() suppresses polylog factors
Continuous vs One-Shot

If there is a continuous monitoring algorithm that communicates X bits, then there is a one-shot algorithm that communicates O(X+k) bits
Our Results

Good news: all continuous bounds (except F2) are close to their one-shot counterparts
Bad news: all continuous bounds (except F2) are close to their one-shot counterparts
Talk Outline

Introduction
Deterministic F1 algorithm: O(k log(1/ε))
Randomized F1 algorithm: O(1/ε^2 ∙ log(1/δ))
Randomized F0 algorithm: Õ(k/ε^2)
Randomized F2 algorithm: Õ(k^2/ε + k^(3/2)/ε^3)
Conclusions
Deterministic F1 Algorithm

The first round:
 [Figure: k sites, each labeled with the threshold τ/(2k), connected to the coordinator]
 Terminates round after receiving k signals
 τ/(2k) · k = τ/2 < F1 < τ
Deterministic F1 Algorithm

The second round:
 [Figure: k sites, each labeled with the threshold τ/(4k), connected to the coordinator]
 Terminates round after receiving k signals
 3τ/4 < F1 < τ
Deterministic F1 Algorithm
 [Figure: sites labeled Δ = ετ, connected to the coordinator]
 Each round communicates O(k) bits
 Continue until Δ = ετ → O(log(1/ε)) rounds
 After the last round, we have (1 − ε)τ < F1 < τ
 Continuous: total communication O(k log(1/ε)), lower bound Ω(k log(1/(εk)))
 One-shot: O(k log(1/ε)), lower bound Ω(k log(1/(εk)))
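
The rounds above can be simulated on a stream of unit arrivals. This is a minimal sketch under assumptions not stated on the slides: each site sends one signal per `threshold` new items, counts not yet covered by a signal carry over into the next round, announcing a new threshold costs k messages, thresholds are rounded down to integers, and ετ ≥ 2k. The name `monitor_f1` is hypothetical.

```python
def monitor_f1(arrivals, k, tau, eps):
    """Simulate round-based F1 monitoring over unit arrivals.

    arrivals: iterable of site indices in range(k), one per arriving item.
    Returns (items_seen_at_alarm, messages); the first entry is None if the
    stream ends before the alarm is raised.
    """
    # Assumption of this sketch: eps * tau >= 2k, so integer thresholds stay >= 1.
    assert eps * tau >= 2 * k
    residual = [0] * k            # items at each site not yet covered by a signal
    signalled = 0                 # total covered by signals: a lower bound on F1
    threshold = tau // (2 * k)    # round 1: one signal per tau/(2k) items
    signals = 0                   # signals received in the current round
    messages = 0
    seen = 0
    for site in arrivals:
        seen += 1
        residual[site] += 1
        pending = [site]          # sites that may need to signal
        while pending:
            s = pending.pop()
            while residual[s] >= threshold:
                residual[s] -= threshold
                signalled += threshold
                signals += 1
                messages += 1
                if signals == k:              # round ends after k signals
                    gap = tau - signalled
                    if gap <= eps * tau:      # now (1 - eps) * tau <= F1 < tau
                        return seen, messages
                    threshold = gap // (2 * k)    # roughly halves each round
                    signals = 0
                    messages += k             # announce the new threshold
                    pending = list(range(k))  # carried-over counts may exceed it
    return None, messages

alarm_at, cost = monitor_f1([i % 4 for i in range(2000)], k=4, tau=1024, eps=0.125)
print(alarm_at, cost)
```

With k = 4, τ = 1024 and ε = 1/8 the per-site thresholds are 128, 64 and 32, so the alarm is raised after three rounds, once the signalled total reaches 896 = (1 − ε)τ.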
F0: # Distinct Items

Lower bound: Any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
Consider the one-shot case first
  Use “sketches”: small-space streaming algorithms
  “Combine” the sketches from the k sites
  FM sketch [Flajolet and Martin, 1985; Alon, Matias, and Szegedy, 1999]
FM Sketch
 Take a pairwise independent random hash function h : {1,…,n} → {1,…,2^d}, where 2^d > n
 For each incoming element x, compute h(x)
   e.g., h(5) = 10101100010000
   Count how many trailing zeros
   Remember the maximum number of trailing zeros in any h(x)
 Let Y be the maximum number of trailing zeros
   Can show E[2^Y] = # distinct elements
FM Sketch
 So 2^Y is an unbiased estimator for # distinct elements
 However, it has a large variance
   Some recent techniques [Gibbons and Tirthapura, 2001; Bar-Yossef, Jayram, Kumar, Sivakumar, and Trevisan, 2002] produce a good estimator that is within relative error ε with probability 1 − δ
   Space increased to Õ(1/ε^2)
 FM sketch has linearity
   Y1 from A, Y2 from B, then 2^max{Y1, Y2} estimates # distinct items in A ∪ B
 A one-shot algorithm with communication Õ(k/ε^2)
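
A single-copy version of the sketch, including the merge step, might look like the following. This is a sketch under stated assumptions rather than the construction from the cited papers: the hash (a·x + b) mod a Mersenne prime, truncated to d bits, is only approximately pairwise independent, its range is {0, …, 2^d − 1} rather than {1, …, 2^d}, and the class name `FMSketch` is hypothetical. The variance-reduction techniques are not included.

```python
import random

class FMSketch:
    """Single-copy FM sketch: remembers the maximum number of trailing zeros."""

    def __init__(self, d, seed):
        self.d = d
        rng = random.Random(seed)        # same seed => same hash function
        self.p = (1 << 61) - 1           # Mersenne prime used as the modulus
        self.a = rng.randrange(1, self.p)
        self.b = rng.randrange(self.p)
        self.y = 0                       # max trailing zeros seen so far

    def _hash(self, x):
        return ((self.a * x + self.b) % self.p) & ((1 << self.d) - 1)

    def _trailing_zeros(self, v):
        if v == 0:
            return self.d                # all d bits are zero
        return (v & -v).bit_length() - 1

    def add(self, x):
        """Insert x; return True if Y increased."""
        z = self._trailing_zeros(self._hash(x))
        if z > self.y:
            self.y = z
            return True
        return False

    def merge(self, other):
        """Combine with a sketch built from the same hash function."""
        self.y = max(self.y, other.y)

    def estimate(self):
        return 2 ** self.y

a, b, union = FMSketch(20, seed=1), FMSketch(20, seed=1), FMSketch(20, seed=1)
for x in range(1, 500):
    a.add(x); union.add(x)
for x in range(300, 900):
    b.add(x); union.add(x)
a.merge(b)
print(a.estimate(), union.estimate())  # the merged sketch matches the union sketch
```

Merging only makes sense when all sites share the hash function, which is why the three sketches here use the same seed.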
Continuously Monitoring F0

FM sketch is monotone
  Yi is non-decreasing, and Yi < log n
  Whenever Yi increases, notify the coordinator
  The coordinator can always have the up-to-date combined FM sketch
  Total communication: Õ(k/ε^2)
Lower bound: Ω(k)
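
The notification rule can be simulated with one sketch copy per site. This is a minimal sketch with assumed details: a single copy instead of Õ(1/ε^2) copies, an approximately pairwise independent hash of the form (a·x + b) mod p truncated to `D` bits, and hypothetical names `trailing_zeros` and `monitor_f0`.

```python
import random

D = 32                                   # hash output bits (assumes 2^D > n)
P = (1 << 61) - 1                        # Mersenne prime modulus
_rng = random.Random(7)
A_COEF, B_COEF = _rng.randrange(1, P), _rng.randrange(P)

def trailing_zeros(x):
    """Trailing zeros of the D-bit hash of x (D if the hash is all zeros)."""
    v = ((A_COEF * x + B_COEF) % P) & ((1 << D) - 1)
    return D if v == 0 else (v & -v).bit_length() - 1

def monitor_f0(stream, k):
    """stream: iterable of (site, item) pairs. Returns (coordinator_y, messages)."""
    local_y = [0] * k                    # Y_i held at each site
    coord_y = 0                          # combined sketch at the coordinator
    messages = 0
    for site, item in stream:
        z = trailing_zeros(item)
        if z > local_y[site]:            # Y_i increased: notify the coordinator
            local_y[site] = z
            messages += 1
            coord_y = max(coord_y, z)
    return coord_y, messages

stream = [(i % 3, x) for i, x in enumerate(range(1, 2001))]
y, msgs = monitor_f0(stream, k=3)
print(2 ** y, msgs)                      # rough F0 estimate and messages sent
```

Because each Y_i only increases and is at most D, a site sends at most D notifications per sketch copy, regardless of the stream length.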
F2: The One-Shot Case

Lower bound: Any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
Consider the one-shot case first
  Use “sketches”: small-space streaming algorithms
  “Combine” the sketches from the k sites
  AMS sketch [Alon, Matias, and Szegedy, 1999]
AMS Sketch: “Tug-of-War”
 Take a 4-wise independent random hash function h : {1,…,n} → {−1,+1}
 Compute Y = ∑ h(x) over all items x in the stream (counted with multiplicity)
 Y^2 is an unbiased estimator for F2
 Use O(1/ε^2 ∙ log(1/δ)) copies to guarantee a good estimator that is within relative error ε with probability 1 − δ
 Linearity still holds!
   One-shot case can be solved with communication Õ(k/ε^2)
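
The tug-of-war sketch and its linearity can be illustrated as follows. This is a sketch under assumptions: the ±1 hash takes the low bit of a random degree-3 polynomial modulo a Mersenne prime, which is only approximately 4-wise independent and unbiased, the copies are combined by a plain average rather than the combination that yields the (ε, δ) guarantee, and the class name `AMSSketch` is hypothetical.

```python
import random

P = (1 << 61) - 1                        # Mersenne prime modulus

class AMSSketch:
    """Several independent tug-of-war counters Y = sum of h(x) over the stream."""

    def __init__(self, copies, seed):
        rng = random.Random(seed)        # same seed => same hash functions
        self.coeffs = [[rng.randrange(P) for _ in range(4)] for _ in range(copies)]
        self.y = [0] * copies

    def _sign(self, coeffs, x):
        # Evaluate a degree-3 polynomial mod P (Horner's rule), map low bit to +-1.
        v = 0
        for c in coeffs:
            v = (v * x + c) % P
        return 1 if v & 1 else -1

    def add(self, x):
        for j, coeffs in enumerate(self.coeffs):
            self.y[j] += self._sign(coeffs, x)

    def merge(self, other):
        """Linearity: counters of the union are the sums of the counters."""
        self.y = [u + v for u, v in zip(self.y, other.y)]

    def estimate(self):
        return sum(v * v for v in self.y) / len(self.y)

a, b, union = AMSSketch(64, seed=3), AMSSketch(64, seed=3), AMSSketch(64, seed=3)
for x in [1, 2, 2, 3, 3, 3]:
    a.add(x); union.add(x)
for x in [3, 4, 4, 5]:
    b.add(x); union.add(x)
a.merge(b)
print(a.y == union.y, a.estimate())      # merged counters equal the union's counters
```

Unlike the FM sketch, the counters can move in both directions as items arrive, since each item adds +1 or −1.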
However…

Y is not monotone!
Can’t afford to send all changes of the local sketch to the coordinator
F2 Monitoring: Multi-Round Algorithm

Beginning of a round
 [Figure: each site sends a sketch of size Õ(1/ε^2) to the coordinator, which combines them into an estimate for F2]
F2 Monitoring: Multi-Round Algorithm

During a round
 [Figure: sites connected to the coordinator, which holds the estimate for F2]
 A site sends a signal whenever the F2 of the updates increases by t = (τ − F2)^2/(64 k^2 τ)
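
The signalling threshold t for a round can be evaluated with made-up numbers to see how it behaves as the estimate approaches τ. The function name `signal_threshold` and the values of τ, k and the F2 estimate below are purely illustrative.

```python
def signal_threshold(tau, f2_estimate, k):
    """Per-site signalling threshold t = (tau - F2)^2 / (64 k^2 tau)."""
    gap = tau - f2_estimate
    return gap * gap / (64 * k * k * tau)

# Hypothetical numbers: threshold tau = 10^6 and k = 10 sites.
print(signal_threshold(1_000_000, 0, 10))        # round starting from F2 = 0
print(signal_threshold(1_000_000, 500_000, 10))  # round starting halfway to tau
```

Since t depends on the square of the gap τ − F2, halving the gap divides the threshold by four (156.25 versus 39.0625 here), so sites signal more often as F2 nears τ.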
F2 Monitoring: Multi-Round Algorithm

End of a round: when k signals are received
 [Figure: sites connected to the coordinator, which holds the estimate for F2]
 old F2 + (τ − old F2) ∙ ε/k < new F2 < τ
 # rounds: O(k/ε)
 Total cost: Õ(k^2/ε^3)
F2: Round / Sub-Round Algorithm

 End of a sub-round: when k signals are received
 [Figure: each site sends a “rough” sketch of size Õ(1) to the coordinator, which combines the sketches and maintains an upper bound of F2 alongside its estimate for F2]
 old F2 + (τ − old F2) ∙ ε/k < new F2 < τ
 Total cost: Õ(k^2/ε + k^(3/2)/ε^3)
 Lower bound: Ω(k)
 One-shot: Õ(k/ε^2)
Open Problems
 Still no clear separation between the one-shot model and
  the continuous model
    F2 is an interesting case
 Many other functions f
    Statistics: entropy, heavy hitters
    Geometric measures: diameter, width, …
 Variations of the model
    One-way vs two-way communication
    Does having a broadcast channel help?
    Sliding windows?
 “Continuous Communication Complexity”?
Thank you!

				