Algorithms for Distributed Functional Monitoring
Ke Yi, HKUST
Joint work with Graham Cormode (AT&T Labs) and S. Muthukrishnan (Google Inc.)

The Story Begins with ...

The Model
  Alice observes A(t) by time t (example stream: 5 4 3 1 2 4 1)
  Bob observes B(t) by time t (example stream: 2 1 2 5 3 2)
  Carole tries to compute f(A(t) ∪ B(t)) for all t
  A(t), B(t): multisets
  All parties have infinite computing power
  Goal is to minimize communication

The Model: Continuous Communication Model / Distributed Streaming Model
  [Figure: k sites, each observing its own stream]

Combination of Two Models
  The Continuous Communication Model / Distributed Streaming Model combines the communication model with the streaming model
  [Figure: the communication model and the streaming model side by side]

Other Models
  One-shot model: Carole tries to compute f(A ∪ B) in the end
  [Gibbons and Tirthapura, 2001]: all parties make one pass using small memory and small communication

Applied Motivation: Distributed Monitoring
  [Figure: a Network Operations Center (NOC) acts as the query site and evaluates the query Q(S1 ∪ S2 ∪ …) over bit streams S1, …, S6 observed at remote sites]
  Large-scale querying/monitoring: inherently distributed!
  Streams are physically distributed across remote sites, e.g., streams of UDP packets through routers
  Challenge is "holistic" querying/monitoring
  Queries over the union of distributed streams Q(S1 ∪ S2 ∪ …)
  Streaming data is spread throughout the network
  Slide from the tutorial "Streaming in a connected world: Querying and tracking distributed data streams" at VLDB'06 and SIGMOD'07 [Cormode and Garofalakis]

Applied Motivation: Distributed Monitoring (continued)
  Traditional approach: "pull" based
    Query all nodes once in a while
    Expensive communication, most of which is wasted
    Inaccurate
  Current trend: moving towards a "push" based approach
    The remote sites alert the coordinator when something interesting happens

Theoretical Questions
  Upper bounds: what are the worst-case communication bounds for a given f?
  Lower bounds: is there a gap in the communication complexity between the one-shot model and the continuous model?

The Frequency Moments
  Assume integer domain [n] = {1, …, n}; item i appears m_i times
  The p-th frequency moment: F_p = ∑_i m_i^p
  F_1 is the cardinality of A
  F_0 is the number of unique items in A (define 0^0 = 0)
  F_2 is Gini's index of homogeneity in statistics and the self-join size in databases
  Extensively studied since [Alon, Matias, and Szegedy, 1999]

Approximate Monitoring
  Must trigger alarm when F_p > τ
  Cannot trigger alarm when F_p < (1 − ε)τ
  [Figure: F_p plotted against time, with the levels (1 − ε)τ and τ and the alarm point marked]
  Why approximate: exact monitoring is expensive and unnecessary
  Why monitoring:
    Most applications only need monitoring
    Tracking can be simulated by monitoring with τ = 1+ε, (1+ε)^2, (1+ε)^3, …, so it is at most an O(1/ε) factor away
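
To make the definitions above concrete, here is a minimal Python sketch that computes frequency moments exactly and classifies a value against the approximate-monitoring rule. It assumes the standard definition F_p = ∑_i m_i^p from Alon, Matias, and Szegedy. The function names frequency_moment and alarm_status are hypothetical, and the two input streams are the example values shown in the model figure, assumed here to be Alice's and Bob's.

```python
from collections import Counter


def frequency_moment(items, p):
    """Return F_p = sum over distinct items i of m_i ** p (m_i = multiplicity)."""
    multiplicities = Counter(items).values()
    # Only items that actually occur are summed, which realizes the
    # convention 0^0 = 0 when p = 0 (F_0 counts distinct items).
    return sum(m ** p for m in multiplicities)


def alarm_status(fp, tau, eps):
    """Classify F_p against the approximate-monitoring requirement."""
    if fp > tau:
        return "must alarm"
    if fp < (1 - eps) * tau:
        return "must not alarm"
    # Between (1 - eps) * tau and tau, either answer is acceptable.
    return "either"


# Example streams from the model figure (assumed to be A and B).
A = [5, 4, 3, 1, 2, 4, 1]
B = [2, 1, 2, 5, 3, 2]
union = A + B  # multiset union: multiplicities add up

for p in (0, 1, 2):
    print(f"F_{p} =", frequency_moment(union, p))
```

The middle zone returned as "either" is what gives a monitoring algorithm room to save communication: the coordinator only needs to know F_p up to a (1 − ε) factor near the threshold.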
Prior Work
  Several papers in the database literature
  Mostly heuristic based
  Bad worst-case bounds, no lower bounds
  Previous bound, followed by the bound in this work (Õ() suppresses polylog factors):
    F1: O(k/ε · log(τ/k)) [SIGMOD'06]; this work: O(k log(1/ε))
    F0: Õ(k^2/ε^3) [ICDE'06]; this work: Õ(k/ε^2)
    F2: Õ(k^2/ε^4) [VLDB'05]; this work: Õ(k^2/ε + k^(3/2)/ε^3)

Continuous vs One-Shot
  If there is a continuous monitoring algorithm that communicates X bits, then there is a one-shot algorithm that communicates O(X + k) bits

Our Results
  Good news: all continuous bounds (except F2) are close to their one-shot counterparts
  Bad news: all continuous bounds (except F2) are close to their one-shot counterparts

Talk Outline
  Introduction
  Deterministic F1 algorithm: O(k log(1/ε))
  Randomized F1 algorithm: O(1/ε^2 · log(1/δ))
  Randomized F0 algorithm: Õ(k/ε^2)
  Randomized F2 algorithm: Õ(k^2/ε + k^(3/2)/ε^3)
  Conclusions

Deterministic F1 Algorithm
  [Figure: k sites and a coordinator; the sites are labeled with the threshold of the current round]
  First round (threshold τ/(2k)): terminates after the coordinator receives k signals; then τ/(2k) · k = τ/2 < F1 < τ
  Second round (threshold τ/(4k)): terminates after the coordinator receives k signals; then 3τ/4 < F1 < τ
  Each round communicates O(k) bits
  Continue until Δ = ετ, which takes O(log(1/ε)) rounds
  After the last round, we have (1 − ε)τ < F1 < τ
  Total communication: O(k log(1/ε)); lower bound: Ω(k log(1/(εk)))
  One-shot: O(k log(1/ε)); lower bound: Ω(k log(1/(εk)))

F0: # Distinct Items
  Lower bound: any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
  Consider the one-shot case first
  Use "sketches": small-space streaming algorithms
  "Combine" the sketches from the k sites
  FM sketch [Flajolet and Martin, 1985; Alon, Matias, and Szegedy, 1999]

FM Sketch
  Take a pairwise independent random hash function h : {1, …, n} → {1, …, 2^d}, where 2^d > n
  For each incoming element x, compute h(x), e.g., h(5) = 10101100010000 (4 trailing zeros)
  Count the trailing zeros of h(x)
  Remember the maximum number of trailing zeros in any h(x)
  Let Y be the maximum number of trailing zeros
  Can show E[2^Y] = # distinct elements

FM Sketch (continued)
  So 2^Y is an unbiased estimator for # distinct elements
  However, it has a large variance
  Some recent techniques [Gibbons and Tirthapura, 2001; Bar-Yossef, Jayram, Kumar, Sivakumar, and Trevisan, 2002] produce a good estimator that is within relative error ε with probability 1 − δ
  Space increases to Õ(1/ε^2)
  FM sketch has linearity: with Y_1 from A and Y_2 from B, 2^max{Y_1, Y_2} estimates # distinct items in A ∪ B
  This gives a one-shot algorithm with communication Õ(k/ε^2)

Continuously Monitoring F0
  FM sketch is monotone: Y_i is non-decreasing, and Y_i < log n
  Whenever Y_i increases, notify the coordinator
  The coordinator can always have the up-to-date combined FM sketch
  Total communication: Õ(k/ε^2)
  Lower bound: Ω(k)

F2: The One-Shot Case
  Lower bound: any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
  Consider the one-shot case first
  Use "sketches": small-space streaming algorithms
  "Combine" the sketches from the k sites
  AMS sketch [Alon, Matias, and Szegedy, 1999]

AMS Sketch: "Tug-of-War"
  Take a 4-wise independent random hash function h : {1, …, n} → {−1, +1}
  Compute Y = ∑ h(x) over all x
  Y^2 is an unbiased estimator for F2
  Use O(1/ε^2 · log(1/δ)) copies to guarantee a good estimator that is within relative error ε with probability 1 − δ
  Linearity still holds!
    The one-shot case can be solved with communication Õ(k/ε^2)
  However… Y is not monotone!
    Can't afford to send all changes of the local sketch to the coordinator

F2 Monitoring: Multi-Round Algorithm
  Beginning of a round
    [Figure: each site sends a sketch of size Õ(1/ε^2) to the coordinator, which forms an estimate for F2]
  During a round
    Each site sends a signal whenever the F2 of its updates increases by t = (τ − F2)^2/(64k^2 τ)
  End of a round: when k signals are received
    old F2 + (τ − old F2) · ε/k < new F2 < τ
    # rounds: O(k/ε)
    Total cost: Õ(k^2/ε^3)

F2: Round / Sub-Round Algorithm
  End of a sub-round: when k signals are received
  [Figure: each site sends a "rough" sketch of size Õ(1); the coordinator combines the sketches to maintain an upper bound of F2, alongside its estimate for F2]
  old F2 + (τ − old F2) · ε/k < new F2 < τ
  Total cost: Õ(k^2/ε + k^(3/2)/ε^3)
  Lower bound: Ω(k)
  One-shot: Õ(k/ε^2)

Open Problems
  Still no clear separation between the one-shot model and the continuous model
    F2 is an interesting case
  Many other functions f
    Statistics: entropy, heavy hitters
    Geometric measures: diameter, width, …
  Variations of the model
    One-way vs two-way communication
    Does having a broadcast channel help?
    Sliding windows?
  "Continuous Communication Complexity"?

Thank you!
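
The round structure of the deterministic F1 algorithm from the talk can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' exact protocol. It assumes that at the end of each round the coordinator polls every site for its exact count and sets the next per-signal quantum to (τ − F1)/(2k). It also assumes ετ ≥ 2k so that each quantum exceeds one item, and it counts one message per signal plus 2k messages per poll. The name monitor_f1 and this message accounting are hypothetical.

```python
def monitor_f1(arrivals, k, tau, eps):
    """Simulate round-based threshold monitoring of F1 over k sites.

    arrivals: iterable of site indices; one item arrives at that site per step.
    Returns (f1_at_alarm, messages); f1_at_alarm is None if no alarm fired.
    """
    # Assumptions of this sketch (not taken from the slides).
    if not (0 < eps < 1) or eps * tau < 2 * k:
        raise ValueError("sketch assumes 0 < eps < 1 and eps * tau >= 2k")

    total = 0                  # true F1; the coordinator sees it only at polls
    unsent = [0.0] * k         # per-site items not yet covered by a signal
    slack = tau                # tau minus F1 as of the last poll
    quantum = slack / (2 * k)  # items a site absorbs before signalling
    signals = 0
    messages = 0

    for site in arrivals:
        total += 1
        unsent[site] += 1
        if unsent[site] >= quantum:
            unsent[site] -= quantum
            signals += 1
            messages += 1      # one signal to the coordinator
        if signals == k:
            # End of round: poll all sites (k replies), then send each site
            # the new quantum (k messages). The sum of the exact local
            # counts reported in the poll equals `total`.
            messages += 2 * k
            slack = tau - total
            if slack <= eps * tau:
                return total, messages   # alarm: F1 >= (1 - eps) * tau
            quantum = slack / (2 * k)
            unsent = [0.0] * k
            signals = 0
    return None, messages


k, tau, eps = 4, 1000, 0.1
f1, msgs = monitor_f1([i % k for i in range(2000)], k, tau, eps)
print("alarm at F1 =", f1, "after", msgs, "messages")
```

Within a round, the k signals account for exactly k · quantum = slack/2 new items, and the k residues are each smaller than one quantum, so the slack at least halves per round while F1 stays below τ. This is the source of the O(log(1/ε)) round count and of the guarantee (1 − ε)τ ≤ F1 < τ at alarm time in this sketch.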
