					Data Stream Algorithms
Intro, Sampling, Entropy


      Graham Cormode
      graham@research.att.com
Outline

   Introduction to Data Streams
    –   Motivating examples and applications
    –   Data Streaming models
    –   Basic tail bounds
   Sampling from data streams
   Sampling to estimate entropy




Data is Massive

       Data is growing faster than our ability to store or
        index it
       There are 3 Billion Telephone Calls in US each day,
        30 Billion emails daily, 1 Billion SMS, IMs.
       Scientific data: NASA's observation satellites each
        generate billions of readings per day.
       IP Network Traffic: up to 1 Billion packets per hour
        per router. Each ISP has many (hundreds of) routers!
       Whole genome sequences for many species now
        available: each megabytes to gigabytes in size

Massive Data Analysis
    Must analyze this massive data:
     Scientific research (monitor environment, species)
     System management (spot faults, drops, failures)
     Customer research (association rules, new offers)
     Revenue protection (phone fraud, service abuse)
    Else, why even measure this data?




Example: Network Data




   Networks are sources of massive data: the metadata per
    hour per router is gigabytes
   Fundamental problem of data stream analysis:
    Too much information to store or transmit
   So process data as it arrives: one pass, small space: the
    data stream approach.
   Approximate answers to many questions are OK, if there
    are guarantees of result quality
      IP Network Monitoring Application

   [Figure: SNMP/RMON and NetFlow records flow from routers (peer links,
    enterprise FR/ATM/IP VPN networks, DSL/cable broadband Internet access,
    Voice over IP, PSTN) across a converged IP/MPLS core to the Network
    Operations Center (NOC)]

   Example NetFlow IP Session Data:

      Source     Destination  Duration  Bytes  Protocol
      10.1.0.2   16.2.3.7     12        20K    http
      18.6.7.1   12.4.0.3     16        24K    http
      13.9.4.3   11.6.8.2     15        20K    http
      15.2.2.9   17.1.2.1     19        40K    http
      12.4.3.8   14.8.7.4     26        58K    http
      10.5.1.3   13.0.0.1     27        100K   ftp
      11.1.0.6   10.3.4.5     32        300K   ftp
      19.7.1.2   16.5.5.8     18        80K    ftp
       24x7 IP packet/flow data-streams at network elements
       Truly massive streams arriving at rapid rates
         –   AT&T/Sprint collect ~1 Terabyte of NetFlow data each day
       Often shipped off-site to data warehouse for off-line analysis
    Packet-Level Data Streams

   Single 2Gb/sec link; say avg packet size is 50 bytes
   Number of packets/sec = 5 million
   Time per packet = 0.2 microsec
   If we only capture header information per packet (src/dest IP,
    time, no. of bytes, etc.) – at least 10 bytes:
    – Space per second is 50MB
    – Space per day is 4.5TB per link
    – ISPs typically have hundreds of links!
   Analyzing packet content streams is order(s) of magnitude
    harder
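
   These figures are easy to reproduce; a back-of-envelope check in Python
   (link rate, packet size and header size are the assumptions stated above):

       # Back-of-envelope arithmetic for the packet-rate figures above
       link_rate_bits = 2 * 10**9            # 2Gb/sec link
       packet_bytes = 50                     # assumed average packet size
       header_bytes = 10                     # captured header info per packet

       packets_per_sec = link_rate_bits / 8 / packet_bytes
       print(packets_per_sec)                        # 5,000,000 packets/sec
       print(1.0 / packets_per_sec)                  # 0.2 microseconds per packet

       header_space_per_sec = packets_per_sec * header_bytes
       print(header_space_per_sec / 10**6)           # ~50 MB per second
       print(header_space_per_sec * 86400 / 10**12)  # ~4.3 TB per day per link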

Network Monitoring Queries

   [Figure: routers R1, R2, R3 feed the Network Operations Center (NOC) and
    a back-end data warehouse DBMS (Oracle, DB2); off-line analysis is slow
    and expensive]

   Example queries:
    – What are the top (most frequent) 1000 (source, dest) pairs
      seen over the last month?
    – How many distinct (source, dest) pairs have been seen by
      both R1 and R2 but not R3?                (Set-Expression Query)
    – SELECT COUNT (R1.source, R2.dest)
      FROM R1, R2
      WHERE R1.dest = R2.source                 (SQL Join Query)
Streaming Data Questions

   Network managers ask questions requiring us to
    analyze the data:
    – How many distinct addresses seen on the network?
    – Which destinations or groups use most bandwidth?
    – Find hosts with similar usage patterns?
   Extra complexity comes from limited space and time
   Will introduce solutions for these and other problems




Other Streaming Applications


        Sensor networks
         – Monitor habitat and environmental parameters
         – Track many objects, intrusions, trend analysis…




        Utility Companies
         – Monitor power grid, customer usage patterns etc.
         – Alerts and rapid response in case of problems



Streams Defining Frequency Dbns.

    We will consider streams that define frequency
     distributions
     –   E.g. frequency of packets from source A to source B
    This simple setting captures many of the core algorithmic
     problems in data streaming
     – How many distinct (non-zero) values seen?
     – What is the entropy of the frequency distribution?
     – What (and where) are the highest frequencies?
    More generally, can consider streams that define multi-
     dimensional distributions, graphs, geometric data etc.
    But even for frequency distributions, several models are
     relevant
Data Stream Models

    We model data streams as sequences of simple tuples
    Complexity arises from massive length of streams
    Arrivals only streams:
      – Example: (x, 3), (y, 2), (x, 2) encodes
        the arrival of 3 copies of item x,
        2 copies of y, then 2 copies of x.
      – Could represent e.g. packets on a network; power usage
    Arrivals and departures:
      – Example: (x, 3), (y, 2), (x, -2) encodes
        final state of (x, 1), (y, 2).
      – Can represent fluctuating quantities, or measure
        differences between two distributions
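
    To make the two models concrete, a minimal Python sketch that
    materializes the frequency distribution defined by a stream of
    (item, count) tuples; exact counting is shown purely for illustration,
    since avoiding this dictionary is the whole point of streaming algorithms:

        from collections import defaultdict

        def frequencies(stream):
            # Arrivals only: counts all positive; with departures, may be negative
            freq = defaultdict(int)
            for item, count in stream:
                freq[item] += count
            return dict(freq)

        print(frequencies([("x", 3), ("y", 2), ("x", 2)]))    # {'x': 5, 'y': 2}
        print(frequencies([("x", 3), ("y", 2), ("x", -2)]))   # {'x': 1, 'y': 2}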
Approximation and Randomization

    Many things are hard to compute exactly over a stream
     – Is the count of all items the same in two different streams?
     – Requires linear space to compute exactly
    Approximation: find an answer correct within some factor
     – Find an answer that is within 10% of correct result
     – More generally, a (1±ε) factor approximation
    Randomization: allow a small probability of failure
     – Answer is correct, except with probability 1 in 10,000
     – More generally, success probability (1-δ)
    Approximation and Randomization: (ε, δ)-approximations


  Basic Tools: Tail Inequalities
      General bounds on tail probability of a random variable
       (probability that a random variable deviates far from its
       expectation)

       [Figure: a probability distribution, with the tail probability
        shaded beyond (1+ε)μ]

      Basic Inequalities: Let X be a random variable with
       expectation μ and variance Var[X]. Then, for any ε > 0:

       Markov:      Pr[X ≥ (1+ε)μ] ≤ 1/(1+ε)
       Chebyshev:   Pr[|X − μ| ≥ εμ] ≤ Var[X]/(ε²μ²)
Tail Bounds

Markov Inequality:
For a random variable Y that takes only non-negative values:
                          Pr[Y ≥ k] ≤ E(Y)/k
(This will be < 1 only for k > E(Y))
Chebyshev’s Inequality:
For any random variable Y:
                   Pr[|Y − E(Y)| ≥ k] ≤ Var(Y)/k²
Proof: Set X = (Y − E(Y))²
    E(X) = E(Y² + E(Y)² − 2Y·E(Y)) = E(Y²) + E(Y)² − 2E(Y)² = Var(Y)
    So:     Pr[|Y − E(Y)| ≥ k] = Pr[(Y − E(Y))² ≥ k²]
    Using Markov:              ≤ E[(Y − E(Y))²]/k² = Var(Y)/k²
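
A quick empirical sanity check of both bounds, taking Y ~ Binomial(100, ½)
so that E(Y) = 50 and Var(Y) = 25 (the simulated tail probabilities land
well under the bounds):

    import random

    trials = 20000
    ys = [sum(random.randint(0, 1) for _ in range(100)) for _ in range(trials)]

    # Markov: Pr[Y >= 60] <= E(Y)/60 = 50/60
    print(sum(y >= 60 for y in ys) / trials, "<=", 50 / 60)
    # Chebyshev: Pr[|Y - 50| >= 10] <= Var(Y)/10^2 = 25/100
    print(sum(abs(y - 50) >= 10 for y in ys) / trials, "<=", 25 / 100)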
Outline

    Introduction to Data Streams
     –   Motivating examples and applications
     –   Data Streaming models
     –   Basic tail bounds
    Sampling from data streams
    Sampling to estimate entropy




Sampling From a Data Stream




    Fundamental problem: sample m items uniformly from the stream
     –   Useful: approximate costly computation on small sample
    Challenge: don’t know how long stream is
     –   So when/how often to sample?
    Two solutions, apply to different situations:
     – Reservoir sampling (dates from 1980s?)
     – Min-wise sampling (dates from 1990s?)

Reservoir Sampling




    Sample first m items
    Choose to sample the i’th item (i>m) with probability m/i
    If sampled, randomly replace a previously sampled item

    Optimization: when i gets large, compute which item will
     be sampled next, skip over intervening items. [Vitter 85]
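
    A minimal Python sketch of the basic algorithm (without the skipping
    optimization):

        import random

        def reservoir_sample(stream, m):
            sample = []
            for i, item in enumerate(stream, start=1):
                if i <= m:
                    sample.append(item)              # keep the first m items
                elif random.random() < m / i:
                    # sample the i'th item with probability m/i, replacing
                    # a uniformly random previously sampled item
                    sample[random.randrange(m)] = item
            return sample

        print(reservoir_sample(range(100000), 5))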


Reservoir Sampling - Analysis

    Analyze simple case: sample size m = 1
    Probability i’th item is the sample from stream length n:
     –   Prob. i is sampled on arrival × prob. i survives to end:

         (1/i) × (i/(i+1)) × ((i+1)/(i+2)) × … × ((n-2)/(n-1)) × ((n-1)/n) = 1/n

    Case for m > 1 is similar, easy to show uniform probability
    Drawbacks of reservoir sampling: hard to parallelize


Min-wise Sampling

    For each item, pick a random fraction between 0 and 1
    Store item(s) with the smallest random tag [Nath et al.’04]



        0.391    0.908   0.291            0.555             0.619   0.273




    Each item has same chance of least tag, so uniform
    Can run on multiple streams separately, then merge
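
    A minimal sketch of min-wise sampling for sample size one, illustrating
    the merge property:

        import random

        def minwise_sample(stream):
            # tag every item with a uniform random number, keep smallest tag
            return min((random.random(), item) for item in stream)

        def merge(*samples):
            return min(samples)    # smallest tag wins across streams

        s1 = minwise_sample(["a", "b", "c"])
        s2 = minwise_sample(["d", "e"])
        print(merge(s1, s2)[1])    # a uniform sample from the streams' union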

Sampling Exercises

    What happens when each item in the stream also has a
     weight attached, and we want to sample based on
     these weights?
     1. Generalize the reservoir sampling algorithm to draw a
        single sample in the weighted case.
     2. Generalize reservoir sampling to sample multiple
        weighted items, and show an example where it fails to
        give a meaningful answer.
     3. Research problem: design new streaming algorithms for
        sampling in the weighted case, and analyze their
        properties.



Outline

    Introduction to Data Streams
     –   Motivating examples and applications
     –   Data Streaming models
     –   Basic tail bounds
    Sampling from data streams
    Sampling to estimate entropy




Application of Sampling: Entropy

    Given a long sequence of characters
             S = <a1, a2, a3, …, am>        each aj ∈ {1…n}
    Let fi = frequency of i in the sequence
    Compute the empirical entropy:
                H(S) = -Σi (fi/m) log(fi/m) = -Σi pi log pi

    Example: S = <a, b, a, b, c, a, d, a>
     – pa = 1/2, pb = 1/4, pc = 1/8, pd = 1/8
     – H(S) = ½·1 + ¼·2 + ⅛·3 + ⅛·3 = 7/4
    Entropy promoted for anomaly detection in networks
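
    The example is easy to check numerically (logs base 2):

        from collections import Counter
        from math import log2

        def entropy(stream):
            m = len(stream)
            return -sum((f / m) * log2(f / m) for f in Counter(stream).values())

        print(entropy(["a", "b", "a", "b", "c", "a", "d", "a"]))   # 1.75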


Challenge

    Goal: approximate H(S) in space sublinear
     (poly-log) in m (stream length), n (alphabet size)
     –   (ε,δ) approx: answer is (1±ε)H(S) w/prob 1-δ
    Easy if we have O(n) space: compute each fi exactly
    More challenging if n is huge, m is huge, and we have
     only one pass over the input in order
     –   (The data stream model)




Sampling Based Algorithm

    Simple estimator:
     – Randomly sample a position j in the stream
     – Count how many times aj appears subsequently: r
     – Output X = -(r log(r/m) – (r-1) log((r-1)/m))

    Claim: Estimator is unbiased – E[X] = H(S)
     –   Proof: prob. of picking j = 1/m, sum telescopes correctly
    Variance of estimate is not too large – Var[X] = O(log² m)
     – Observe that |X| ≤ log m
     – Var[X] = E[(X – E[X])²] < (max(X) – min(X))² = O(log² m)
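
    A sketch of the estimator in Python (shown with a stored list for
    simplicity; the streaming version samples j online and counts r from
    that point on):

        import random
        from math import log2

        def basic_estimate(stream):
            m = len(stream)
            j = random.randrange(m)            # sample a position uniformly
            r = stream[j:].count(stream[j])    # occurrences from j onwards
            x = r * log2(m / r)                # X = r log(m/r) - (r-1) log(m/(r-1))
            if r > 1:
                x -= (r - 1) * log2(m / (r - 1))
            return x

        s = list("abababca")
        # averaging many independent copies of X approaches H(S)
        print(sum(basic_estimate(s) for _ in range(20000)) / 20000)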



Analysis of Basic Estimator

    A general technique in data streams:
     – Repeat in parallel an unbiased estimator with bounded
       variance, take average of estimates to improve variance
     – Var[ 1/k (Y1 + Y2 + ... + Yk) ] = 1/k Var[Y]

                 Pr[|X − μ| ≥ εμ] ≤ Var[X]/(ε²μ²)

     – By Chebyshev, need k = O(Var[X]/(ε²E²[X])) repetitions
     – For entropy, this means space O(log²m/(ε²H²(S)))

    Problem for entropy: when H(S) is very small?
     –   Space needed for an accurate approx goes as 1/H²!
Low Entropy

    But... what does a low entropy stream look like?
     –   aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabaaaaa
    Very boring most of the time; we are only rarely surprised
    Can there be two frequent items?
     – aabababababababaababababbababababababa
     – No! That’s high entropy (≈ 1 bit / character)

    Only way to get H(S) = o(1) is to have only one character
     with pi close to 1




Removing the frequent character

    Write entropy as
     – -pa log pa + (1-pa) H(S’)
     – Where S’ = stream S with all ‘a’s removed

    Can show:
     – Doesn’t matter if H(S’) is small: as pa is large, additive
       error on H(S’) ensures relative error on (1-pa)H(S’)
     – Relative error ε(1-pa) on pa gives relative error on pa log pa
     – Summing both (positive) terms gives relative error overall




Finding the frequent character

    Ejecting a is easy if we know in advance what it is
     –   Can then compute pa exactly
    Can find a online deterministically
     – Assume pa > 2/3 (if not, H(S) > 0.9, and original alg works)
     – Run a ‘heavy hitters’ algorithm on the stream (see later)
     – Modify analysis: find a and pa ± ε(1-pa)

    But... how to also compute H(S’) simultaneously if we
     don’t know a from the start... do we need two passes?




Always have a back up plan...

    Idea: keep two samples to build our estimator
     –   If at the end one of our samples is ‘a’, use the other
     –   How to do this and ensure uniform sampling?
    Pick first sample with ‘min-wise sampling’:
    At end of the stream, if the sampled character = ‘a’, we
     want to sample from the stream ignoring all ‘a’s
    This is just “the character achieving the smallest label
     distinct from the one that achieves the smallest label”
    Can track information to do this in a single pass,
     constant space


Sampling Two Tokens
Stream:   C      A      A      B      B      A      B      D      C      A      B      A
Tags:     0.408  0.815  0.217  0.191  0.770  0.082  0.366  0.228  0.549  0.173  0.627  0.202
Repeats:                       B      B      A      B                    A      B      A

   The min tag (0.082) belongs to token A. The second smallest tag (0.173)
   also belongs to A – same token as the min tag, so we don’t want it; the
   min tag amongst the remaining (non-A) tokens is 0.191, on token B.

   Assign tags, choose first token as before
   Delete all occurrences of first token
   Choose token with min remaining tag; count repeats
   Implementation: keep track of two triples
     (min tag, corresponding token, number of repeats)
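
   A possible one-pass, constant-space implementation of the two-triple
   tracking, as a sketch (tags drawn uniformly via Python's random):

       import random

       def sample_two(stream):
           first = None    # (tag, token, repeats) for the min tag overall
           second = None   # ditto, restricted to tokens != first's token
           for token in stream:
               tag = random.random()
               if first and token == first[1]:   # count repeats of sampled tokens
                   first = (first[0], first[1], first[2] + 1)
               if second and token == second[1]:
                   second = (second[0], second[1], second[2] + 1)
               if first is None or tag < first[0]:          # new overall minimum
                   if first and first[1] != token:
                       second = first                       # old min is runner-up
                   first = (tag, token, 1)
               elif token != first[1] and (second is None or tag < second[0]):
                   second = (tag, token, 1)
           return first, second

       print(sample_two("CAABBABDCABA"))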
Putting it all together

    Can combine all these pieces
    Build an estimator based on tracking this information,
     deciding whether there is a frequent character or not
    A more involved Chernoff bounds argument improves the
     number of repetitions of the estimator from O(ε⁻²Var[X]/E²[X])
     to O(ε⁻²Range[X]/E[X]) = O(ε⁻² log m)

    In O(ε⁻² log m log 1/δ) space we can compute an
     (ε,δ) approximation to H(S) in a single pass




Entropy Exercises

    As a subroutine, we need to find an element that occurs
     more than 2/3 of the time and estimate its weight
     1.   How can we find a frequently occurring item?
     2.   How can we estimate its weight p with ε(1-p) error?
     3.   Our algorithm uses O(ε⁻² log m log 1/δ) space; could this
          be improved, or is it optimal (lower bounds)?
     4.   Our algorithm updates each sampled pair for every
          update; how quickly can we implement it?
     5.   (Research problem) What if there are multiple distributed
          streams and we want to compute the entropy of their
          union?


Outline

    Introduction to Data Streams
     –   Motivating examples and applications
     –   Data Streaming models
     –   Basic tail bounds
    Sampling from data streams
    Sampling to estimate entropy




Data Stream Algorithms
 Frequency Moments


     Graham Cormode
     graham@research.att.com
Frequency Moments

    Introduction to Frequency Moments and Sketches
    Count-Min sketch for F∞ and frequent items
    AMS Sketch for F2
    Estimating F0
    Extensions:
     – Higher frequency moments
     – Combined frequency moments




Last Time

    Introduced data streams and data stream models
     –   Focus on a stream defining a frequency distribution
    Sampling to draw a uniform sample from the stream
    Entropy estimation: based on sampling




This Time: Frequency Moments

    Given a stream of updates, let fi be the number of times
     that item i is seen in the stream
    Define Fk of the stream as Σi (fi)^k – the k’th Frequency
     Moment
    “Space Complexity of the Frequency Moments” by Alon,
     Matias, Szegedy in STOC 1996 studied this problem
     – Awarded the Gödel Prize in 2005
     – Set the pattern for many streaming algorithms to follow
     – Frequency moments are at the core of many streaming
       problems



Frequency Moments

    F0 : count 1 if fi ≠ 0 – number of distinct items
    F1 : length of stream, easy
    F2 : sum the squares of the frequencies – self join size
    Fk : related to statistical moments of the distribution
    F∞ : dominated by the largest fi, finds the largest
     frequency

    Different techniques needed for each one.
     –   Mostly sketch techniques, which compute a certain kind of
         random linear projection of the stream


Sketches

    Not every problem can be solved with sampling
     – Example: counting how many distinct items in the stream
     – If a large fraction of items aren’t sampled, don’t know if
       they are all same or all different
    Other techniques take advantage that the algorithm can
     “see” all the data even if it can’t “remember” it all
    (To me) a sketch is a linear transform of the input
     –   Model stream as defining a vector; sketch is the result of
         multiplying the stream vector by an (implicit) matrix

    [Figure: sketch matrix × stream vector = linear projection]
Trivial Example of a Sketch
              1 0 1 1 1 0 1 0 1 …

                   1 0 1 1 0 0 1 0 1 …

    Test if two (asynchronous) binary streams are equal
                 d=(x,y) = 0 iff x=y, 1 otherwise
    To test in small space: pick a random hash function h
    Test h(x)=h(y) : small chance of false positive, no chance
     of false negative.
    Compute h(x), h(y) incrementally as new bits arrive
     (Karp-Rabin: h(x) = Σi xi·2^i mod p for random prime p)
     –   Exercise: extend to real valued vectors in update model
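
    A sketch of the incremental computation (the prime p is fixed here for
    brevity; it should be chosen at random):

        P = 2**31 - 1    # a prime modulus

        class StreamHash:
            def __init__(self):
                self.h, self.power = 0, 1
            def push(self, bit):             # h(x) = sum_i x_i * 2^i mod P
                self.h = (self.h + bit * self.power) % P
                self.power = (self.power * 2) % P

        hx, hy = StreamHash(), StreamHash()
        for bx, by in zip([1, 0, 1, 1, 1, 0, 1, 0, 1],
                          [1, 0, 1, 1, 0, 0, 1, 0, 1]):
            hx.push(bx); hy.push(by)
        print(hx.h == hy.h)   # False: streams differ (no false negatives)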
Frequency Moments

    Introduction to Frequency Moments and Sketches
    Count-Min sketch for F∞ and frequent items
    AMS Sketch for F2
    Estimating F0
    Extensions:
     – Higher frequency moments
     – Combined frequency moments




Count-Min Sketch

    Simple sketch idea that can be used as the basis of many
     different stream mining tasks
    Model input stream as a vector x of dimension U
    Creates a small summary as an array of w × d counters
    Uses d hash functions to map vector entries to [1..w]
    Works on arrivals only and arrivals & departures streams

    [Figure: array CM[i,j] of d rows by w columns]
CM Sketch Structure
    [Figure: update (j,+c) is mapped by hash functions h1…hd to one
     bucket per row, each incremented by c;
     d = log 1/δ rows, w = 2/ε columns]

    Each entry in vector x is mapped to one bucket per row.
    Merge two sketches by entry-wise summation
    Estimate x[j] by taking mink CM[k,hk(j)]
     – Guarantees error less than εF1 in size O(1/ε log 1/δ)
     – Probability of larger error is less than δ
                                                  [C, Muthukrishnan ’04]
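
A minimal Count-Min sketch for arrivals-only streams might look as follows
(Python's seeded hash stands in for the pairwise independent hash functions):

    import random

    class CountMin:
        def __init__(self, w, d):                 # w = 2/eps, d = log 1/delta
            self.w, self.d = w, d
            self.counts = [[0] * w for _ in range(d)]
            self.seeds = [random.random() for _ in range(d)]

        def _bucket(self, k, j):
            return hash((self.seeds[k], j)) % self.w

        def update(self, j, c=1):                 # process update (j, +c)
            for k in range(self.d):
                self.counts[k][self._bucket(k, j)] += c

        def estimate(self, j):                    # x[j] <= est < x[j] + eps*F1 whp
            return min(self.counts[k][self._bucket(k, j)] for k in range(self.d))

    cm = CountMin(w=200, d=5)
    for item in [1, 2, 1, 3, 1, 2]:
        cm.update(item)
    print(cm.estimate(1))    # at least 3, and probably exactly 3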
Approximation
Approximate x’[j] = mink CM[k,hk(j)]
 Analysis: In k’th row, CM[k,hk(j)] = x[j] + Xk,j
     –   Xk,j = Σ x[i] over i ≠ j with hk(i) = hk(j)
     –   E(Xk,j)   = Σi≠j x[i]·Pr[hk(i)=hk(j)]
                   ≤ Pr[hk(i)=hk(j)] · Σi x[i]
                   = εF1/2   by pairwise independence of hk
     –   Pr[Xk,j ≥ εF1] = Pr[Xk,j ≥ 2E(Xk,j)] ≤ 1/2 by Markov inequality
    So, Pr[x’[j] ≥ x[j] + εF1] = Pr[∀k: Xk,j > εF1] ≤ 1/2^(log 1/δ) = δ
    Final result: with certainty x[j] ≤ x’[j], and
     with probability at least 1-δ, x’[j] < x[j] + εF1


Applications of CM to F∞

    CM sketch lets us estimate fi for any i
    F∞ asks to find maxi fi
    Slow way: test every i after creating the sketch
    Faster way: test each i as it is seen in the stream, and
     remember the largest estimated value
    Alternate way:
     – keep a binary tree over the domain of input items, where
       each node corresponds to a subset
     – keep sketches of all nodes at same level
     – descend tree to find large frequencies, discarding
       branches with low frequency

Count-Min Exercises

     1. The median of a distribution is the item so that the sum of
        the frequencies of lexicographically smaller items is ½ F1.
        Use CM sketch to find the (approximate) median.
     2. Assume the input frequencies follow the Zipf distribution,
        so that the i’th largest frequency is ∝ i^(-z) for z>1. Show
        that the CM sketch only needs to be of size ε^(-1/z) to give
        the same guarantee
     3. Suppose we have arrival and departure streams where
        the frequencies of items are allowed to be negative.
        Extend CM sketch analysis to estimate these frequencies
        (note, Markov argument no longer works)
     4. How to find the large absolute frequencies when some
        are negative?

Frequency Moments

    Introduction to Frequency Moments and Sketches
    Count-Min sketch for F∞ and frequent items
    AMS Sketch for F2
    Estimating F0
    Extensions:
     – Higher frequency moments
     – Combined frequency moments




F2 estimation

    AMS sketch (for Alon-Matias-Szegedy) proposed in 1996
     – Allows estimation of F2 (second frequency moment)
     – Used at the heart of many streaming and non-streaming
       mining applications: achieves dimensionality reduction
    Here, describe AMS sketch by generalizing CM sketch.
    Uses extra hash functions g1…g_log 1/δ : {1…U} → {+1,-1}
    Now, given update (j,+c), set CM[k,hk(j)] += c·gk(j)

    [Figure: the AMS sketch is also a linear projection of the
     stream vector]
F2 analysis
    [Figure: update (j,+c) is mapped by h1…hd to one bucket per row,
     each incremented by c·gk(j); d = 8 log 1/δ rows, w = 4/ε² columns]

    Estimate F2 = mediank Σi CM[k,i]²
    Each row’s result is Σi g(i)²xi² + Σi≠j: h(i)=h(j) 2·g(i)·g(j)·xi·xj
    But g(i)² = (-1)² = (+1)² = 1, and Σi xi² = F2
    g(i)g(j) has 1/2 chance of +1 or –1 : expectation is 0 …
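
    One row of the AMS sketch can be sketched as below (seeded hashing
    stands in for the 4-wise independent g; the median over rows is taken
    at the end):

        def ams_row(stream, w, seed):
            buckets = [0] * w
            for j, c in stream:                    # update (j, +c)
                h = hash((seed, "h", j)) % w
                g = 1 if hash((seed, "g", j)) % 2 else -1
                buckets[h] += c * g
            return sum(b * b for b in buckets)     # this row's F2 estimate

        stream = [(j, 1) for j in [1, 2, 1, 3, 1, 2]]   # f = (3,2,1), F2 = 14
        estimates = sorted(ams_row(stream, w=64, seed=s) for s in range(9))
        print(estimates[4])                        # median of 9 rows, close to 14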
F2 Variance
    Expectation of row estimate is exactly F2
    Variance of row k is an expectation:
     – Vark = E[ (Σbuckets b CM[k,b]²)² ] – F2²
     – Good exercise in algebra: expand this sum and simplify
     – Many terms are zero in expectation because of terms like
       g(a)g(b)g(c)g(d) (degree at most 4)
     – Requires that hash function g is four-wise independent: it
       behaves uniformly over subsets of size four or smaller
         Such hash functions are easy to construct

    Row variance can finally be bounded by F2²/w
     – Chebyshev for w=4/ε² gives probability ¼ of failure
     – How to amplify this to small δ probability of failure?

Tail Inequalities for Sums
    We derive stronger bounds on tail probabilities for the sum
     of independent Bernoulli trials via the Chernoff Bound:
     –   Let X1, ..., Xm be independent Bernoulli trials s.t. Pr[Xi=1] = p
         (Pr[Xi=0] = 1-p).
     –   Let X = Σi=1..m Xi, and μ = mp be the expectation of X.
     –   Then, for any ε>0,

                     Pr[|X − μ| ≥ εμ] ≤ 2·exp(-με²/2)




Applying Chernoff Bound

    Each row gives an estimate that is within ε relative error
     with probability p > ¾
    Take d repetitions and find the median. Why the median?
     – Because bad estimates are either too small or too large
     – Good estimates form a contiguous group “in the middle”
     – At least d/2 estimates must be bad for median to be bad
    Apply Chernoff bound to d independent estimates, p=3/4
     – Pr[ More than d/2 bad estimates ] < 2exp(-d/8)
     – So we set d = O(ln 1/δ) to give δ probability of failure
    Same outline used many times in data streams
Aside on Independence
    Full independence is expensive in a streaming setting
     – If hash functions are fully independent over n items, then we
       need Ω(n) space to store their description
     – Pairwise and four-wise independent hash functions can be
       described in a constant number of words
    The F2 algorithm uses a careful mix of limited and full
     independence
     – Each hash function is four-wise independent over all n items
     – Each repetition is fully independent of all others – but there
       are only O(log 1/δ) repetitions.




AMS Sketch Exercises

1.   Let x and y be binary streams of length n.
     The Hamming distance H(x,y) = |{i | x[i] ≠ y[i]}|
     Show how to use AMS sketches to approximate H(x,y)
2.   Extend for strings drawn from an arbitrary alphabet
3.   The inner product of two strings x, y is x·y = Σi=1..n x[i]·y[i]
     Use AMS sketches to estimate x·y
     –   Hint: try computing the inner product of the sketches.
         Show the estimator is unbiased (correct in expectation)
     –   What form does the error in the approximation take?
     –   Use Count-Min Sketches for the same problem and
         compare the errors.
     –   Is it possible to build a (1±ε) approximation of x·y?
Frequency Moments

    Introduction to Frequency Moments and Sketches
    Count-Min sketch for F∞ and frequent items
    AMS Sketch for F2
    Estimating F0
    Extensions:
     – Higher frequency moments
     – Combined frequency moments




F0 Estimation

    F0 is the number of distinct items in the stream
     –   a fundamental quantity with many applications
    Early algorithms by Flajolet and Martin [1983] gave a nice
     hashing-based solution
     –   analysis assumed fully independent hash functions
    Will describe a generalized version of the FM algorithm
     due to Bar-Yossef et al. with only pairwise independence




F0 Algorithm
    Let [m] be the domain of stream elements
     –   Each item in the stream is from [1…m]
    Pick a random hash function h: [m] → [m³]
     –   With probability at least 1-1/m, no collisions under h

    [Figure: hash domain 0…m³, with the t smallest hash values lying
     below vt]

    For each stream item i, compute h(i), and track the t
     distinct items achieving the smallest values of h(i)
     – Note: if same i is seen many times, h(i) is same
     – Let vt = t’th smallest value of h(i) seen.
    If F0 < t, give exact answer, else estimate F’0 = tm³/vt
     –   vt/m³ ≈ fraction of hash domain occupied by t smallest
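
    A sketch of the algorithm (M stands in for the m³ hash range, and
    Python's seeded hash for the pairwise independent h):

        import random

        def f0_estimate(stream, t, M=2**30, seed=0):
            smallest = set()                      # the t smallest distinct hashes
            for i in stream:
                smallest.add(hash((seed, i)) % M) # duplicates hash identically
                if len(smallest) > t:
                    smallest.discard(max(smallest))
            if len(smallest) < t:
                return len(smallest)              # exact answer: F0 < t
            return t * M // max(smallest)         # F'0 = t*M / v_t

        stream = [random.randrange(1000) for _ in range(10000)]
        print(f0_estimate(stream, t=600), "vs exact", len(set(stream)))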
Analysis of F0 algorithm

    Suppose F’0 = tm³/vt > (1+ε) F0 [estimate is too high]

    [Figure: hash domain 0…m³, with vt lying below tm³/((1+ε)F0)]

    So for stream = set S ⊆ [m], we have
     – |{ s ∈ S | h(s) < tm³/((1+ε)F0) }| > t
     – Because ε < 1, we have tm³/((1+ε)F0) ≤ (1-ε/2)tm³/F0
     – Pr[ h(s) < (1-ε/2)tm³/F0 ] ≈ 1/m³ · (1-ε/2)tm³/F0 = (1-ε/2)t/F0

     –   (this analysis outline hides some rounding issues)

Chebyshev Analysis

    Let Y be the number of items hashing below tm³/((1+ε)F0)
     – E[Y] = F0 · Pr[ h(s) < tm³/((1+ε)F0) ] = (1-ε/2)t
     – For each item i, variance of the event = p(1-p) < p
     – Var[Y] = Σs∈S Var[ h(s) < tm³/((1+ε)F0) ] < (1-ε/2)t
         We sum variances because of pairwise independence

    Now apply Chebyshev:
     –   Pr[ Y > t ]   ≤ Pr[ |Y – E[Y]| > εt/2 ]
                       ≤ 4Var[Y]/(ε²t²)
                       ≤ 4/(ε²t)
     – Set t = 36/ε² to make this probability ≤ 1/9


Completing the analysis

    We have shown
        Pr[ F’0 > (1+ε) F0 ] < 1/9
    Can show Pr[ F’0 < (1-ε) F0 ] < 1/9 similarly
     –   too few items hash below a certain value
    So Pr[ (1-ε) F0 ≤ F’0 ≤ (1+ε) F0 ] > 7/9 [Good estimate]

    Amplify this probability: repeat O(log 1/δ) times in parallel
     with different choices of hash function h
     –   Take the median of the estimates, analysis as before




F0 Issues

    Space cost:
     – Store t hash values, so O(1/ε² · log m) bits
     – Can improve to O(1/ε² + log m) with additional tricks

    Time cost:
     – Find if hash value h(i) < vt
     – Update vt and the list of t smallest if h(i) not already present
     – Total time O(log 1/ε + log m) worst case



Range Efficiency

    Sometimes input is specified as a stream of ranges [a,b]
     – [a,b] means insert all items (a, a+1, a+2 … b)
     – Trivial solution: just insert each item in the range
    Range efficient F0 [Pavan, Tirthapura 05]
     – Start with an alg for F0 based on pairwise hash functions
     – Key problem: track which items hash into a certain range
     – Dives into hash fns to divide and conquer for ranges
    Range efficient F2 [Calderbank et al. 05, Rusu,Dobra 06]
     – Start with sketches for F2 which sum hash values
     – Design new hash functions so that range sums are fast



F0 Exercises

    Suppose the stream consists of a sequence of insertions
     and deletions.
     Design an algorithm to approximate F0 of the current set.
     –   What happens when some frequencies are negative?
    Give an algorithm to find F0 of the most recent W arrivals
    Use F0 algorithms to approximate Max-dominance: given a
     stream of pairs (i, x(i)), approximate Σi max{x(i) : (i, x(i)) in stream}




Frequency Moments

    Introduction to Frequency Moments and Sketches
    Count-Min sketch for F∞ and frequent items
    AMS Sketch for F2
    Estimating F0
    Extensions:
     – Higher frequency moments
     – Combined frequency moments




Higher Frequency Moments

    Fk for k>2. Use sampling trick as with Entropy [Alon et al 96]:
     – Uniformly pick a position in the stream of length n
     – Set r = how many times that item appears subsequently
     – Set estimate F’k = n(r^k – (r-1)^k)

    E[F’k] = 1/n·n·[ f1^k - (f1-1)^k + (f1-1)^k - (f1-2)^k + … + 1^k - 0^k ] + …
           = f1^k + f2^k + … = Fk
    Var[F’k] ≤ 1/n·n²·[ (f1^k - (f1-1)^k)² + … ]
     – Use various bounds to bound the variance by k·m^(1-1/k)·Fk²
     – Repeat k·m^(1-1/k) times in parallel to reduce variance
    Total space needed is O(k·m^(1-1/k)) machine words

Improvements

    [Coppersmith and Kumar ‘04]: Generalize the F2 approach
     – E.g. For F3, set p=1/m, and hash items onto {1-1/p, -1/p}
       with probability {1/p, 1-1/p} respectively.
     – Compute cube of sum of the hash values of the stream
      – Correct in expectation, bound variance ≤ O(m·F3²)
    [Indyk, Woodruff ‘05, Bhuvangiri et al. ‘06]: Optimal solutions by
     extracting different frequencies
     – Use hashing to sample subsets of items and fi’s
     – Combine these to build the correct estimator
      – Cost is O(m^(1-2/k) poly-log(m,n,1/ε)) space




Combined Frequency Moments
    Consider network traffic data: it defines a communication graph,
     e.g. edge: (source, destination)
     or edge: (source:port, dest:port)
    This defines a (directed) multigraph
    We are interested in the underlying (support) graph on n nodes

    Want to focus on the number of distinct communication
     pairs, not the size of communication
    So want to compute moments of F0 values...
Multigraph Problems

    Let G[i,j] = 1 if (i,j) appears in stream:
     edge from i to j. Total of m distinct edges
    Let di = Σj=1..n G[i,j] : degree of node i
    Find aggregates of di’s:
     – Estimate heavy di’s (people who talk to many)
     – Estimate frequency moments:
       number of distinct di values, sum of squares
     – Range sums of di’s (subnet traffic)




F (F0) using CM-FM

    Find i’s such that di > f i di
     Finds the people that talk to many others
    Count-Min sketch only uses additions, so can apply:




Accuracy for F∞(F0)

    Focus on point query accuracy: estimate di.
    Can prove the estimate has only a small bias in expectation
     –   Analysis is similar to the original CM sketch analysis, but now
         has to take account of F0 estimation of counts
    Gives a bound of O(1/ε³ poly-log(n)) space:
     –   The product of the sizes of the sketches

    Remains to fully understand other combinations of
     frequency moments, e.g. F2(F0), F2(F2) etc.




Exercises / Problems

1.   (Research problem) Read, understand and simplify
     analysis for optimal Fk estimation algorithms
2.   Take the sampling Fk algorithm and combine it with F0
     estimators to approximate Fk of node degrees
3.   Why can’t we use the sketch approach for F2 of node
     degrees? Show where the analysis breaks down
4.   (Research problem) What can be computed for other
     combinations of frequency moments, e.g. F2 of F2
     values, etc.?




Frequency Moments

    Introduction to Frequency Moments and Sketches
    Count-Min sketch for F∞ and frequent items
    AMS Sketch for F2
    Estimating F0
    Extensions:
     – Higher frequency moments
     – Combined frequency moments




Data Stream Algorithms
    Lower Bounds


     Graham Cormode
     graham@research.att.com
Streaming Lower Bounds

    Lower bounds for data streams
     – Communication complexity bounds
     – Simple reductions
     – Hardness of Gap-Hamming problem
     – Reductions to Gap-Hamming



    [Figure: Alice streams bits 1 0 1 1 1 0 1 0 1 … to Bob]
This Time: Lower Bounds

    So far, have seen many examples of things we can do
     with a streaming algorithm
    What about things we can’t do?
    What’s the best we could achieve for things we can do?
    Will show some simple lower bounds for data streams
     based on communication complexity




Streaming As Communication

           Alice


            1 0 1 1 1 0 1 0 1 …

                     Bob




    Imagine Alice processing a stream
    Then take the whole working memory, and send to Bob
    Bob continues processing the remainder of the stream


Streaming As Communication

    Suppose Alice’s part of the stream corresponds to string
     x, and Bob’s part corresponds to string y...
    ...and that computing the function on the stream
     corresponds to computing f(x,y)...
    ...then if f(x,y) has communication complexity Ω(g(n)),
     then the streaming computation has a space lower
     bound of Ω(g(n))
    Proof by contradiction:
     If there was an algorithm with better space usage, we
     could run it on x, then send the memory contents as a
     message, and hence solve the communication problem

Deterministic Equality Testing
              1 0 1 1 1 0 1 0 1 …

                  1 0 1 1 0 0 1 0 1 …

    Alice has string x, Bob has string y, want to test if x=y
    Consider a deterministic (one-round, one-way) protocol
     that sends a message of length m < n
    There are 2m possible messages, so some strings must
     generate the same message: this would cause error
    So a deterministic message (sketch) must be Ω(n) bits
     –   In contrast, we saw a randomized sketch of size O(log n)


Hard Communication Problems

    INDEX: x is a binary string of length n,
     y is an index in [n]
     Goal: output x[y]
     Result: the (one-way) (randomized) communication
     complexity of INDEX is Ω(n) bits

    DISJOINTNESS: x and y are both length n binary strings
     Goal: Output 1 if ∃i: x[i]=y[i]=1, else 0
     Result: the (multi-round) (randomized) communication
     complexity of DISJOINTNESS is Ω(n) bits



Simple Reduction to Disjointness
             x: 1 0 1 1 0 1                  1, 3, 4, 6

             y: 0 0 0 1 1 0                  4, 5


    F: output the highest frequency in a stream
    Input: the two strings x and y from disjointness
    Stream: if x[i]=1, then put i in stream; then same for y
    Analysis: if F=2, then intersection; if F1, then disjoint.
    Conclusion: Giving exact answer to F requires (N) bits
     – Even approximating up to 50% error is hard
     – Even with randomization: DISJ bound allows randomness

Simple Reduction to Index
             x: 1 0 1 1 0 1                   1, 3, 4, 6

             y: 5                             5


    F0: output the number of distinct items in the stream
    Input: the string x and index y from INDEX
    Stream: if x[i]=1, put i in stream; then put y in stream
    Analysis: if (1-ε)F’0(xy) > (1+ε)F’0(x) then x[y]=0, else it is 1
    Conclusion: Approximating F0 for ε<1/N requires Ω(N) bits
     – Implies that space to approximate must be Ω(1/ε)
     – Bound allows randomization
Hardness Reduction Exercises

Use reductions to DISJ or INDEX to show the hardness of:
    Frequent items: find all items in the stream whose
     frequency > fN, for some fraction f
    Sliding window: given a stream of binary (0/1) values,
     compute the sum of the last N values
     –   Can this be approximated instead?
    Min-dominance: given a stream of pairs (i, x(i)),
     approximate Σi min{x(i) : (i, x(i)) in stream}
    Rank sum: Given a stream of (x,y) pairs and query (p,q)
     specified after stream, approximate |{(x,y)| x<p, y<q}|


Streaming Lower Bounds

    Lower bounds for data streams
     – Communication complexity bounds
     – Simple reductions
     – Hardness of Gap-Hamming problem
     – Reductions to Gap-Hamming



    [Figure: Alice streams bits 1 0 1 1 1 0 1 0 1 … to Bob]
Gap Hamming

GAP-HAMM communication problem:
    Alice holds x ∈ {0,1}^N, Bob holds y ∈ {0,1}^N
    Promise: H(x,y) is either ≤ N/2 - √N or ≥ N/2 + √N
    Which is the case?
    Model: one message from Alice to Bob

Requires Ω(N) bits of one-way randomized communication
            [Indyk, Woodruff ’03, Woodruff ’04, Jayram, Kumar, Sivakumar ’07]




Hardness of Gap Hamming

    Reduction from an instance of INDEX
    Map string x to u by 1 → +1, 0 → -1 (i.e. u[i] = 2x[i] - 1)
    Assume both Alice and Bob have access to public
     random strings rj, where each bit of rj is iid uniform {-1, +1}
    Assume w.l.o.g. that the length of string n is odd (important!)
    Alice computes aj = sign(rj · u)
    Bob computes bj = sign(rj[y])
    Repeat N times with different random strings, and
     consider the Hamming distance of a1…aN with b1…bN



Probability of a Hamming Error

    Consider the pair aj = sign(rj · u), bj = sign(rj[y])
    Let w = Σi≠y u[i]·rj[i]
     –   w is a sum of (n-1) values distributed iid uniform {-1,+1}
    Case 1: w ≠ 0. So |w| ≥ 2, since (n-1) is even
     – so aj = sign(w), independent of x[y]
     – Then Pr[aj ≠ bj] = Pr[sign(w) ≠ sign(rj[y])] = ½
    Case 2: w = 0.
     So aj = sign(rj·u) = sign(w + u[y]rj[y]) = sign(u[y]rj[y])
     – Then Pr[aj = bj] = Pr[sign(u[y]rj[y]) = sign(rj[y])]
     – This probability is 1 if u[y]=+1, 0 if u[y]=-1
     – Completely biased by the answer to INDEX


Finishing the Reduction

    So what is Pr[w=0]?
     – w is a sum of (n-1) iid uniform {-1,+1} values
     – Textbook: Pr[w=0] = c/√n, for some constant c
    Do some probability manipulation:
     – Pr[aj = bj] = ½ + c/(2√n) if x[y]=1
     – Pr[aj = bj] = ½ - c/(2√n) if x[y]=0
    Amplify this bias by making strings of length N = 4n/c²
     – Apply Chernoff bound on N instances
     – With prob > 2/3, either H(a,b) > N/2 + √N or H(a,b) < N/2 - √N
    If we could solve GAP-HAMMING, we could solve INDEX
     –   Therefore, need Ω(N) = Ω(n) bits for GAP-HAMMING

Streaming Lower Bounds

    Lower bounds for data streams
     – Communication complexity bounds
     – Simple reductions
     – Hardness of Gap-Hamming problem
     – Reductions to Gap-Hamming



    [Figure: Alice streams bits 1 0 1 1 1 0 1 0 1 … to Bob]
Lower Bound for Entropy

                 Alice: x ∈ {0,1}^N, Bob: y ∈ {0,1}^N
                 Entropy estimation algorithm A
    Alice runs A on enc(x) = (1,x1), (2,x2), …, (N,xN)
    Alice sends over memory contents to Bob
    Bob continues A on enc(y) = (1,y1), (2,y2), …, (N,yN)


                  0      1               0              0       1   1
        Alice
                (1,0) (2,1) (3,0) (4,0) (5,1) (6,1)

                (1,1) (2,1) (3,0) (4,0) (5,1) (6,0)
         Bob
                  1      1               0              0       1   0
Lower Bound for Entropy

    Observe: there are
     – 2H(x,y) tokens with frequency 1 each
     – N-H(x,y) tokens with frequency 2 each
    So, H(S) = log N + H(x,y)/N
    Thus the size of Alice’s memory contents must be Ω(N).
     Set ε = 1/(√N log N) to show a bound of Ω(ε⁻²/log²(1/ε))
Lower Bound for F0

    The same encoding works for F0 (Distinct Elements)
     – 2H(x,y) tokens with frequency 1 each
     – N-H(x,y) tokens with frequency 2 each
    F0(S) = N + H(x,y)
    Either H(x,y) > N/2 + √N or H(x,y) < N/2 - √N
     – If we could approximate F0 with ε < 1/√N, we could separate
     – But then the space bound is Ω(N) = Ω(ε⁻²) bits
    Dependence on ε for F0 is tight

    Similar arguments show Ω(ε⁻²) bounds for Fk
     –   Proof assumes k (and hence 2^k) is constant
Lower Bounds Exercises

1.   Formally argue the space lower bound for F2 via Gap-
     Hamming
2.   Argue space lower bounds for Fk via Gap-Hamming
3.   (Research problem) Extend lower bounds for the case
     when the order of the stream is random or near-random
4.   (Research problem) Kumar conjectures the multi-round
     communication complexity of Gap-Hamming is Ω(n) –
     this would give lower bounds for multi-pass streaming




Streaming Lower Bounds

    Lower bounds for data streams
     – Communication complexity bounds
     – Simple reductions
     – Hardness of Gap-Hamming problem
     – Reductions to Gap-Hamming



    [Figure: Alice streams bits 1 0 1 1 1 0 1 0 1 … to Bob]
Data Stream Algorithms
 Extensions and Open
       Problems

      Graham Cormode
      graham@research.att.com
This Time: Extensions

    Have given “the basics” of streaming: streams of items,
     frequency moments, upper and lower bounds
    Many variations with many open problems
     – Streams representing different combinatorial objects
     – Streams that are distributed, correlated, uncertain
     – Systems for processing streams
     – Different models of streams


    See also “Open problems in Data Streams” [McGregor ’07]
     –   Result of a workshop held at IIT Kanpur in Dec 2006


Deterministic Streaming Algorithms

    Focus so far has been on randomized algorithms
    Many important problems can be solved deterministically!
     – Finding frequent items/ heavy hitters
     – Finding quantiles of a distribution
    For many problems, lower bounds show randomization is
     necessary for sublinear space:
     – Anything involving equality testing as a special case
     – Frequency moments
    When they are possible, deterministic algorithms are often
     faster and use less space: more practical to implement


Clustering On Data Streams
        Goal: output k cluster centers at end
         - any point can be classified using these centers.
        Use divide and conquer approach [Guha et al. ’00]:
         – Buffer as many points as possible, then cluster them
         – Cluster the clusters
         – Cluster the cluster clusters, etc...
         – Each level of clustering gives up extra factors in quality




[Figure: input point set reduced to k cluster centers]


Geometric Streaming

    Stream specifies a sequence of d-dimensional points
    Answer various geometric problems such as:
     – Convex hull
     – Minimum spanning tree weight
     – Facility location
     – Minimum enclosing ball
    Gridding approach reduces to Fk or related problems
     [Indyk ’03]
    Core-set: keep a carefully chosen small subset of points
     and evaluate on them [Har-Peled 02, Chan’06]
     –   Simple example: For minimum enclosing ball, keep
          extremal points in evenly-spaced directions
Sliding Window Computations

     In a sliding window, we only consider the last W items
      –   W still very large, so want poly-log(W) solutions
     Exponential Histograms [Datar et al.02]
      and Waves [Gibbons Tirthapura’02]
      – Deterministic structure tracks counts in a window
      – Based on doubling bucket sizes to give relative error
      – Same structure + sketches solves for aggregates
     Asynchronous streams: items not in timestamp order
      – Relative error counts possible [Busch, Tirthapura ’07]
      – Extend concept to other aggregates [C. et al. ’08]



Time Decay

     Assign a weight to each item as a function of its age
       – E.g. Exponential decay or polynomial decay
       – Implies “weighted” versions of problems
     Cohen and Strauss [2003]:
                                                                    age 
       –   Can reduce sum and counts to multiple
           instances of sliding window queries
     C., Korn and Tirthapura [2008]:
       –   Same observations applies to other
           computations (quantiles, frequent items)




Multi-Pass Algorithms

     Some situations allow multiple passes of the stream
      –   E.g. scanning over slow storage (tape):
          random access not possible, but can scan multiple times
     Earliest work in streaming [Munro, Paterson ’78] studied the
      pass/space tradeoff for finding medians
     Lower bounds can follow from multi-round
      communication complexity bounds


                    1 0 1 1 1 0 1 0 1 …



Other Massive Data Models




   Massive Unordered Data (MUD) model [Feldman et al. ‘08]
      – Abstracts computations in MapReduce/Hadoop settings
      – Can provably simulate deterministic streaming algs
      – What about randomized computations, multiple passes?
Skewed Streams




      [Figure: a skewed frequency distribution (items sorted by
       frequency); on log-log axes (log frequency vs. log rank) a
       Zipfian distribution is close to a straight line]


     In practice, not all frequency distributions are worst case
      – Few items are frequent, then a long tail of infrequent items
      – Such skew is prevalent in network data, word frequency,
        paper citations, city sizes, etc.
      – “Zipfian” distribution with skew z > 0 (z = [1..2] typical)
     Analyze algorithms under assumption of skewed data
       –   Improved F2 space cost = O(ε^(-2/(1+z)) log 1/δ), provided z>1
Graph Streaming
      [Figure: a 5-node graph arriving as the edge stream
       (4,5) (2,3) (1,3) (3,5) (1,2) (2,4) (1,5) (3,4)]

      Stream specifies a massive graph edge by edge
       – Most natural problems have Ω(|V|) space lower bounds
       – Semi-streaming model: allow Õ(|V|) but o(|E|) space
         (therefore also o(|V|²) space)
     Allow one (or few) passes to approximate:
      – Minimum Spanning Tree Weight
      – Graph Distances (based on spanners)
      – Maximum weight matching
      – Counting Triangles
Matrix Streaming

      Stream specifies a massive n × n matrix
       –   Either by giving entries in some
           order, or updates to entries
      In one (or few) passes, find:
       – CUR Decomposition
       – Page Rank Vector
       – Approximate Matrix product
       – Singular Value Decomposition

      [Figure: CUR decomposition A ≈ C·U·R, where C is O(1) columns
       of A, R is O(1) rows of A, and U is a carefully chosen small matrix]
     Current methods take small constant number of passes,
      sample constant number of rows and columns by weight
      –   Sketching methods don’t seem so useful here

Permutation Streaming

           1     3     4       2                            3      4   1   2

     Stream presents a permutation of items
      –   Abstraction of several settings, more of theoretic interest
     Approximate number of inversions in the stream
      – Locations where i > j but i appears before j in stream
      – Can be reduced to a variation of quantiles [Gupta, Zane’03]
     Find length of longest increasing subsequence
      – Reduce (up to factor 2) to simpler function [Ergun, Jowhari ’08]
      – Approximate this using a different variation of quantiles
       – Deterministic lower bound Ω(N^(1/2)), randomized bound open

Random Order Streaming

     Lower bounds are sometimes based on carefully
      creating adversarial orders of streams
     Random order streams: order is uniformly permuted
       – Can sometimes give much better upper bounds – a prefix of
         the stream gives a good sample of the distribution to come
      – Lower bounds in random order give stronger evidence of
        “robust” hardness, e.g. [Chakrabarti et al. ’08]
           GAP-HAMMING still has linear lower bound

            t-party DISJOINTNESS has Ω(n/t) lower bound




Probabilistic Streams
                 Example: S = (x, ½), (y, 1/3), (y, ¼)
                    Encodes 6 “possible worlds”:

                   G      ∅    x    y     x,y   y,y   x,y,y
                   Pr[G]  ¼    ¼    5/24  5/24  1/24  1/24


     Instead of exact values, stream of discrete distributions
      –   Specify exponentially many possible worlds
     Adds complexity to previously studied problems
      – Sum and Count are easy (by linearity of expectation)
       – Avg = Sum/Count is hard! – because of the ratio [McGregor et al. ’07]
     Linearity of expectation, summation of variance
      –   Allows estimation of Fk over streams [C, Garofalakis ’07]

Distributed Streams




                                                                      http://www.intel.com/research/exploratory/motes.htm
     Motivated by Sensor Networks – large wireless nets
      –   Communication drains battery: compute more, send less
     Key problem: design stream summary data structures
      that can be combined to summarize the union of streams
      – Most sketches (AMS, Count-Min, F0) naturally distribute
      – Similar results needed for other problems




       base station
       (root, coordinator…)

Continuous Distributed Model
                                 Coordinator                        Track Q(S1,…,Sm)

                                                                             local stream(s)
                                                                              seen at each
      m sites                                                                      site


          S1                                                                        Sm

   Goal: Continuously track (global) query over streams at
    the coordinator while bounding the communication
      –   Large-scale network-event monitoring, real-time anomaly/
          DDoS attack detection, power grid monitoring, …
   Results known for quantiles, Fk, clustering...
      –   Cost not much higher than one time computation [C et al. 08]
Extensions for P2P Networks

     Much work focused on specifics of sensor and wired nets
     P2P and Grid computing present alternate models
       – Structure of multi-hop overlay networks
       – “Controlled failure” model: nodes explicitly leave and join
     Allows us to think beyond model of “highly resource
      constrained” sensors.
     Implementations such as OpenDHT over PlanetLab
      [Rhea et al.’05]




Authenticated Stream Aggregation
                             Wide-area query processing
                                –   Possible malicious aggregators
                                –   Can suppress or add spurious
                                    information
                             Authenticate query results at the
                              querier?
                                –   Perhaps, to within some
                                    approximation error
                             Initial steps in [Garofalakis et al.’06],
                             Sliding window: [Hadjieleftheriou et al. ’07]



Data Stream Algorithms

     Slides are on the web on my website
     Long list of references also on the web
     http://dimacs.rutgers.edu/~graham





				