Estimating the Confidence of Conditional Functional Dependencies
Xi Zhang, University at Buffalo, SUNY
Graham Cormode, AT&T Labs-Research
Lukasz Golab, AT&T Labs-Research
Flip Korn, AT&T Labs-Research
Andrew McGregor, University of Massachusetts Amherst
Divesh Srivastava, AT&T Labs-Research
Outline
   Background
       Conditional Functional Dependency (CFD)
   Problem Definitions
       Fixed CFD Problem
       Variable CFD Problem
   Solutions
       A two-pass algorithm
       An “Idealized” one-pass algorithm
       A one-pass algorithm based on multi-level sampling
   Experiments
   Concluding Remarks
Conditional Functional Dependency (CFD)
   Motivation: Data Integration / Data Cleaning
              source    age_grp   education   occupation   salary_grp
             sourceA     31-40      PhD       Professor       SG4
             sourceB     31-40      PhD       Professor       SG4
             sourceC     31-40      PhD       Professor       SG3


      Functional Dependencies (FDs) usually do not hold
          e.g. {age_grp, education, occupation} → {salary_grp}
   CFD = Functional Dependency + Constraints
  CFD Example
  age_grp   education    occupation     salary_grp
   21-30     Masters    Schoolteacher      SG2
   21-30     Masters    Schoolteacher      SG2
   31-40     Masters     Accountant        SG2
   31-40     Masters     Accountant        SG1
   31-40      PhD         Professor        SG4
   31-40      PhD         Professor        SG4
   41-50      PhD         Professor        SG4

  CFD holds: for all rows matching the antecedent of at least one pattern
  in the tableau,
     (1) FD holds
     (2) assertions hold
CFD: FD + Pattern Tableau
FD (antecedent → consequent):
 {age_grp, education, occupation} → salary_grp          Does not hold!
Pattern Tableau (conditions on the antecedent columns, assertions on the consequent column):
   age_grp education      occupation     salary_grp
     __       Masters   Schoolteacher        __
     __         PhD        Professor        SG4
   Support of a CFD
   CFD = (X → Y, Tp)
   age_grp   education    occupation     salary_grp
    21-30     Masters    Schoolteacher      SG2
    21-30     Masters    Schoolteacher      SG2
    21-30     Masters    Schoolteacher      SG1
    31-40     Masters      Accountant       SG2
    31-40     Masters      Accountant       SG1
    31-40       PhD        Professor        SG4
    31-40       PhD        Professor        SG4
    41-50       PhD        Professor        SG3

“support set”: the rows that match the antecedent of at least one pattern in Tp
“support”: the fraction of all rows that fall in the support set
   supp = 6/8 = 0.75

FD (X → Y):
 {age_grp, education, occupation} → salary_grp
Pattern Tableau (Tp):
  age_grp education       occupation     salary_grp
     __       Masters    Schoolteacher      __
     __         PhD        Professor        SG4
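
The support computation on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple encoding of rows, the use of `None` for the wildcard `__`, and the helper names `matches_antecedent`, `support_set` and `support` are hypothetical and do not come from the talk.

```python
# Example relation: (age_grp, education, occupation, salary_grp).
ROWS = [
    ("21-30", "Masters", "Schoolteacher", "SG2"),
    ("21-30", "Masters", "Schoolteacher", "SG2"),
    ("21-30", "Masters", "Schoolteacher", "SG1"),
    ("31-40", "Masters", "Accountant", "SG2"),
    ("31-40", "Masters", "Accountant", "SG1"),
    ("31-40", "PhD", "Professor", "SG4"),
    ("31-40", "PhD", "Professor", "SG4"),
    ("41-50", "PhD", "Professor", "SG3"),
]

# Pattern tableau; None stands for the wildcard "__" (an encoding assumed here).
TABLEAU = [
    (None, "Masters", "Schoolteacher", None),
    (None, "PhD", "Professor", "SG4"),
]

N_ANTECEDENT = 3  # the first three columns form the antecedent X


def matches_antecedent(row, pattern):
    """True if the row agrees with the pattern on every non-wildcard antecedent column."""
    pairs = zip(pattern[:N_ANTECEDENT], row[:N_ANTECEDENT])
    return all(p is None or p == v for p, v in pairs)


def support_set(rows, tableau):
    """Rows matching the antecedent of at least one pattern."""
    return [r for r in rows if any(matches_antecedent(r, p) for p in tableau)]


def support(rows, tableau):
    """Fraction of all rows that fall in the support set."""
    return len(support_set(rows, tableau)) / len(rows)


print(support(ROWS, TABLEAU))  # 0.75
```

Only the antecedent columns decide membership: the last Professor row is in the support set even though it violates the SG4 assertion.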
  Confidence of a CFD
   CFD = (X → Y, Tp)
   age_grp   education    occupation     salary_grp
    21-30     Masters    Schoolteacher      SG2
    21-30     Masters    Schoolteacher      SG2
    21-30     Masters    Schoolteacher      SG1
    31-40     Masters      Accountant       SG2
    31-40     Masters      Accountant       SG1
    31-40       PhD        Professor        SG4
    31-40       PhD        Professor        SG4
    41-50       PhD        Professor        SG3

FD (X → Y):
 {age_grp, education, occupation} → salary_grp
Pattern Tableau (Tp):
  age_grp education       occupation     salary_grp
     __       Masters    Schoolteacher      __
     __         PhD        Professor        SG4

Violations in Support Set:
   (1) Not satisfying FD
   (2) Not satisfying the assertions
“Confidence”: the fraction of the maximal support set that can be retained
in order to satisfy the CFD
   C = (2+2)/6 = 0.67
Confidence would be lower, i.e., (1+2)/6 = 0.5, if we do not use the
maximal support set satisfying the CFD
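
As a rough illustration of the definition above, the following Python sketch computes the exact confidence of the example CFD. It is a sketch under assumptions, not the authors' implementation: rows are tuples, `None` encodes the wildcard `__`, and the function name `confidence` and parameter `n_x` (number of antecedent columns) are made up for this example. It also assumes the support set is non-empty.

```python
from collections import Counter, defaultdict

ROWS = [
    ("21-30", "Masters", "Schoolteacher", "SG2"),
    ("21-30", "Masters", "Schoolteacher", "SG2"),
    ("21-30", "Masters", "Schoolteacher", "SG1"),
    ("31-40", "Masters", "Accountant", "SG2"),
    ("31-40", "Masters", "Accountant", "SG1"),
    ("31-40", "PhD", "Professor", "SG4"),
    ("31-40", "PhD", "Professor", "SG4"),
    ("41-50", "PhD", "Professor", "SG3"),
]
TABLEAU = [
    (None, "Masters", "Schoolteacher", None),  # None = wildcard "__"
    (None, "PhD", "Professor", "SG4"),
]


def confidence(rows, tableau, n_x=3):
    """Largest fraction of the support set that satisfies both the FD and the assertions."""
    counts = defaultdict(Counter)  # antecedent value -> consequent counts
    asserted = {}                  # antecedent value -> consequents demanded by matching patterns
    for row in rows:
        x, y = row[:n_x], row[n_x]
        pats = [p for p in tableau
                if all(c is None or c == v for c, v in zip(p[:n_x], x))]
        if not pats:
            continue  # row is outside the support set
        counts[x][y] += 1
        asserted[x] = {p[n_x] for p in pats if p[n_x] is not None}

    total = sum(sum(c.values()) for c in counts.values())
    kept = 0
    for x, c in counts.items():
        # Within a group, keep the most common consequent that every assertion allows.
        allowed = [n for y, n in c.items() if all(y == a for a in asserted[x])]
        kept += max(allowed, default=0)
    return kept / total


print(round(confidence(ROWS, TABLEAU), 2))  # 0.67
```

The 41-50 Professor row contributes nothing to the numerator because its only consequent, SG3, contradicts the SG4 assertion.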
Problem Definitions
   Fixed CFD Problem
       Given a relation and a CFD, estimate its confidence
   Variable CFD Problem
       Process a relation and create a summary so that given a CFD
        we can estimate its confidence
   Exact solution: SQL query
       Takes over 24 hrs on 1 billion records
   Approximation: we focus here on the Fixed CFD Problem
    with no assertions
       By extending our approximation algorithms, we can allow
        assertions and solve the Variable CFD version
   Group and Keepers
CFD = (X → Y, Tp)
age_grp   education       occupation     salary_grp
  21-30    Masters       Schoolteacher      SG2
  21-30    Masters       Schoolteacher      SG2
  21-30    Masters       Schoolteacher      SG1
  31-40    Masters        Accountant        SG2
  31-40    Masters        Accountant        SG1
  31-40       PhD          Professor        SG4
  41-50       PhD          Professor        SG3

“Group” of x: rows with the same value on the antecedent x
“Keepers”: within each group, rows with the most common consequent
“Confidence” of a group x: the fraction of the group's rows that are keepers
   (shown on the slide: 2/3 for the Schoolteacher rows, 1/2 for the Professor rows)

FD (X → Y):
 {age_grp, education, occupation} → salary_grp
Pattern Tableau (Tp):
age_grp   education       occupation     salary_grp
   __      Masters       Schoolteacher      __
   __         PhD          Professor        __

Confidence of a CFD vs. confidence of groups:
   C = (2+1) / (3+2)
     = (3/(3+2)) * (2/3) + (2/(3+2)) * (1/2) = 0.6
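
The identity above (CFD confidence equals the size-weighted average of group confidences) can be checked with exact fractions. The snippet is only a sketch: the dictionary `GROUPS`, with its (group size, keepers) pairs taken from the slide's 2/3 and 1/2 figures, is a hypothetical encoding.

```python
from fractions import Fraction

# (group size, number of keepers) for the two groups on the slide.
GROUPS = {
    "Masters / Schoolteacher": (3, 2),
    "PhD / Professor": (2, 1),
}

support_size = sum(size for size, _ in GROUPS.values())

# Direct definition: all keepers over all rows in the support set.
direct = Fraction(sum(keepers for _, keepers in GROUPS.values()), support_size)

# Weighted form: each group's confidence weighted by its share of the support set.
weighted = sum(
    Fraction(size, support_size) * Fraction(keepers, size)
    for size, keepers in GROUPS.values()
)

print(direct, weighted)  # 3/5 3/5
```

The weighted form is what makes row sampling attractive: a row drawn uniformly from the support set lands in a group with probability proportional to the group's size.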
Why not simple uniform sampling?
  FD (X → Y): {education, occupation} → salary_grp
  Pattern Tableau (Tp):
      education     occupation     salary_grp
       Masters          __             __

  Estimated conf ≈ 1 for both R1 and R2, unless the sample size is Ω(N^(1/2))

R1: true conf = 0.75
 education     occupation   salary_grp     Grp size   Grp conf
  Masters         OC0          SG0
  Masters         OC0          SG0
     …             …            …             N/2         1      (a “large” group)
  Masters         OC0          SG0
  Masters         OC1          SG1
  Masters         OC1          SG2             2         0.5
     …             …            …                                (lots of “small” groups)
  Masters        OC(N/4)       SG1
  Masters        OC(N/4)       SG2             2         0.5

R2: true conf = 1
 education     occupation   salary_grp     Grp size   Grp conf
  Masters         OC0          SG0
  Masters         OC0          SG0
     …             …            …             N/2         1      (a “large” group)
  Masters         OC0          SG0
  Masters         OC1          SG1
  Masters         OC1          SG1             2          1
     …             …            …                                (lots of “small” groups)
  Masters        OC(N/4)       SG1
  Masters        OC(N/4)       SG1             2          1
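
A small simulation makes the failure of uniform row sampling concrete. The sketch below rebuilds R1 and R2 and compares their exact confidences with the confidence measured on a small uniform sample. The helper names `build_relation` and `exact_confidence`, the values N = 40,000 and sample size 40, and the fixed seed are illustrative assumptions. Since education is always Masters, only occupation is used as the group key.

```python
import random
from collections import Counter, defaultdict


def exact_confidence(rows):
    """Keepers (rows with their group's most common consequent) over all rows."""
    groups = defaultdict(Counter)
    for x, y in rows:
        groups[x][y] += 1
    keepers = sum(max(c.values()) for c in groups.values())
    return keepers / len(rows)


def build_relation(n, conflicting):
    """One large clean group of n/2 rows plus n/4 groups of size 2.

    With conflicting=True the two rows of each small group disagree (R1);
    otherwise they agree (R2).
    """
    rows = [("OC0", "SG0")] * (n // 2)
    for g in range(1, n // 4 + 1):
        rows.append((f"OC{g}", "SG1"))
        rows.append((f"OC{g}", "SG2" if conflicting else "SG1"))
    return rows


N = 40_000
r1 = build_relation(N, conflicting=True)
r2 = build_relation(N, conflicting=False)

rng = random.Random(0)
sample1 = rng.sample(r1, 40)  # far fewer than sqrt(N) = 200 rows
sample2 = rng.sample(r2, 40)

print("true conf:", exact_confidence(r1), exact_confidence(r2))  # 0.75 1.0
print("sampled conf:", exact_confidence(sample1), exact_confidence(sample2))
```

With so few rows, the sample almost never contains both rows of the same small group, so the conflicts in R1 are invisible and both samples look nearly clean.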
Lower bounds on the Fixed CFD Problem

   Relative Error
     Estimating the confidence with relative error ε < 1/3 in P
      passes over input of size N with at most δ
      probability of error requires at least Ω(N/P)
      space.
   Additive Error
     Estimating the confidence with additive error at most ε within
      a constant number of passes over the data
      with at most δ probability of error requires at
      least Ω(1/ε²) space.
Two-Pass Solution
   Pass 1: Uniformly sample rows from the
    support set using reservoir sampling [Vitter85]
   Pass 2:
     Estimate the confidence of each group tagged in pass
      1 using one of the following
          Sampling, e.g., reservoir sampling [Vitter85] (space    )
          Heavy Hitter, e.g., SpaceSaving [Metwally05] (space    )
     Report the mean of the group confidences
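
The two passes can be sketched as follows. This is a simplified illustration rather than the algorithm as presented: pass 2 here counts the tagged groups exactly instead of using per-group sampling or SpaceSaving, the mean is taken over sampled rows (so a group tagged twice counts twice, which is an assumption about how the mean is formed), and the names `reservoir_sample` and `two_pass_confidence` and the parameter `k` are hypothetical. Rows are (antecedent, consequent) pairs already restricted to a non-empty support set.

```python
import random
from collections import Counter, defaultdict


def reservoir_sample(stream, k, rng):
    """Uniform sample of k items from a stream of unknown length (Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def two_pass_confidence(rows, k, seed=0):
    rng = random.Random(seed)

    # Pass 1: sample rows uniformly and tag the groups they belong to.
    tagged = [x for x, _ in reservoir_sample(iter(rows), k, rng)]

    # Pass 2: measure each tagged group's confidence (exactly, in this sketch).
    wanted = set(tagged)
    counts = defaultdict(Counter)
    for x, y in rows:
        if x in wanted:
            counts[x][y] += 1

    confs = [max(counts[x].values()) / sum(counts[x].values()) for x in tagged]
    return sum(confs) / len(confs)


rows = [(f"g{i}", y) for i in range(10) for y in ("a", "b")]
print(two_pass_confidence(rows, k=5))  # 0.5
```

Sampling rows rather than groups picks each group with probability proportional to its size, which matches the size-weighted form of the confidence.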
Two-Pass Solution
[Figure: worked example, true conf = (26/48) = 0.54; legend: keepers]
   Pass 1: a uniform random sample (tagged groups)
   Pass 2: a uniform random sample per group tagged in pass 1
   (est. group confidence)   0.8 0.6 0.8 0.2 0.4 0.5 0.5
   (est. conf = mean)        0.53
Idealized One-Pass Solution
   One-pass algorithm based on an “idealized” quantity
   An unbiased estimator
   Works well in practice, though with no worst-case guarantees
Idealized One-Pass Solution
   Stream: {r1, r2, …, rN}
   “Suffix stream” Ri: {ri, ri+1, …, rN}
   [Figure legend: “suffix keeper” / no “suffix keeper”]

   ri = (xi, yi) is a “keeper of its suffix stream” Ri iff
       |{(xi, yi) ∈ Ri}| > |{(xi, y′) ∈ Ri}| for every y′ ≠ yi

   “Idealized” quantity Xi:
       Xi = 1 iff ri is a “keeper” of its suffix stream Ri
       Xi = 0 otherwise

   For i chosen uniformly at random, Xi is an unbiased estimator of confidence
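
One way to see why the idealized quantity is unbiased is to compute every Xi exactly and compare their sum with the number of keepers. The sketch below does this with a backward scan that keeps full counts, so it uses far more memory than a one-pass sampling algorithm would; it only illustrates the definition. The function names `suffix_keeper_flags` and `keeper_count` are hypothetical, and rows are (antecedent, consequent) pairs.

```python
from collections import Counter, defaultdict


def suffix_keeper_flags(rows):
    """flags[i] = 1 iff rows[i] is a keeper of its suffix stream rows[i:]."""
    counts = defaultdict(Counter)  # counts over the current suffix
    flags = [0] * len(rows)
    for i in range(len(rows) - 1, -1, -1):
        x, y = rows[i]
        counts[x][y] += 1
        mine = counts[x][y]
        # Strictly more common than every other consequent of the same group.
        flags[i] = int(all(mine > c for other, c in counts[x].items() if other != y))
    return flags


def keeper_count(rows):
    """Number of keepers over the whole stream."""
    groups = defaultdict(Counter)
    for x, y in rows:
        groups[x][y] += 1
    return sum(max(c.values()) for c in groups.values())


rows = [("g1", "A"), ("g1", "B"), ("g1", "A"), ("g2", "C"), ("g2", "D")]
flags = suffix_keeper_flags(rows)
print(flags)                                   # [1, 0, 1, 0, 1]
print(sum(flags) / len(rows))                  # 0.6
print(keeper_count(rows) / len(rows))          # 0.6
```

Scanning backwards, a group's largest consequent count grows by one exactly when the newly added row becomes the strict leader, so the flags sum to the number of keepers regardless of row order.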
Idealized One-Pass Solution
[Figure: worked example, true conf = (26/48) = 0.54]
   (1) Create a uniform random sample of suffix streams
   (2) For each sampled suffix stream, create a uniform random sample
   (est. quantity Xi)       1   0   1   0   1   0   1   0
   (est. conf = E(Xi))      0.50
    One-Pass Solution
   Based on multi-level sampling
   Different levels target groups of different sizes
   Use Count-Min Sketch [Cormode04] to keep
    counts of frequent items
   A more involved analysis shows that using
                  buckets in CM sketches gives the
    desired guarantee on the absolute error
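
For readers unfamiliar with the Count-Min sketch, a minimal version looks roughly like this. It is a generic textbook-style sketch, not the configuration used in the talk: the `width` and `depth` values, the class and method names, and the use of salted BLAKE2b hashes as the row hash functions are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates never fall below the true count."""

    def __init__(self, width, depth):
        self.width = width  # buckets per row
        self.depth = depth  # number of independent hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, row):
        # One hash function per row, derived by salting with the row index.
        digest = hashlib.blake2b(
            repr(item).encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][self._bucket(item, row)]
                   for row in range(self.depth))


cms = CountMinSketch(width=64, depth=4)
for key in ["a"] * 50 + ["b"] * 7:
    cms.add(key)
print(cms.estimate("a") >= 50, cms.estimate("b") >= 7)  # True True
```

In the multi-level scheme, one such sketch per level could track group sizes and another could track (antecedent, consequent) row counts, from which keepers are estimated.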
    One-Pass Multi-level Sampling Solution
[Figure: pipeline applied at each level]
   (Level)              L0 … Lk … Llog(N)
   (Sampling rate)
   (CM Sketches)        (for groups) and (for rows)
   (heavy hitters)      (more than    fraction of the freq.)
   (group size filter)
   est. keepers at level Lk
   (est. conf)
Experiments
   Data
     Synthetic Datasets
          Zipf distribution on group size
     Real Datasets
          Online Retail Sales Record
                Retailer: 300,000 records
          World Cup ’98 website access logs
                WorldCup1Day: 7 million records
                WorldCup1Month: 1 billion records

   Baseline Algorithms: UniformRow, UniformGroup
Fixed CFD Estimation - Accuracy
   [Plots: Synthetic Datasets; Retailer]
Two-Pass vs. Idealized One-Pass
   [Plot: WorldCup1Month (1 billion records)]
Variable CFD Estimation
Experiment Highlights
   TWO-PASS and IDEALIZED provide estimates with very
    small error given very limited space, regardless of data
    skew
   MULTILEVEL, while having strong analytical guarantees,
    requires a large amount of space in order to produce
    good estimates in practice.
   Increasing the space tends to improve the accuracy of all
    three algorithms
   TWO-PASS is slightly more accurate than IDEALIZED,
    though at the price of a second pass
   TWO-PASS and IDEALIZED do a good job solving the
    variable CFD estimation problem
       Able to process data at a rate of over 100,000 records/sec
Concluding Remarks
   Designed three algorithms to estimate the CFD
    confidence in a small number of passes when
    the tableau is given upfront (Fixed CFD), or
    afterwards (Variable CFD)
   Also applicable to variants of CFDs
     Fail tableau [Golab08]
     Tableaux with negation and disjunctions [Bravo08]

   Future Work
     Use our summary in the tableau discovery problem
Thank you!

				