# cfd


```
Estimating the Confidence of Conditional Functional Dependencies
Xi Zhang, University at Buffalo, SUNY
Graham Cormode, AT&T Labs-Research
Lukasz Golab, AT&T Labs-Research
Flip Korn, AT&T Labs-Research
Andrew McGregor, University of Massachusetts Amherst
Divesh Srivastava, AT&T Labs-Research
Outline
   Background
      Conditional Functional Dependency (CFD)
   Problem Definitions
      Fixed CFD Problem
      Variable CFD Problem
   Solutions
      A two-pass algorithm
      An “idealized” one-pass algorithm
      A one-pass algorithm based on multi-level sampling
   Experiments
   Concluding Remarks
Conditional Functional Dependency (CFD)
   Motivation: Data Integration / Data Cleaning

source    age_grp   education   occupation   salary_grp
sourceA   31-40     PhD         Professor    SG4
sourceB   31-40     PhD         Professor    SG4
sourceC   31-40     PhD         Professor    SG3

   Functional Dependencies (FDs) usually do not hold
      e.g. {age_grp, education, occupation} → {salary_grp}
   CFD = Functional Dependency + Constraints
CFD Example
age_grp   education   occupation      salary_grp
21-30     Masters     Schoolteacher   SG2
21-30     Masters     Schoolteacher   SG2
31-40     Masters     Accountant      SG2
31-40     Masters     Accountant      SG1
31-40     PhD         Professor       SG4
31-40     PhD         Professor       SG4
41-50     PhD         Professor       SG4

CFD holds: for all rows matching the antecedent of at least one pattern in the tableau,
   (1) FD holds
   (2) assertions hold
CFD : FD + Pattern Tableau
FD (antecedent → consequent):
   {age_grp, education, occupation} → salary_grp      (the FD alone does not hold!)
Pattern Tableau (conditions: age_grp, education, occupation; assertions: salary_grp):
age_grp   education   occupation      salary_grp
__        Masters     Schoolteacher   __
__        PhD         Professor       SG4
Support of a CFD
CFD = (X → Y, Tp)
FD (X → Y):
   {age_grp, education, occupation} → salary_grp
Pattern Tableau (Tp):
age_grp   education   occupation      salary_grp
__        Masters     Schoolteacher   __
__        PhD         Professor       SG4

age_grp   education   occupation      salary_grp
21-30     Masters     Schoolteacher   SG2
21-30     Masters     Schoolteacher   SG2
21-30     Masters     Schoolteacher   SG1
31-40     Masters     Accountant      SG2
31-40     Masters     Accountant      SG1
31-40     PhD         Professor       SG4
31-40     PhD         Professor       SG4
41-50     PhD         Professor       SG3

“Support set”: the rows that match the conditions of at least one pattern in Tp
“Support”: the fraction of all rows that fall in the support set
   supp = 6/8 = 0.75
Confidence of a CFD
CFD = (X → Y, Tp)
FD (X → Y):
   {age_grp, education, occupation} → salary_grp
Pattern Tableau (Tp):
age_grp   education   occupation      salary_grp
__        Masters     Schoolteacher   __
__        PhD         Professor       SG4

age_grp   education   occupation      salary_grp
21-30     Masters     Schoolteacher   SG2
21-30     Masters     Schoolteacher   SG2
21-30     Masters     Schoolteacher   SG1
31-40     Masters     Accountant      SG2
31-40     Masters     Accountant      SG1
31-40     PhD         Professor       SG4
31-40     PhD         Professor       SG4
41-50     PhD         Professor       SG3

Violations in Support Set:
   (1) Not satisfying FD
   (2) Not satisfying the assertions
“Confidence”: the fraction of the maximal support set that can be retained in order to satisfy the CFD
   C = (2+2)/6 = 0.67
   Confidence would be lower, i.e., (1+2)/6 = 0.5, if we do not use the maximal support set satisfying the CFD
Problem Definitions
   Fixed CFD Problem
      Given a relation and a CFD, estimate its confidence
   Variable CFD Problem
      Process a relation and create a summary so that, given a CFD, we can estimate its confidence
   Exact solution: SQL query
      Takes over 24 hrs on 1 billion records
   Approximation: we focus here on the Fixed CFD Problem with no assertions
      By extending our approximation algorithms, we can allow assertions and solve the Variable CFD version
Group and Keepers
CFD = (X → Y, Tp)
FD (X → Y):
   {age_grp, education, occupation} → salary_grp
Pattern Tableau (Tp):
age_grp   education   occupation      salary_grp
__        Masters     Schoolteacher   __
__        PhD         Professor       __

age_grp   education   occupation      salary_grp
21-30     Masters     Schoolteacher   SG2
21-30     Masters     Schoolteacher   SG2
21-30     Masters     Schoolteacher   SG1
31-40     Masters     Accountant      SG2
31-40     Masters     Accountant      SG1
31-40     PhD         Professor       SG4
41-50     PhD         Professor       SG3

“Group” of x: rows with the same value x on the antecedent
“Keepers”: within each group, the rows with the most common consequent
“Confidence” of a group x: the fraction of the group's rows that are keepers (2/3 and 1/2 in the example)

Confidence of a CFD vs. confidence of groups:
   C = (2+1) / (3+2)
     = (3/(3+2)) * (2/3) + (2/(3+2)) * (1/2) = 0.6
Why not simple uniform sampling?
FD (X → Y): {education, occupation} → salary_grp
Pattern Tableau (Tp):
education   occupation   salary_grp
Masters     __           __

R1 (true conf = 0.75):
education   occupation   salary_grp   group conf   group size
Masters     OC0          SG0          1            N/2     (a “large” group)
Masters     OC0          SG0
…           …            …
Masters     OC0          SG0
Masters     OC1          SG1          0.5          2       (lots of “small” groups)
Masters     OC1          SG2
…           …            …
Masters     OC(N/4)      SG1          0.5          2
Masters     OC(N/4)      SG2

R2 (true conf = 1):
education   occupation   salary_grp   group conf   group size
Masters     OC0          SG0          1            N/2     (a “large” group)
Masters     OC0          SG0
…           …            …
Masters     OC0          SG0
Masters     OC1          SG1          1            2       (lots of “small” groups)
Masters     OC1          SG1
…           …            …
Masters     OC(N/4)      SG1          1            2
Masters     OC(N/4)      SG1

Estimated conf ≈ 1 for both R1 and R2, unless the sample size is Ω(N^(1/2))
Lower bounds on the Fixed CFD Problem

   Relative Error
      Estimating the confidence with relative error ε < 1/3 in P passes over input of size N, with at most δ probability of error, requires at least Ω(N/P) space.
      Estimating the confidence with additive error at most ε within a constant number of passes over the data, with at most δ probability of error, requires at least Ω(1/ε²) space.
Two-Pass Solution
   Pass 1: Uniformly sample rows from the support set using reservoir sampling [Vitter85]
   Pass 2:
      Estimate the confidence of each group tagged in pass 1 using one of the following
         Sampling, e.g., reservoir sampling [Vitter85] (space    )
         Heavy hitters, e.g., SpaceSaving [Metwally05] (space    )
      Report the mean of the group confidences
Two-Pass Solution
[Figure: worked example with keepers highlighted; true conf = 26/48 = 0.54]
   Pass 1: a uniform random sample (tagged groups)
   Pass 2: a uniform random sample per group tagged in pass 1
   Est. group confidences: 0.8 0.6 0.8 0.2 0.4 0.5 0.5
   Est. conf = mean: 0.53
Idealized One-Pass Solution
   One-pass algorithm based on an “idealized” quantity
   An unbiased estimator
   Works well in practice, though with no worst-case guarantees
Idealized One-Pass Solution

   Stream: {r1, r2, …, rN}
   “Suffix stream” Ri: {ri, ri+1, …, rN}
   ri is a “keeper of its suffix stream” Ri if
      ri = (xi, yi) and |(xi, yi) ∈ Ri| > |(xi, y’) ∈ Ri| for every y’ ≠ yi
   “Idealized” quantity Xi:
      Xi = 1 iff ri is a “keeper” of its suffix stream Ri
      Xi = 0 otherwise
   Xi is an unbiased estimator of confidence
   [Figure: each row of the stream marked as “suffix keeper” or no “suffix keeper”]
Idealized One-Pass Solution
[Figure: worked example; true conf = 26/48 = 0.54]
   (1) Create a uniform random sample of suffix streams
   (2) For each sampled suffix stream, create a uniform random sample
   Est. quantity Xi: 1 0 1 0 1 0 1 0
   Est. conf = E(Xi): 0.50
One-Pass Solution
   Based on multi-level sampling
   Different levels target groups of different sizes
   Use Count-Min Sketch [Cormode04] to keep
counts of frequent items
   A more involved analysis shows that using
buckets in CM sketches gives the
desired guarantee on the absolute error
One-Pass Multi-level Sampling Solution
[Figure: pipeline across levels L0 … Lk … Llog(N)]
   Each level has its own sampling rate
   CM sketches at each level (for groups, for rows)
   Heavy hitters (more than    fraction of the freq.) give the est. keepers at level Lk
   Group size filter
   Est. conf
Experiments
   Data
      Synthetic datasets
         Zipf distribution on group size
      Real datasets
         Online retail sales records
            Retailer: 300,000 records
         World Cup ’98 website access logs
            WorldCup1Day: 7 million records
            WorldCup1Month: 1 billion records

   Baseline algorithms: UniformRow, UniformGroup
Fixed CFD Estimation - Accuracy
   [Plots: Synthetic Datasets; Retailer]
Two-Pass vs. Idealized One-Pass
   [Plot: WorldCup1Month (1 billion records)]
Variable CFD Estimation
Experiment Highlights
   TWO-PASS and IDEALIZED provide estimates with very small error given very limited space, regardless of data skew
   MULTILEVEL, while having strong analytical guarantees, requires a large amount of space to produce good estimates in practice
   Increasing the space tends to improve the accuracy of all three algorithms
   TWO-PASS is slightly more accurate than IDEALIZED, though at the price of a second pass
   TWO-PASS and IDEALIZED do a good job of solving the variable CFD estimation problem
   All are able to process data at a rate of over 100,000 records/sec
Concluding Remarks
   Designed three algorithms to estimate CFD confidence in a small number of passes, when the tableau is given upfront (Fixed CFD) or afterwards (Variable CFD)
   Also applicable to variants of CFDs
      Fail tableau [Golab08]
      Tableaux with negation and disjunctions [Bravo08]

   Future Work
      Use our summary in the tableau discovery problem
Thank you!

```
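The support and confidence definitions on the "Support of a CFD" and "Confidence of a CFD" slides can be checked with a small exact computation. The sketch below is not the authors' SQL query; it is a minimal in-memory version under two assumptions: a row that violates an assertion of any pattern it matches can never be kept, and confidence is taken to be 1.0 when the support set is empty. The names `matches`, `support_and_confidence`, `ROWS`, and `TABLEAU` are illustrative only.

```python
from collections import Counter, defaultdict

WILDCARD = "__"  # the "don't care" entry used in the slides' pattern tableaux


def matches(values, pattern):
    """True if every pattern entry is a wildcard or equals the row's value."""
    return all(p == WILDCARD or p == v for v, p in zip(values, pattern))


def support_and_confidence(rows, tableau, n_x):
    """Exact support and confidence of the CFD (X -> Y, Tp).

    The first n_x fields of each row (and each pattern) form X; the rest form Y.
    """
    valid = defaultdict(Counter)  # group x -> counts of consequents that meet the assertions
    support_size = 0
    for row in rows:
        x, y = tuple(row[:n_x]), tuple(row[n_x:])
        hits = [p for p in tableau if matches(x, p[:n_x])]
        if not hits:
            continue  # row is outside the support set
        support_size += 1
        # Assumption: the row is a keeper candidate only if it satisfies the
        # assertions of every pattern whose conditions it matches.
        if all(matches(y, p[n_x:]) for p in hits):
            valid[x][y] += 1
    # Keepers: within each group, the rows carrying the most common valid consequent.
    kept = sum(max(counts.values()) for counts in valid.values())
    support = support_size / len(rows)
    confidence = kept / support_size if support_size else 1.0  # assumed convention
    return support, confidence


# The eight rows and two patterns from the "Confidence of a CFD" slide.
ROWS = [
    ("21-30", "Masters", "Schoolteacher", "SG2"),
    ("21-30", "Masters", "Schoolteacher", "SG2"),
    ("21-30", "Masters", "Schoolteacher", "SG1"),
    ("31-40", "Masters", "Accountant", "SG2"),
    ("31-40", "Masters", "Accountant", "SG1"),
    ("31-40", "PhD", "Professor", "SG4"),
    ("31-40", "PhD", "Professor", "SG4"),
    ("41-50", "PhD", "Professor", "SG3"),
]
TABLEAU = [
    ("__", "Masters", "Schoolteacher", "__"),
    ("__", "PhD", "Professor", "SG4"),
]

support, confidence = support_and_confidence(ROWS, TABLEAU, n_x=3)
print(round(support, 2), round(confidence, 2))  # 0.75 0.67
```

This reproduces supp = 6/8 and C = (2+2)/6 from the slides. It holds one counter per group, which is exactly the memory cost the streaming algorithms in the talk are designed to avoid.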
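The two-pass scheme from the "Two-Pass Solution" slide can be sketched as follows for the Fixed CFD case with no assertions. This is a simplified illustration, not the authors' implementation: pass 2 counts each tagged group exactly instead of using a per-group reservoir sample or SpaceSaving, and the final mean is taken over sampled rows, so a group tagged twice is counted twice. The names `reservoir_sample`, `two_pass_confidence`, `is_masters`, and `R1` are hypothetical.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Uniform sample of up to k items from a stream (classic Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def two_pass_confidence(rows, in_support, n_x, k, seed=0):
    """Estimate CFD confidence (no assertions) with two scans over rows."""
    rng = random.Random(seed)
    # Pass 1: sample k rows from the support set; each sampled row tags its group.
    sampled = reservoir_sample((r for r in rows if in_support(r)), k, rng)
    tagged = [tuple(r[:n_x]) for r in sampled]
    if not tagged:
        return None  # empty support set: nothing to estimate
    # Pass 2: count consequents for the tagged groups only.
    # Assumption: exact counts stand in for the sampling / heavy-hitter summaries.
    # With no assertions, membership in the support set depends on x alone,
    # so every row of a tagged group is in the support set.
    counts = {x: Counter() for x in set(tagged)}
    for r in rows:
        x = tuple(r[:n_x])
        if x in counts:
            counts[x][tuple(r[n_x:])] += 1
    group_conf = {x: max(c.values()) / sum(c.values()) for x, c in counts.items()}
    # Report the mean of the group confidences, one term per sampled row.
    return sum(group_conf[x] for x in tagged) / len(tagged)


def is_masters(row):
    """Condition of the single pattern (Masters, __, __) on the sampling slide."""
    return row[0] == "Masters"


# R1 from the "Why not simple uniform sampling?" slide with N = 16:
# one large group of N/2 rows and N/4 small groups of two conflicting rows.
R1 = [("Masters", "OC0", "SG0")] * 8 + [
    ("Masters", f"OC{i}", sg) for i in range(1, 5) for sg in ("SG1", "SG2")
]
estimate = two_pass_confidence(R1, is_masters, n_x=2, k=8)
print(estimate)
```

Because pass 1 samples rows rather than groups, a group is tagged with probability proportional to its size, which is what makes the plain mean of group confidences line up with the size-weighted sum on the "Group and Keepers" slide. The true confidence of `R1` is 0.75; the printed estimate varies with `seed` and `k`.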