Agrégation de documents XML probabilistes
(Aggregation of Probabilistic XML Documents)
BDA Namur - 21/10/2009

Serge Abiteboul (1), T.-H. Hubert Chan (2), Evgeny Kharlamov (1,3), Werner Nutt (3), Pierre Senellart (4)
(1) INRIA Saclay – Île-de-France
(2) The University of Hong Kong
(3) Free University of Bozen-Bolzano
(4) Télécom ParisTech

Incomplete Databases

An incomplete database $D$ contains many instances: $D = \{d_1, \dots, d_n, \dots\}$.
Given a query $q(x)$ and a constant $c$:
• $c$ is a certain answer for $q$ if $c \in q(d_i)$ for all $d_i \in D$
• $c$ is a possible answer for $q$ if $c \in q(d_i)$ for some $d_i \in D$
There are many ways to represent incomplete databases.

Probabilistic Databases

An incomplete database $D = \{d_1, \dots, d_n\}$
• with probabilities $\Pr(d_i) > 0$ for each instance
• such that $\Pr(d_1) + \dots + \Pr(d_n) = 1$
A query $q$ returns the constant $c$ with probability $p$ if
$p = \sum_{c \in q(d_i)} \Pr(d_i)$
• Mainly studied in the relational setting
• Imprecise data on the Web motivates probabilistic XML

Personnel Data, Instance 1

IT-personnel
  person
    name: John
    bonus: Laptop: 37, 50
  person
    name: Mary
    bonus: PDA: 30, 44

Personnel Data, Instance 2

IT-personnel
  person
    name: Rick
    bonus: PDA: 25, 50
  person
    name: Mary
    bonus: PDA: 44

Example: Personnel Queries

“What are the names of the IT personnel?”
“What bonuses were paid for the PDA project?”
“What is the sum of bonuses paid to all employees?”
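The definitions of certain and possible answers can be tried out directly on the two personnel instances. The following is a minimal sketch, not part of the talk: the flat (name, project, bonus) encoding and the helper names `certain`, `possible`, `names`, `pda_bonuses` and `total` are assumptions made for illustration, and only the two instances shown above are considered.

```python
# Two instances of the incomplete personnel database, flattened to
# (name, project, bonus) triples. This encoding is an assumption of the sketch.
d1 = [("John", "Laptop", 37), ("John", "Laptop", 50),
      ("Mary", "PDA", 30), ("Mary", "PDA", 44)]
d2 = [("Rick", "PDA", 25), ("Rick", "PDA", 50), ("Mary", "PDA", 44)]
instances = [d1, d2]


def names(d):
    """Names of the IT personnel in instance d."""
    return {name for name, _, _ in d}


def pda_bonuses(d):
    """Bonuses paid for the PDA project in instance d."""
    return {bonus for _, project, bonus in d if project == "PDA"}


def total(d):
    """Sum of all bonuses in instance d, wrapped in a set so that every
    query returns a set of answers."""
    return {sum(bonus for _, _, bonus in d)}


def certain(q, instances):
    """Answers returned by q on every instance."""
    return set.intersection(*(q(d) for d in instances))


def possible(q, instances):
    """Answers returned by q on at least one instance."""
    return set.union(*(q(d) for d in instances))
```

On these two instances, `certain(names, instances)` is `{"Mary"}`, `certain(total, instances)` is empty, and `possible(total, instances)` is `{161, 119}`: the aggregate has no certain answer because it changes whenever any contributing value changes.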
Personnel DB: Certain/Possible Answers

“What are the names of the IT personnel?”
  Mary: certain
  Rick: possible
“What bonuses were paid for the PDA project?”
  44: certain
  15: possible
“What is the sum of bonuses paid to all employees?”
  no certain answer
  161, 119: possible
Aggregate queries depend on the presence of many data items.

If We Had Probabilities …

… we could ask:
• What is the probability that the sum of bonuses = 161?
• What are all possible sums of bonuses, and what is the probability of each? (the distribution of sums of bonuses)
• What is the expected value of the sum of bonuses, and what is its variance? (moments)

The Problem Space

[Figure: the problem space along three dimensions, with "Our focus" marked]
• Probabilistic XML document models: Events; Distributional Nodes (MUX-DET)
• Aggregate function: COUNT, SUM, MIN, COUNTD, AVG
• Query language: Single Path Queries; Tree Pattern Queries; Tree Pattern Queries with Joins

Probabilistic XML: Events [Abiteboul/Senellart]

IT-personnel
  person
    name: John [J] | Rick [¬J]
    bonus: Laptop [J]: 37, 50 | PDA [¬J]: 25, 50
  person
    name: Mary
    bonus: PDA: 30 [J, M], 44 [M], 15 [¬M]

Events (independent) and their probabilities:
  J: John hired for Laptop project, Pr(J) = 0.3
  M: Mary worked overtime, Pr(M) = 0.6

“John was hired, Mary worked overtime”

With J and M both true, the annotated document yields the instance

IT-personnel
  person
    name: John
    bonus: Laptop: 37, 50
  person
    name: Mary
    bonus: PDA: 30, 44

Pr(d1) = Pr(J) × Pr(M) = 0.3 × 0.6
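To make the event semantics concrete, the sketch below enumerates all truth assignments of the independent events J and M and accumulates the probability of each value of the sum of bonuses. It is an illustration under stated assumptions: the list `LEAVES` encodes one reading of the event conditions on the bonus values in the annotated tree (negated conditions were reconstructed from the two instances), and the names `EVENTS`, `LEAVES` and `sum_distribution` are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Independent events and their probabilities, as given on the slide.
EVENTS = {"J": 0.3, "M": 0.6}

# Bonus leaves as (value, condition); a condition maps an event name to the
# truth value it must have. Assumed reading of the annotated tree.
LEAVES = [
    (37, {"J": True}), (50, {"J": True}),      # John's Laptop bonuses
    (25, {"J": False}), (50, {"J": False}),    # Rick's PDA bonuses
    (30, {"J": True, "M": True}),              # Mary, PDA
    (44, {"M": True}),
    (15, {"M": False}),
]


def sum_distribution(events, leaves):
    """Return {sum of bonuses: probability} by enumerating every world."""
    names = list(events)
    dist = defaultdict(float)
    for truth in product([True, False], repeat=len(names)):
        world = dict(zip(names, truth))
        # Events are independent, so a world's probability is a product.
        pr = 1.0
        for e in names:
            pr *= events[e] if world[e] else 1.0 - events[e]
        # Keep the leaves whose conditions hold in this world.
        total = sum(value for value, cond in leaves
                    if all(world[e] == want for e, want in cond.items()))
        dist[total] += pr
    return dict(dist)


dist = sum_distribution(EVENTS, LEAVES)
expected = sum(s * p for s, p in dist.items())
```

Under these assumed conditions the four worlds give the sums 161, 102, 119 and 90 with probabilities 0.18, 0.12, 0.42 and 0.28. Enumeration is exponential in the number of events, so it only serves to illustrate the semantics.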
“John wasn’t hired, Mary worked overtime”

With J false and M true, the event-annotated document yields the instance

IT-personnel
  person
    name: Rick
    bonus: PDA: 25, 50
  person
    name: Mary
    bonus: PDA: 44

Events: J: John hired for Laptop project, Pr(J) = 0.3; M: Mary worked overtime, Pr(M) = 0.6
Pr(d2) = (1 − Pr(J)) × Pr(M) = 0.7 × 0.6

Probabilistic XML: MUX and DET Nodes [Nierman/Jagadish, Kimelfeld/Sagiv]

IT-personnel
  person
    name: MUX (0.3: John | 0.7: Rick)
    bonus: MUX (0.3: Laptop: 37, 50 | 0.7: PDA: 25, 50)
  person
    name: Mary
    bonus: PDA: MUX (0.6: DET (30, 44) | 0.4: 15)

MUX: children represent mutually exclusive choices; choices for different MUX nodes are independent
DET: deterministic nodes; their children are all included together

Probabilistic XML: MUX and DET Nodes (example instance)

Choosing John (0.3), the PDA bonus with values 25 and 50 (0.7), and the DET node with 30 and 44 (0.6) yields an instance with
Pr = 0.3 × 0.7 × 0.6

Probabilistic XML (PXML)

• A PXML document D
  – represents (exponentially) many document instances d
  – each with a probability Pr(d)
• PXML document models
  – CIE: long-distance dependencies
  – MUX-DET: only hierarchical dependencies
  – MUX-DET can be expressed by CIE, but not (concisely) the other way round
Other models can be reduced to the ones above, or behave similarly.

Aggregate Functions

$\alpha$: finite bags of values $\to$ domain
Examples:
• count, countd: finite bags of anything $\to \mathbb{N}$
• sum, avg: finite bags of rational numbers $\to \mathbb{Q}$
Similarly:
min, max, parity, top K, ...

Aggregate Queries

$Q = \alpha(q)$
Two layers:
• nonaggregate query $q(x)$
  – returns the set of nodes $q(d)$ over instance $d$
• aggregate function $\alpha$
  – applied to the labels of the nodes in $q(d)$
  – returns a single value $\alpha(q(d))$

Single Path Queries

Which bonuses have been paid?
• A simple form of tree pattern queries
• Paths of node labels or *, connected by “child” and “descendant” edges
• Return the set of leaf nodes reachable from the root along such a path
[Figure: the query $q_{bonus}$, a path through the labels IT-personnel, bonus, *]

Single Path Aggregate Queries: Examples

• SUM($q_{bonus}$): “What is the sum of all bonuses?”
• MAX($q_{bonus}$): “What is the maximal bonus that was paid?”

Answer Distributions

PXML document $D$ with instances $\{d_1, \dots, d_n\}$
• SUM($q_{bonus}$) returns exactly one number for every $d_i$
• SUM($q_{bonus}$) is a random variable
• SUM($q_{bonus}$) induces a probability distribution over $D$, the answer distribution:
  $f(s) = \sum_{\mathrm{SUM}(q_{bonus})(d_i) = s} \Pr(d_i)$
Notation: SUM($q_{bonus}$)($D$), or $\alpha(q)(D)$ abstractly

Special Case: Document Aggregation

$D$ with instances $\{d_1, \dots, d_n\}$
• Applying $\alpha$ to a regular document $d_i$:
  $\alpha(d_i) := \alpha(\{|\, c \mid c \text{ is a value on a leaf of } d_i \,|\})$
• Applying $\alpha$ to the probabilistic document $D$:
  $\alpha(D)(c) = \sum_{\alpha(d_i) = c} \Pr(d_i)$
  which again yields a distribution $\alpha(D)$

Reduction to Document Aggregation

$\alpha(q)(D) = {}$ ?
Step 1: Compute a smaller PXML document $D' = q(D)$ containing only matching paths (this step depends on the document model and on the query being a simple path query)
Step 2: Apply $\alpha$ to $D'$
Theorem: $\alpha(q)(D) = \alpha(D')$

Applying $q_{bonus}$

[Figure: $q_{bonus}$ applied to the event-annotated personnel document]
Keep only the paths that match; the other branches are dropped. The MUX-DET case is analogous.

Evaluating Single Path Queries/2

[Figure: $q_{bonus}$ applied to the MUX-DET personnel document]

Problems Investigated

PXML document $D$, constant $c$
• Possible Value: decide whether $\Pr(\alpha(D) = c) > 0$
• Probability Computation: compute $\Pr(\alpha(D) = c)$
• Moment Computation: compute $E(\alpha(D)^k)$, where $E$ is the expected value

Aggregation over CIE

                         COUNT     SUM       MIN       COUNTD    AVG
Possible Value           NP-c      NP-c      NP-c      NP-c      NP-c
Probability Computation  in FP^#P  in FP^#P  FP^#P-c   FP^#P-c   FP^#P-c
Moment Computation       P         P         FP^#P-c   FP^#P-c   FP^#P-c

Aggregation over CIE/2

• Possible Value: “Too much propositional logic present”
• Probability Computation: cannot be easier …
• Moment Computation:
  – Difficult for MIN, COUNTD, AVG
  – Easy for COUNT and SUM: “Moments are sums, moments of COUNT and SUM are sums of sums, which can be rearranged …”

Aggregation over MUX-DET

                         COUNT  SUM   MIN  COUNTD   AVG
Possible Value           P      NP-c  P    NP-c     in NP
Probability Computation  P      (*)   P    FP^#P-c  FP^#P-c
Moment Computation       P      P     P    P        P
(*) P in |input| + |distribution|

COUNT, SUM, MIN are Easy ...
… because they allow for divide-and-conquer evaluation:
$\mathrm{SUM}\,\{|\, a, b, c, d \,|\} = \mathrm{SUM}\,\{|\, a, b \,|\} + \mathrm{SUM}\,\{|\, c, d \,|\}$

$\alpha$ is a monoid aggregate function if
$\alpha(\{|\, a_1, \dots, a_n \,|\}) = \alpha(\{|\, a_1 \,|\}) \oplus \dots \oplus \alpha(\{|\, a_n \,|\})$
for some commutative monoid $(M, \oplus)$ and all $a_1, \dots, a_n$ in $M$
Examples:
• Monoid: count, sum, min, parity, top K
• Not monoid: countd, avg

Convex Sums and Convolutions

If $\alpha$ is a monoid function, answer distributions can be computed bottom up, using two operations:
• MUX node whose children $D_1$, $D_2$ are chosen with probabilities $p$, $q$: convex sum
  $\alpha(\mathrm{MUX}) = p \cdot \alpha(D_1) + q \cdot \alpha(D_2)$
• Other node with children $D_1$, $D_2$: convolution (which depends on the monoid)
  $\alpha(\mathrm{node}) = \alpha(D_1) * \alpha(D_2)$

Convolution of Distributions

$(M, \oplus)$ monoid; $\alpha(D_1)$, $\alpha(D_2)$ distributions of subdocuments
$(\alpha(D_1) * \alpha(D_2))(c) = \sum_{c_1 \oplus c_2 = c} \alpha(D_1)(c_1) \cdot \alpha(D_2)(c_2)$

Approximating Query Answers

Over CIE, probability and moment computation can be hard. How good are Monte Carlo methods?
Classical results (Hoeffding) imply: to achieve
$|E(\alpha(D)^k) - \mathrm{Estimate}| < \varepsilon$ with probability $1 - \delta$,
at most $O(R^{2k} \varepsilon^{-2} \log 1/\delta)$ samples are needed, where $R = \max |\alpha(d)|$.
Consequence: given $\varepsilon$ and $\delta$, at most quadratically many samples are needed for $E(\mathrm{COUNTD}(D))$.
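The sampling bound can be illustrated with a small Monte Carlo estimator for $E(\mathrm{COUNTD}(D)^k)$ over an event-annotated document. This is a sketch under assumptions, not the authors' implementation: the constants come from the standard two-sided Hoeffding inequality (the slide only states the asymptotic bound), the event conditions in `LEAVES` are an assumed reading of the personnel example, and all function names are hypothetical.

```python
import math
import random

EVENTS = {"J": 0.3, "M": 0.6}
# Bonus leaves as (value, condition on events); assumed encoding.
LEAVES = [
    (37, {"J": True}), (50, {"J": True}),
    (25, {"J": False}), (50, {"J": False}),
    (30, {"J": True, "M": True}), (44, {"M": True}), (15, {"M": False}),
]


def hoeffding_samples(R, k, eps, delta):
    """Number of samples sufficient for |estimate - E(X^k)| < eps with
    probability at least 1 - delta when 0 <= X <= R (two-sided Hoeffding)."""
    return math.ceil(R ** (2 * k) * math.log(2 / delta) / (2 * eps ** 2))


def sample_world(events, rng):
    """Draw an independent truth value for every event."""
    return {e: rng.random() < p for e, p in events.items()}


def countd(world, leaves):
    """Number of distinct leaf values present in the sampled world."""
    return len({value for value, cond in leaves
                if all(world[e] == want for e, want in cond.items())})


def estimate_moment(events, leaves, k, n, rng):
    """Average of COUNTD^k over n sampled worlds."""
    return sum(countd(sample_world(events, rng), leaves) ** k
               for _ in range(n)) / n


R = len(LEAVES)  # crude upper bound on COUNTD for this document
n = hoeffding_samples(R, k=1, eps=0.1, delta=0.05)
estimate = estimate_moment(EVENTS, LEAVES, k=1, n=n, rng=random.Random(0))
```

For COUNTD, $R$ is bounded by the number of leaves, so for $k = 1$ the sample count grows quadratically in the document size, which matches the consequence stated above. For this tiny document the exact expectation is 3.18 (four distinct values when J and M both hold, three otherwise), so the estimate can be checked against it.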
Probabilistic Aggregation: Related Work

• Tree pattern queries over MUX-DET with HAVING constraints [Cohen/Kimelfeld/Sagiv]
• Conjunctive queries with HAVING constraints over relational probabilistic databases [Re/Suciu]
• Work on various special topics in the relational setting
  – probabilistic data streams
  – uncertain schema mappings

Aggregates over PXML: Conclusion

First results of an ongoing project
• Map of the problem space
• Largely complete investigation for single path queries:
  – Intractability for CIE
  – Hierarchical dependencies in MUX-DET can be exploited for monoid aggregation functions
Some results carry over to other models, e.g.,
• Uncertain schema mappings (Dong/Halevy/Yu)
Current work:
• richer query languages, continuous distributions on leaves
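As an illustration of how hierarchical dependencies in MUX-DET can be exploited for a monoid aggregate, the sketch below computes the answer distribution of SUM bottom up, using a convex sum at MUX nodes and a convolution at all other inner nodes. It is a minimal sketch under assumptions: the nested-tuple encoding, the function names, and the example document `DOC` (one reading of the bonus part of the MUX-DET personnel document) are hypothetical, and MUX probabilities are assumed to sum to 1.

```python
from collections import defaultdict


def convolve(f, g):
    """Convolution of two distributions over the monoid (numbers, +)."""
    out = defaultdict(float)
    for a, pa in f.items():
        for b, pb in g.items():
            out[a + b] += pa * pb
    return dict(out)


def convex_sum(weighted):
    """Mixture of distributions; weighted is a list of (probability, dist)."""
    out = defaultdict(float)
    for p, f in weighted:
        for value, pv in f.items():
            out[value] += p * pv
    return dict(out)


def sum_dist(node):
    """Distribution of SUM over the leaves below node.

    Node encoding (assumed): ("leaf", value), ("mux", [(p, child), ...]),
    or ("det", [child, ...]) for DET and ordinary nodes."""
    kind = node[0]
    if kind == "leaf":
        return {node[1]: 1.0}
    if kind == "mux":
        return convex_sum([(p, sum_dist(child)) for p, child in node[1]])
    dist = {0: 1.0}  # neutral element of the monoid, with probability 1
    for child in node[1]:
        dist = convolve(dist, sum_dist(child))
    return dist


def leaves(*values):
    return ("det", [("leaf", v) for v in values])


# Bonus part of the MUX-DET personnel document (assumed reading).
DOC = ("det", [
    ("mux", [(0.3, leaves(37, 50)), (0.7, leaves(25, 50))]),
    ("mux", [(0.6, leaves(30, 44)), (0.4, ("leaf", 15))]),
])

dist = sum_dist(DOC)
```

With this encoding the two MUX nodes are independent, so the result has the four sums 161, 102, 149 and 90 with probabilities 0.18, 0.12, 0.42 and 0.28. The work per node is bounded by the size of the intermediate distributions, which is consistent with the "P in |input| + |distribution|" entry for SUM over MUX-DET.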