```
Aggregation of Probabilistic XML Documents

Serge Abiteboul 1, T.-H. Hubert Chan 2, Evgeny Kharlamov 1,3
Werner Nutt 3, Pierre Senellart 4
1 INRIA Saclay – Île-de-France
2 The University of Hong Kong
3 Free University of Bozen-Bolzano
4 Télécom ParisTech

Incomplete Databases

An incomplete database D contains many instances
D = { d1,..., dn,...}

Query q(x), constant c

• c is a certain answer for q if c ∈ q(di) for all di ∈ D

• c is a possible answer for q if c ∈ q(di) for some di ∈ D

Many ways to represent incomplete databases

Aggregation of Probabilistic XML Documents   BDA Namur - 21/10/2009
Probabilistic Databases
Incomplete database D = { d1,..., dn }
• with probabilities Pr(di) > 0 for each instance
• such that Pr(d1) + ... + Pr(dn) = 1

Query q returns constant c with probability p if
p = Σ_{di : c ∈ q(di)} Pr(di)

• Mainly studied in the relational setting
• Imprecise data on the Web ⇒ Probabilistic XML

Personnel Data, Instance 1

IT-personnel
  person
    name:  John
    bonus: Laptop (37, 50)
  person
    name:  Mary
    bonus: PDA (30, 44)

Personnel Data, Instance 2

IT-personnel
  person
    name:  Rick
    bonus: PDA (25, 50)
  person
    name:  Mary
    bonus: PDA (44)

Example: Personnel Queries

“What are the names of the IT personnel?”

“What bonuses were paid for the PDA project?”

“What is the sum of bonuses paid to all employees?”

“What are the names of the IT personnel?”
Mary: certain           Rick: possible

“What bonuses were paid for the PDA project?”
44: certain     15: possible

“What is the sum of bonuses paid to all employees?”
no certain answer       161, 119: possible

⇒ Aggregate queries depend on the presence of many data items

Distribution of Sums of Bonuses

• What is the probability that the sum of bonuses = 161?
• What are all possible sums of bonuses, and what is the probability of each?
• What is the expected value of the sum of bonuses, and what is its variance? (Moments)

The Problem Space

Three dimensions:

• Probabilistic XML document models
  – Events
  – Distributional Nodes (MUX-DET)

• Aggregate function
  – COUNT, SUM, MIN, COUNTD, AVG

• Query language
  – Single Path Queries (our focus)
  – Tree Pattern Queries
  – Tree Pattern Queries with Joins

Probabilistic XML: Events   [Abiteboul/Senellart]

IT-personnel
  person
    name:  John [J] | Rick [¬J]
    bonus: Laptop [J] (37, 50) | PDA [¬J] (25, 50)
  person
    name:  Mary
    bonus: PDA (30 [J, M], 44 [M], 15 [¬M])

Independent events and their probabilities:
J: John hired for Laptop project   Pr(J) = 0.3
M: Mary worked overtime            Pr(M) = 0.6

“John was hired, Mary worked overtime”
IT-personnel
  person
    name:  John
    bonus: Laptop (37, 50)
  person
    name:  Mary
    bonus: PDA (30, 44)

J: John hired for Laptop project   Pr(J) = 0.3
M: Mary worked overtime            Pr(M) = 0.6

Pr(d1) = 0.3 × 0.6 = 0.18

“John wasn’t hired, Mary worked overtime”
IT-personnel
  person
    name:  Rick
    bonus: PDA (25, 50)
  person
    name:  Mary
    bonus: PDA (44)

J: John hired for Laptop project   Pr(J) = 0.3
M: Mary worked overtime            Pr(M) = 0.6

Pr(d2) = 0.7 × 0.6 = 0.42

Probabilistic XML: MUX and DET Nodes   [Nierman/Kimelfeld/Sagiv]

IT-personnel
  person
    name
      MUX
        0.3: John
        0.7: Rick
    bonus
      MUX
        0.3: Laptop (37, 50)
        0.7: PDA (25, 50)
  person
    name: Mary
    bonus
      PDA
        MUX
          0.6: DET (30, 44)
          0.4: 15

MUX: Children represent mutually exclusive choices;
     choices for different MUX nodes are independent

DET: Deterministic nodes; children are combined

Probabilistic XML: MUX and DET Nodes
Example instance: choose John (0.3), PDA with 25 and 50 (0.7),
and the DET node with 30 and 44 (0.6)

IT-personnel
  person
    name:  John
    bonus: PDA (25, 50)
  person
    name:  Mary
    bonus: PDA (30, 44)

Pr = 0.3 × 0.7 × 0.6 = 0.126

Probabilistic XML (PXML)
• A PXML document D
  – represents (exponentially) many document instances d
  – each with a probability Pr(d)

• PXML document models
  – CIE (conjunctions of independent events): long-distance dependencies
  – MUX-DET: only hierarchical dependencies
  – MUX-DET can be expressed by CIE, but not (concisely) the other way round

Other models can be reduced to the ones above, or behave similarly

Aggregate Functions

α: finite bags of values → domain

Examples:
• count, countd: finite bags of anything → N
• sum, avg: finite bags of rational numbers → Q

Similarly: min, max, parity, top K, ...

Aggregate Queries

Q = (q)

Two Layers
• nonaggregate query q(x)
– returns set of nodes q(d) over instance d

• aggregate function 
– applied to the labels of nodes in q(d)
– returns single value
(q(d))

Single Path Queries

Simple form of tree pattern queries

Paths of node labels or *, connected by “child” and “descendant” edges

Return the set of leaf nodes reachable from the root along such a path

Example: qbonus, “Which bonuses have been paid?”
  IT-personnel
    bonus
      *

Single Path Aggregate Queries: Examples

• SUM(qbonus)

“What is the sum of all bonuses?”

• MAX(qbonus)

“What is the maximal bonus that was paid?”

PXML document D, instances {d1,…, dn}

SUM(qbonus) returns exactly one number for every di
⇒ SUM(qbonus) is a random variable
⇒ SUM(qbonus) induces a probability distribution over D

f(s) = Σ_{di : SUM(qbonus)(di) = s} Pr(di)

Notation: SUM(qbonus)(D), or α(q)(D) abstractly

Special Case: Document Aggregation
D with instances {d1,…, dn}

• Applying α to a regular document di:

  α(di) := α({| c | c is a value on a leaf of di |})

• Applying α to the probabilistic document D:

  α(D)(c) = Σ_{di : α(di) = c} Pr(di)

  yields again a distribution α(D)

Reduction to Document Aggregation
α(q)(D) = ?

Step 1: Compute a smaller PXML document D' = q(D)
        containing only matching paths

Step 2: Apply α to D'

Note: depends on document models and simple path queries

Theorem:
α(q)(D) = α(D')

Applying qbonus
qbonus( [event-annotated IT-personnel document] )
  = keep only the paths that match

… analogous for MUX-DET

Evaluating Single Path Queries/2
qbonus( [MUX-DET IT-personnel document] )

Problems Investigated

PXML document D, constant c

• Possible Value: Decide Pr(α(D) = c) > 0

• Probability Computation: Compute Pr(α(D) = c)

• Moment Computation: Compute E(α(D)^k)

E is “expected value”

Aggregation over CIE

                          COUNT      SUM        MIN        COUNTD     AVG
Possible Value            NP-c       NP-c       NP-c       NP-c       NP-c
Probability Computation   in FP^#P   in FP^#P   FP^#P-c    FP^#P-c    FP^#P-c
Moment Computation        P          P          FP^#P-c    FP^#P-c    FP^#P-c

Aggregation over CIE/2
• Possible Value: “Too much propositional logic present”

• Probability Computation: cannot be easier …

• Moment Computation:
– Difficult for MIN, COUNTD, AVG
– Easy for COUNT and SUM:

“Moments are sums,
moments of COUNT and SUM are sums of sums,
which can be rearranged …”

Aggregation over MUX-DET

                          COUNT      SUM        MIN        COUNTD     AVG
Possible Value            P          NP-c       P          NP-c       in NP
Probability Computation   P          P (*)      P          FP^#P-c    FP^#P-c
Moment Computation        P          P          P          P          P

(*) polynomial in |input| + |distribution|

COUNT, SUM, MIN are Easy ...
… because they allow for divide and conquer evaluation:
SUM {| a,b,c,d |} = SUM {| a,b |} + SUM {| c,d |}

α is a monoid aggregate function if
α({| a1,..., an |}) = α({| a1 |}) ⊕ ... ⊕ α({| an |})
for some commutative monoid (M, ⊕) and all a1,..., an in M

Examples:
• count, sum, min, parity, top K
Non-examples:
• countd, avg

Convex Sums and Convolutions

If  is a monoid function, answer distributions
can be computed bottom up, using two operations:

MUX
p           q
Convex Sum:                (                      ) = p (           ) + q (         )

depends on
Other
node
the monoid

Convolution:               (                      ) = (         )  (        )

Convolution of Distributions

(M, ) monoid

(D1), (D2) distributions of subdocuments

(((D1)  (D2)) (c) =                        (D1)(c1) (D2)(c2)
c1  c2 = c

Over CIE, probability and moment computation can be hard.
How good are Monte-Carlo methods?

Classical results (Hoeffding) imply: To achieve
| E(α(D)^k) - Estimate | < ε with probability 1 - δ,
at most O(R^(2k) · ε^(-2) · log(1/δ)) samples are needed,
where R = max |α(d)|.

Consequence: Given ε and δ, at most quadratically many
samples are needed for E(COUNTD(D)).

Probabilistic Aggregation: Related Work

• Tree pattern queries over MUX-DET
with HAVING constraints [Cohen/Kimelfeld/Sagiv]

• Conjunctive queries with HAVING constraints over
relational probabilistic databases [Re/Suciu]

• Work on various special topics in the relational setting
– probabilistic data streams
– uncertain schema mappings

Aggregates over PXML: Conclusion
First results of an ongoing project
• Map of the problem space
• Largely complete investigation for single path queries:
– Intractability for CIE
– Hierarchical dependencies in MUX-DET can be
exploited for monoid aggregation functions
Some results carry over to other models, e.g.,
• Uncertain schema mappings (Dong/Halevy/Yu)

Current work:
• richer query languages, continuous distributions on leaves


```
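The bottom-up evaluation from the "Convex Sums and Convolutions" slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`leaf`, `convolve`, `convex_sum`, `combine`) are hypothetical, distributions are modelled as plain dictionaries from value to probability, and the example tree is the bonus part of the MUX-DET document as read from the slides.

```python
from collections import defaultdict
import operator


def leaf(value):
    """Distribution of a leaf: the value occurs with probability 1."""
    return {value: 1.0}


def convolve(d1, d2, op=operator.add):
    """Convolution of two distributions under the monoid operation `op`."""
    out = defaultdict(float)
    for c1, p1 in d1.items():
        for c2, p2 in d2.items():
            out[op(c1, c2)] += p1 * p2
    return dict(out)


def convex_sum(weighted):
    """MUX node: `weighted` is a list of (probability, child distribution)."""
    out = defaultdict(float)
    for p, dist in weighted:
        for c, pc in dist.items():
            out[c] += p * pc
    return dict(out)


def combine(children, op=operator.add, identity=0):
    """DET or ordinary node: convolve all child distributions."""
    dist = {identity: 1.0}
    for child in children:
        dist = convolve(dist, child, op)
    return dist


# Bonus subtrees of the MUX-DET example (structure assumed from the slides).
first = convex_sum([
    (0.3, combine([leaf(37), leaf(50)])),   # Laptop
    (0.7, combine([leaf(25), leaf(50)])),   # PDA
])
mary = convex_sum([
    (0.6, combine([leaf(30), leaf(44)])),   # DET node
    (0.4, leaf(15)),
])
total = combine([first, mary])              # distribution of SUM

# The same code handles MIN by swapping the monoid (min, +infinity).
inf = float("inf")
first_min = convex_sum([
    (0.3, combine([leaf(37), leaf(50)], min, inf)),
    (0.7, combine([leaf(25), leaf(50)], min, inf)),
])
mary_min = convex_sum([
    (0.6, combine([leaf(30), leaf(44)], min, inf)),
    (0.4, leaf(15)),
])
min_total = combine([first_min, mary_min], min, inf)
```

The size of each intermediate dictionary is the number of distinct values reachable at that node, which is why the cost for SUM is stated in terms of |input| + |distribution| rather than |input| alone.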
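For the event-based (CIE) model, the distribution f(s) of SUM(qbonus) and its expected value can be checked by brute force over all truth assignments of the events. The sketch below is an assumption-laden illustration: the names `EVENT_PROB`, `LEAVES`, `sum_distribution`, and `expected_value` are hypothetical, and the negated conditions on the leaves are inferred from the two instances shown in the slides rather than stated there. Enumeration is exponential in the number of events, which is consistent with the hardness results for CIE.

```python
from collections import defaultdict
from itertools import product

# Event probabilities as given on the slides.
EVENT_PROB = {"J": 0.3, "M": 0.6}

# Each leaf is (value, condition); a condition maps an event name to the
# truth value it must have. Negated conditions are an assumption.
LEAVES = [
    (37, {"J": True}), (50, {"J": True}),      # Laptop bonuses
    (25, {"J": False}), (50, {"J": False}),    # first person's PDA bonuses
    (30, {"J": True, "M": True}),              # Mary's PDA bonuses
    (44, {"M": True}),
    (15, {"M": False}),
]


def sum_distribution(event_prob, leaves):
    """Enumerate all worlds and accumulate Pr(world) per resulting sum."""
    names = sorted(event_prob)
    dist = defaultdict(float)
    for bits in product([True, False], repeat=len(names)):
        world = dict(zip(names, bits))
        pr = 1.0
        for name in names:
            pr *= event_prob[name] if world[name] else 1.0 - event_prob[name]
        total = sum(
            value
            for value, cond in leaves
            if all(world[e] == wanted for e, wanted in cond.items())
        )
        dist[total] += pr
    return dict(dist)


def expected_value(dist):
    """First moment of a finite distribution given as {value: probability}."""
    return sum(value * p for value, p in dist.items())


DIST = sum_distribution(EVENT_PROB, LEAVES)
EXPECTED_SUM = expected_value(DIST)
```

Under these assumptions the sums 161 and 119 from the "Personnel Queries" slide appear with probabilities 0.3 × 0.6 and 0.7 × 0.6, matching Pr(d1) and Pr(d2).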