Aggregation of Probabilistic XML Documents

Serge Abiteboul (1), T.-H. Hubert Chan (2), Evgeny Kharlamov (1,3),
Werner Nutt (3), Pierre Senellart (4)
       (1) INRIA Saclay – Île-de-France
       (2) The University of Hong Kong
       (3) Free University of Bozen-Bolzano
       (4) Télécom ParisTech

Incomplete Databases

An incomplete database D contains many instances
  D = { d1,..., dn,...}


Query q(x), constant c

• c is a certain answer for q if c ∈ q(di) for all di ∈ D

• c is a possible answer for q if c ∈ q(di) for some di ∈ D
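
A minimal Python sketch of the two definitions, assuming (for illustration only) that each instance di is represented by the set of constants q(di) it returns. The function names and the sample data are hypothetical.

```python
from functools import reduce

def certain_answers(answer_sets):
    """Constants returned by q on every instance (intersection)."""
    return reduce(set.intersection, answer_sets) if answer_sets else set()

def possible_answers(answer_sets):
    """Constants returned by q on at least one instance (union)."""
    return set().union(*answer_sets)

# Illustrative answer sets q(d1), q(d2) for a query asking for names
answers = [{"John", "Mary"}, {"Rick", "Mary"}]
print(certain_answers(answers))
print(possible_answers(answers))
```

Every certain answer is also a possible answer, so the first set is always contained in the second.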


Many ways to represent incomplete databases


Agrégation de documents XML probabilistes   BDA Namur - 21/10/2009   2
Probabilistic Databases
Incomplete database D = { d1,..., dn }
• with probabilities Pr(di) > 0 for each instance
• such that Pr(d1) + ... + Pr(dn) = 1

Query q returns constant c with probability p if
         p = \sum_{d_i \,:\, c \in q(d_i)} \Pr(d_i)
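
The sum above can be sketched in Python as follows, assuming each instance is given as a (probability, answer set) pair. The representation and the numbers are illustrative and not taken from the slides.

```python
def answer_probability(c, instances):
    """Sum Pr(di) over the instances di whose answer set q(di) contains c.

    `instances` is a list of (probability, answer_set) pairs.
    """
    return sum(p for p, answers in instances if c in answers)

# Illustrative instances; the probabilities add up to 1
instances = [(0.5, {"a", "b"}), (0.3, {"b"}), (0.2, {"c"})]
print(answer_probability("b", instances))
```

A constant is a possible answer exactly when this probability is positive, and a certain answer exactly when it is 1.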


• Mainly studied in the relational setting
• Imprecise data on the Web ⇒ Probabilistic XML


Personnel Data, Instance 1

IT-personnel
  person
    name: John
    bonus
      Laptop: 37, 50
  person
    name: Mary
    bonus
      PDA: 30, 44

Personnel Data, Instance 2

IT-personnel
  person
    name: Rick
    bonus
      PDA: 25, 50
  person
    name: Mary
    bonus
      PDA: 44

Example: Personnel Queries

“What are the names of the IT personnel?”

“What bonuses were paid for the PDA project?”

“What is the sum of bonuses paid to all employees?”

Personnel DB: Certain/Possible Answers
“What are the names of the IT personnel?”
  Mary: certain           Rick: possible

“What bonuses were paid for the PDA project?”
  44: certain     15: possible

“What is the sum of bonuses paid to all employees?”
  no certain answer       161, 119: possible

⇒ Aggregate queries depend on the presence of many data items at once

If We Had Probabilities …

… we could ask

• What is the probability that the sum of bonuses = 161?
• What are all possible sums of bonuses?
     And what is each one’s probability?
     (together: the distribution of the sums of bonuses)
• What is the expected value of the sum of bonuses?
     And what is the variance?
     (the moments of the distribution)

The Problem Space

Three dimensions:

• Probabilistic XML document models
   – Events
   – Distributional nodes (MUX-DET)

• Aggregate function
   – COUNT, SUM, MIN, COUNTD, AVG

• Query language
   – Single path queries
   – Tree pattern queries
   – Tree pattern queries with joins

Our focus: single path queries, over both document models

Probabilistic XML: Events                    [Abiteboul/Senellart]

Nodes carry conditions over events (shown in brackets):

IT-personnel
  person
    name
      John      [J]
      Rick      [¬J]
    bonus
      Laptop    [J]:   37, 50
      PDA       [¬J]:  25, 50
  person
    name
      Mary
    bonus
      PDA
        30      [J, M]
        44      [M]
        15      [¬M]

Independent events and their probabilities:
  J: John hired for Laptop project      Pr(J) = 0.3
  M: Mary worked overtime               Pr(M) = 0.6

“John was hired, Mary worked overtime”

With J and M both true, exactly the nodes whose conditions hold are kept:

IT-personnel
  person
    name: John
    bonus
      Laptop: 37, 50
  person
    name: Mary
    bonus
      PDA: 30, 44

  J: John hired for Laptop project      Pr(J) = 0.3
  M: Mary worked overtime               Pr(M) = 0.6

  Pr(d1) = 0.3 x 0.6

“John wasn’t hired, Mary worked overtime”

With J false and M true, exactly the nodes whose conditions hold are kept:

IT-personnel
  person
    name: Rick
    bonus
      PDA: 25, 50
  person
    name: Mary
    bonus
      PDA: 44

  J: John hired for Laptop project      Pr(J) = 0.3
  M: Mary worked overtime               Pr(M) = 0.6

  Pr(d2) = 0.7 x 0.6

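
The construction of instances from events can be sketched in Python by enumerating all truth assignments. The encoding of the document as (value, condition) pairs for the bonus leaves is an assumption of this sketch, and the condition on the leaf 15 is read as "M is false", since 15 is absent from both instances in which Mary worked overtime.

```python
from itertools import product
from collections import defaultdict

# Event probabilities from the slides
PR = {"J": 0.3, "M": 0.6}

# Bonus leaves with their conditions (event -> required truth value).
# This flat encoding is an assumption of the sketch.
LEAVES = [
    (37, {"J": True}), (50, {"J": True}),      # Laptop bonuses
    (25, {"J": False}), (50, {"J": False}),    # PDA bonuses of the first person
    (30, {"J": True, "M": True}),
    (44, {"M": True}),
    (15, {"M": False}),
]

def sum_distribution(pr, leaves):
    """Enumerate all worlds and collect Pr(sum of bonuses = s)."""
    events = sorted(pr)
    dist = defaultdict(float)
    for values in product([True, False], repeat=len(events)):
        world = dict(zip(events, values))
        p = 1.0
        for e in events:
            p *= pr[e] if world[e] else 1.0 - pr[e]
        total = sum(v for v, cond in leaves
                    if all(world[e] == b for e, b in cond.items()))
        dist[total] += p
    return dict(dist)

dist = sum_distribution(PR, LEAVES)
for s in sorted(dist):
    print(s, round(dist[s], 4))
```

The sums 161 and 119 of the two instances shown above receive the probabilities 0.3 x 0.6 and 0.7 x 0.6. The enumeration is exponential in the number of events, so it only serves to make the semantics concrete.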
Probabilistic XML: MUX and DET Nodes      [Nierman/Jagadish, Kimelfeld/Sagiv]

IT-personnel
  person
    name
      MUX
        0.3: John
        0.7: Rick
    bonus
      MUX
        0.3: Laptop: 37, 50
        0.7: PDA: 25, 50
  person
    name: Mary
    bonus
      PDA
        MUX
          0.6: DET: 30, 44
          0.4: 15

  MUX   Children represent mutually exclusive choices;
        choices for different MUX nodes are independent

  DET   Deterministic nodes; children are combined

Probabilistic XML: MUX and DET Nodes (one instance)

Choosing John (0.3), the PDA bonus (0.7) and the DET branch (0.6) yields:

IT-personnel
  person
    name: John
    bonus
      PDA: 25, 50
  person
    name: Mary
    bonus
      PDA: 30, 44

  Pr = 0.3 x 0.7 x 0.6

Probabilistic XML (PXML)
• A PXML document D
   – represents (exponentially) many document instances d
   – each with a probability Pr(d)

• PXML document models
   – CIE (conjunctions of independent events): long-distance dependencies
   – MUX-DET: only hierarchical dependencies
   – MUX-DET can be expressed by CIE,
             but not (concisely) the other way round

Other models can be reduced to the ones above,
                            or behave similarly

Aggregate Functions

         α : finite bags of values → domain

Examples:
• count, countd: finite bags of anything → ℕ
• sum, avg: finite bags of rational numbers → ℚ

Similarly: min, max, parity, top K, ...

Aggregate Queries

                               Q = α(q)

Two Layers
• nonaggregate query q(x)
   – returns set of nodes q(d) over instance d

• aggregate function α
   – applied to the labels of nodes in q(d)
   – returns single value
                                α(q(d))

Single Path Queries

Simple form of tree pattern queries

Paths of node labels or *,
  connected by “child” and
  “descendant” edges

Return the set of leaf nodes
  reachable from the root
  along such a path

Example: qbonus, “Which bonuses have been paid?”
  path of labels: IT-personnel, bonus, *

Single Path Aggregate Queries: Examples


• SUM(qbonus)

         “What is the sum of all bonuses?”


• MAX(qbonus)

         “What is the maximal bonus that was paid?”

Answer Distributions
PXML document D, instances {d1,…, dn}

SUM(qbonus) returns exactly one number for every di
   ⇒ SUM(qbonus) is a random variable
   ⇒ SUM(qbonus) induces a probability distribution over D

       f(s) = \sum_{d_i \,:\, \mathrm{SUM}(q_{bonus})(d_i) = s} \Pr(d_i),      the answer distribution

Notation:           SUM(qbonus)(D)                   or α(q)(D) abstractly

Special Case: Document Aggregation
D with instances {d1,…, dn}

• Applying α to a regular document di:

      α(di) := α({| c | c is a value on a leaf of di |})

• Applying α to the probabilistic document D:

      \alpha(D)(c) = \sum_{d_i \,:\, \alpha(d_i) = c} \Pr(d_i)

  yields again a distribution α(D)

Reduction to Document Aggregation
                               α(q)(D) = ?
Step 1:             Compute a smaller PXML document

                               D' = q(D)

                    containing only matching paths
                    (depends on document models and simple path queries)

Step 2:             Apply α to D'

Theorem:
                          α(q)(D) = α(D')

Applying qbonus

qbonus(D), for the event document D of “Probabilistic XML: Events”
                   = keep only the paths that match

                                                       … analogous for MUX-DET

Evaluating Single Path Queries/2

[Figure: qbonus applied to the MUX-DET document of “Probabilistic XML: MUX and DET Nodes”]

Problems Investigated

PXML document D, constant c

• Possible Value: Decide Pr(α(D) = c) > 0

• Probability Computation: Compute Pr(α(D) = c)

• Moment Computation: Compute E(α(D)^k)

                                            E is “expected value”

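
Once an answer distribution is available explicitly, the three problems reduce to simple lookups and sums. The following Python sketch assumes the distribution is given as a dictionary from values to probabilities; the function names and the sample distribution are illustrative, not taken from the slides. The hard part, obtaining the distribution from D, is not shown here.

```python
def probability(dist, c):
    """Pr(aggregate value = c) for a distribution {value: probability}."""
    return dist.get(c, 0.0)

def is_possible(dist, c):
    """Possible value problem: is Pr(aggregate value = c) > 0?"""
    return probability(dist, c) > 0

def moment(dist, k):
    """k-th moment: the expected value of the k-th power of the aggregate."""
    return sum(p * value ** k for value, p in dist.items())

# Illustrative distribution: value 1 or 3, each with probability 0.5
dist_example = {1: 0.5, 3: 0.5}
mean = moment(dist_example, 1)
variance = moment(dist_example, 2) - mean ** 2
print(is_possible(dist_example, 3), mean, variance)
```

The variance is obtained from the first two moments, which is why moment computation is stated for general k.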
Aggregation over CIE

                           COUNT      SUM        MIN        COUNTD     AVG
Possible Value             NP-c       NP-c       NP-c       NP-c       NP-c
Probability Computation    in FP^#P   in FP^#P   FP^#P-c    FP^#P-c    FP^#P-c
Moment Computation         P          P          FP^#P-c    FP^#P-c    FP^#P-c

Aggregation over CIE/2
• Possible Value: “Too much propositional logic present”

• Probability Computation: cannot be easier …

• Moment Computation:
  – Difficult for MIN, COUNTD, AVG
  – Easy for COUNT and SUM:

       “Moments are sums,
        moments of COUNT and SUM are sums of sums,
       which can be rearranged …”
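
For the first moment of SUM, the rearrangement is linearity of expectation: the expected sum is the sum, over all leaves, of the leaf value times the probability that the leaf is present. The Python sketch below assumes each leaf condition is a conjunction of literals over independent events, so its probability is a product. The (value, condition) encoding is hypothetical, and the condition on the leaf 15 is read as "M is false".

```python
def presence_probability(cond, pr):
    """Probability of a conjunction of literals over independent events.

    `cond` maps each event to the truth value it must take.
    """
    p = 1.0
    for event, required in cond.items():
        p *= pr[event] if required else 1.0 - pr[event]
    return p

def expected_sum(leaves, pr):
    """E(SUM) = sum over leaves of value * Pr(leaf is present)."""
    return sum(v * presence_probability(cond, pr) for v, cond in leaves)

# Event probabilities and bonus leaves of the running example
pr_events = {"J": 0.3, "M": 0.6}
bonus_leaves = [
    (37, {"J": True}), (50, {"J": True}),
    (25, {"J": False}), (50, {"J": False}),
    (30, {"J": True, "M": True}),
    (44, {"M": True}),
    (15, {"M": False}),
]
print(round(expected_sum(bonus_leaves, pr_events), 4))
```

The computation touches every leaf once and never enumerates instances, which is the reason the moment is computable in polynomial time even though the distribution itself is hard to compute over CIE.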


Aggregation over MUX-DET

                           COUNT   SUM                             MIN   COUNTD    AVG
Possible Value             P       NP-c                            P     NP-c      in NP
Probability Computation    P       P in |input| + |distribution|   P     FP^#P-c   FP^#P-c
Moment Computation         P       P                               P     P         P

COUNT, SUM, MIN are Easy ...
… because they allow for divide and conquer evaluation:
      SUM {| a,b,c,d |} = SUM {| a,b |} + SUM {| c,d |}

α is a monoid aggregate function if
    α({| a1,..., an |}) = α({| a1 |}) ⊕ ... ⊕ α({| an |})
for some commutative monoid (M, ⊕) and all a1,..., an in M

Examples:
• monoid: count, sum, min, parity, top K
• not monoid: countd, avg

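
The divide and conquer property can be checked on small bags with the Python sketch below. Describing each monoid aggregate by a triple (value of a singleton bag, binary operation, neutral element) is an encoding chosen for this sketch, not notation from the slides.

```python
from functools import reduce

# name -> (aggregate of a singleton bag, monoid operation, neutral element)
MONOIDS = {
    "count":  (lambda a: 1, lambda x, y: x + y, 0),
    "sum":    (lambda a: a, lambda x, y: x + y, 0),
    "min":    (lambda a: a, min, float("inf")),
    "parity": (lambda a: 1, lambda x, y: (x + y) % 2, 0),
}

def aggregate(name, bag):
    """Combine the singleton aggregates of all elements with the monoid operation."""
    single, op, neutral = MONOIDS[name]
    return reduce(op, (single(a) for a in bag), neutral)

def aggregate_split(name, left, right):
    """Aggregate two sub-bags separately, then combine the two results."""
    _, op, _ = MONOIDS[name]
    return op(aggregate(name, left), aggregate(name, right))

bag = [37, 50, 30, 44]
for name in MONOIDS:
    assert aggregate(name, bag) == aggregate_split(name, bag[:2], bag[2:])
print(aggregate("sum", bag), aggregate("min", bag))
```

The same split fails for avg: the averages of two sub-bags do not determine the average of the whole bag unless the sub-bag sizes are also known, and similarly the numbers of distinct values in two sub-bags do not determine countd of their union.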
Convex Sums and Convolutions

If α is a monoid function, answer distributions
can be computed bottom up, using two operations:

Convex Sum (at a MUX node whose children D1, D2 are chosen with probabilities p, q):

       α(MUX(p: D1, q: D2)) = p α(D1) + q α(D2)

Convolution (at any other node with children D1, D2; depends on the monoid):

       α(node(D1, D2)) = α(D1) ∗ α(D2)

Convolution of Distributions

(M, ⊕) monoid

α(D1), α(D2) distributions of subdocuments

   (\alpha(D_1) * \alpha(D_2))(c) = \sum_{c_1 \oplus c_2 = c} \alpha(D_1)(c_1) \, \alpha(D_2)(c_2)

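
A minimal Python sketch of the bottom-up evaluation with convex sums and convolutions. It assumes a nested-tuple encoding of the document (hypothetical), MUX probabilities that add up to 1, and an aggregate such as SUM or MIN whose value on a singleton bag is the element itself. The sample tree contains the bonus leaves of the MUX-DET document shown earlier.

```python
from collections import defaultdict

def convex_sum(weighted):
    """Mix child distributions: result(c) = sum of p * dist(c) over the choices."""
    out = defaultdict(float)
    for p, dist in weighted:
        for c, q in dist.items():
            out[c] += p * q
    return dict(out)

def convolve(d1, d2, op):
    """Convolution with respect to the monoid operation `op`."""
    out = defaultdict(float)
    for c1, p1 in d1.items():
        for c2, p2 in d2.items():
            out[op(c1, c2)] += p1 * p2
    return dict(out)

def distribution(node, op, neutral):
    """node: ("leaf", value), ("mux", [(p, child), ...]) or ("node", [child, ...])."""
    kind, payload = node
    if kind == "leaf":
        return {payload: 1.0}
    if kind == "mux":
        return convex_sum([(p, distribution(child, op, neutral))
                           for p, child in payload])
    # any other node: fold the children's distributions with the convolution
    dist = {neutral: 1.0}
    for child in payload:
        dist = convolve(dist, distribution(child, op, neutral), op)
    return dist

# Bonus leaves of the MUX-DET example
doc = ("node", [
    ("mux", [(0.3, ("node", [("leaf", 37), ("leaf", 50)])),
             (0.7, ("node", [("leaf", 25), ("leaf", 50)]))]),
    ("mux", [(0.6, ("node", [("leaf", 30), ("leaf", 44)])),
             (0.4, ("leaf", 15))]),
])
sum_dist = distribution(doc, lambda x, y: x + y, 0)
for s in sorted(sum_dist):
    print(s, round(sum_dist[s], 4))
```

Each convolution costs the product of the sizes of the two distributions involved, so the running time depends on how large the intermediate distributions become, in line with the bound stated for SUM in terms of |input| + |distribution|.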
Approximating Query Answers
Over CIE, probability and moment computation can be hard
How good are Monte-Carlo methods?

Classical results (Hoeffding) imply: To achieve
         | E(α(D)^k) - Estimate | < ε with probability 1 - δ
at most O(R^{2k} ε^{-2} log(1/δ)) samples are needed,
                                   where R = max |α(d)|.

Consequence: Given ε and δ, at most quadratically many
samples (in the size of D) are needed for E(COUNTD(D)).

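
A Monte-Carlo estimator along these lines can be sketched in Python as follows. The sample count uses the standard two-sided Hoeffding inequality for values in [-R^k, R^k]; its constants are not taken from the slides, which state only the asymptotic bound. The event probabilities and bonus leaves are those of the running example in a hypothetical flat encoding, with the condition on the leaf 15 read as "M is false".

```python
import math
import random

def hoeffding_samples(R, k, eps, delta):
    """Number of samples sufficient for error < eps with probability >= 1 - delta,
    when the sampled quantity lies in [-R**k, R**k]."""
    width = 2.0 * R ** k
    return math.ceil(width ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

def estimate_moment(sample_world, aggregate, k, n, rng):
    """Average aggregate(world)**k over n independently sampled worlds."""
    return sum(aggregate(sample_world(rng)) ** k for _ in range(n)) / n

MC_PR = {"J": 0.3, "M": 0.6}
MC_LEAVES = [
    (37, {"J": True}), (50, {"J": True}),
    (25, {"J": False}), (50, {"J": False}),
    (30, {"J": True, "M": True}), (44, {"M": True}), (15, {"M": False}),
]

def sample_world(rng):
    """Draw every event independently with its probability."""
    return {e: rng.random() < p for e, p in MC_PR.items()}

def countd(world):
    """Number of distinct bonus values present in the sampled world."""
    return len({v for v, cond in MC_LEAVES
                if all(world[e] == b for e, b in cond.items())})

rng = random.Random(0)
R = len({v for v, _ in MC_LEAVES})   # COUNTD never exceeds the number of distinct values
n = hoeffding_samples(R, k=1, eps=0.1, delta=0.05)
estimate = estimate_moment(sample_world, countd, 1, n, rng)
print(n, round(estimate, 3))
```

Because R is bounded by the number of leaves for COUNTD and k = 1, the sample count grows at most quadratically with the document size for fixed ε and δ.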
Probabilistic Aggregation: Related Work

• Tree pattern queries over MUX-DET
  with HAVING constraints [Cohen/Kimelfeld/Sagiv]

• Conjunctive queries with HAVING constraints over
  relational probabilistic databases [Ré/Suciu]

• Work on various special topics in the relational setting
  – probabilistic data streams
  – uncertain schema mappings




Aggregates over PXML: Conclusion
First results of an ongoing project
• Map of the problem space
• Largely complete investigation for single path queries:
    – Intractability for CIE
    – Hierarchical dependencies in MUX-DET can be
       exploited for monoid aggregation functions
Some results carry over to other models, e.g.,
• Uncertain schema mappings (Dong/Halevy/Yu)

Current work:
• richer query languages, continuous distributions on leaves
