# Matching Twigs in Probabilistic XML


```					                         VLDB 2007
Vienna, Austria

Matching Twigs in
Probabilistic XML
Benny Kimelfeld & Yehoshua Sagiv

The Selim and Rachel Benin School of Engineering and Computer Science
Example: Scanning Aerial Photography
Find regions that include a factory building and a road
… with a high probability

VLDB 2007         Matching Twigs in Probabilistic XML
Analyzing a Region
What is the probability that this region is an answer
(i.e., includes a factory building and a road)?

[Figure: aerial photo of a region with candidate object labels and match probabilities.
Object labels: factory bldg. & wall (40%) / house & road (30%); road (60%);
house (50%) / factory bldg. (40%) / apt. building (50%); factory bldg. (50%).
Match probabilities: 36%, 24%, 45%, 36%.]

The probability of each match can be significantly
smaller than the probability that there is any match
But specifying the probability of each match does not answer the question!
A Database Point of View
Querying probabilistic data:
Each answer has an amount of certainty:
The probability of being obtained when querying a random database
Probabilistic data: A prob. process for generating random data

[Figure: a query (twig with nodes *, region, building) posed over probabilistic data
(aerial-photo, region, with area and width values)]
What Query Should We Pose?
A pattern:
• An answer is a match
• What is the probability of each specific match?
• What is the probability of each pair of road & factory building?

A pattern w/ projection on region:
• An answer is a projection of one or more matches
• What is the prob. of each …
• For each region, what is the prob. that it has some pair of road & factory building?
This is what we need!
Another Example
Find the following objects in one region:
A factory building, a road, an antenna, a heliport, a track

Finding a Partial Match
Find the following objects in one region:
A factory building, a road, an antenna, a heliport, a track
heliport (80%)

partial
(36%)
No Track!

factory bldg. w/ antennas (50%) /
apt. building w/ water tanks (30%)

For many applications, that’s good enough …

What If …
Should we just filter out the whole match?
Does not make sense!
What about the previous partial match?

heliport (80%)

track (20%)   match
(7.2%)

factory bldg w/ antennas (50%) /
apt. building w/ water tanks (30%)

The probability may be too low to be of any interest!

Finding Maximal Matches
The goal is to find the maximal among the
partial matches with a sufficient probability

[Figure: a pattern (*, region, building) over probabilistic data (aerial-photo, region, with area and width values)]
Querying Prob. Data: Earlier Work
• Projection and incomplete semantics were
explored for relational models
– Projection: Very simple queries can be highly
intractable (data complexity) [Dalvi & Suciu, VLDB 04]
– Maximally joining relations: Tractable under data
complexity, generally intractable under query-and-
data complexity [Kimelfeld & Sagiv, PODS 07]
• Yet tractable for important classes of schemas

• None of these paradigms was studied in the context
of prob. XML (only complete matches w/o projection)
But they are more relevant to prob. XML
since, as the paper shows, they become tractable
Content of the Paper
In the paper, we also have some preliminary results on
the combination of maximal matches and projection

Query evaluation over probabilistic XML
Efficient algorithms and complexity analysis

Evaluating twig queries with projection
Evaluating Boolean twig queries
Finding maximal matches of twigs

In the paper, we explain in detail why our results do not
follow from previous results on XML/relational models

Talk Overview

1. Introduction
2. Twig Queries over Probabilistic XML
− XML and Twig Queries
− Probabilistic XML
− Querying Probabilistic XML (Complete Semantics)

3. Query Evaluation (Complete Semantics)
4. Finding Maximal Matches
5. Conclusion, Related and Future Work

(Ordinary) XML Documents
Rooted tree

aerial-photo
  region
    factory
      complex
        park.lot
        heliport
      bldg.
      @area: 2.5km2
    field
      @area: 1.3km2
      heliport

Each node has a tag, a value or both
Twig Queries
• Rooted tree
• Output node (projection); possibly, more than one
• Node predicate over the tag and value
• Child edges and descendant edges

[Figure: a twig with nodes *, region, heliport, factory, park. lot and @area, where @area carries the predicate ≥10km2]
A match of a twig T in a document d is a
mapping from the nodes of T to those of d
root(T) → root(d)            node predicates are satisfied
child edge → edge            desc. edge → path
An answer is obtained from a match by listing the images of the output nodes
That is, applying projection to the match

[Figure: a twig T (nodes *, region, heliport, factory, park. lot, @area ≥10km2) matched in a document d
(aerial-photo, region, factory, field, complex, bldg., park.lot, heliport, @area 15km2, @area 1.3km2)]
Boolean Queries
A twig without output nodes is a Boolean twig
The answer is either true or false

B(d) = true means that there is a match of B in d

[Figure: a Boolean twig B (nodes *, region, heliport, factory, park. lot, @area ≥10km2) and a document d
(aerial-photo, region, factory, field, complex, bldg., park.lot, heliport, @area 15km2, @area 1.3km2)]

Probabilistic XML
Probabilistic XML document: A probabilistic process of generating ordinary XML documents
Random instance: An ordinary XML document d, generated with probability Pr(d)
The probabilities of all random instances sum to one: ∑_d Pr(d) = 1

[Figure: a random instance d (aerial-photo, region, factory, field, complex, bldg., area 2.5km2, area 1.3km2, park.lot, heliport)]
Implicit Representations
In practice, the probability space may be huge
E.g., uncertainty in many small pieces of data
It is unrealistic to represent the probabilistic
document by explicitly specifying the entire space

[Figure: several possible instances of the aerial-photo document, differing in which nodes are present]

We usually explore implicit representations
Such as the following one that we consider:
A ProTDB Document [Nierman & Jagadish 02]
• Rooted tree
• 2 types of nodes: ordinary nodes and distributional nodes
• 2 types of distributions: independent and mutually exclusive

[Figure: a ProTDB document with ordinary nodes aerial-photo, region, neighborhood, factory, house, house,
vehicle, building, size, size, type, park.lot, heliport, s, m, track, private; one probability (0.8) is shown]
A probability for each outgoing edge of a distributional node

[Figure: the same ProTDB document, with a probability on an outgoing edge of a distributional node (0.8)]
Instance Generation: Step 1
• Traverse the tree top-down
• Distributional nodes choose a set of children
  – Independent: choose children independently
  – Mutually exclusive: choose at most one child
• Drop unchosen children
Instance Generation: Step 2
• Drop the distributional nodes
• Connect each ordinary node to its closest ancestor

[Figure: the nodes remaining after Step 1: aerial-photo, region, neighborhood, factory, house, vehicle, size, type, heliport, s, track]
The Result: An Ordinary Document

[Figure: the generated ordinary document, with nodes aerial-photo, region, neighborhood, factory, house, vehicle, size, type, heliport, s, track]

Querying Probabilistic XML
Users pose an ordinary query (e.g., a twig w/ projection)
That is, of the type that is applied to non-probabilistic documents
… but the document is probabilistic

[Figure: a query (twig w/ projection: *, region, building) over a probabilistic XML document
(aerial-photo, region, with area and width values)]
When querying probabilistic data,
Each answer has a probability (certainty)

Pr(A) = Pr(A is obtained by applying Q to a random document of P)
      = Pr(A ∈ Q(P))
The Prob. of Satisfying a Boolean Query
When querying probabilistic data,
Each answer has a probability (certainty)
If B is a Boolean pattern, we are interested in:
Pr(There is a match of B in a random document of P) = Pr(B(P) = true)

Computational Problems
Non-Boolean Queries:
Input: A prob. document P, a non-Boolean twig
query Q, a threshold p≥0
Goal: Find all answers A, s.t. Pr(A∈Q(P))≥ p

Boolean Queries:
Input: A prob. document P, a Boolean twig query B
Goal: Compute Pr(B(P)=true)

From Regular to Boolean Queries
We apply a standard reduction from regular queries
(that generate mappings) to Boolean ones:

1. Compute the answers as if the document is
ordinary (i.e., ignore the distributional nodes)
2. Compute the probability of each answer

Step 2 is done by evaluating a Boolean query
That is, computing the probability of a match

Next, we consider the evaluation of Boolean queries

An Example
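[Figure: a probabilistic document P rooted at r, with ordinary nodes labeled a, b, c, d, e and
probabilities 0.2, 0.4, 0.5, 0.6, 0.7, 0.8 on the outgoing edges of distributional nodes,
and a twig query Q with root * and nodes a, *, *, b, c, d]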
Possible Matches
• Matches are not disjoint events
• Matches are not independent events

[Figure: the possible matches of Q in P, each shown on a copy of the document]
Our Approach: Dynamic Programming
• Document nodes are traversed bottom-up
• When visiting a node, evaluate a collection of queries (inc. the original one) over its subtree
• Special treatment if the visited node is distributional

[Figure: the document P and a collection of twigs derived from Q, with the probabilities
computed at a visited node (0.0, 0.6, 0.0, 0.4, 0.0, 1.0, …)]
Bottom-Up Evaluation
How can we compute the probability that there is a match,
based on previous results for the descendants?
Problem: Each specific match can involve several different children
From Twig to Negated Branches
Let T1, T2, T3 denote the branches of the twig T
(each consists of the root and the subtree under one child of the root)
T ≡ T1 ⋀ T2 ⋀ T3
Pr(T) = 1 − Pr(¬T1 ⋁ ¬T2 ⋁ ¬T3)
Next: How to compute this value
From a Disjunction to Conjunctions
By the principle of inclusion & exclusion (T1, T2, T3 are the branches of the twig):
Pr(¬T1 ⋁ ¬T2 ⋁ ¬T3) = Pr(¬T1) + Pr(¬T2) + Pr(¬T3) − Pr(¬T1 ⋀ ¬T2) − Pr(¬T2 ⋀ ¬T3) …
Next: How to compute the value of each conjunction, e.g., Pr(¬T2 ⋀ ¬T3)
From a Document to Branches
A document satisfies a conjunction of negated twig branches
iff each of the doc. branches satisfies the conjunction
Good news: Document branches are independent!
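Using Previous Computations on Children
For a conjunction ¬Ti ⋀ ¬Tj of negated twig branches, the probability is a product over the doc. branches:
Pr(¬Ti ⋀ ¬Tj) = Pr(doc. branch 1 satisfies ¬Ti ⋀ ¬Tj) × Pr(doc. branch 2 satisfies ¬Ti ⋀ ¬Tj) × Pr(doc. branch 3 satisfies ¬Ti ⋀ ¬Tj)
Cut the roots from both twig and doc. branches:
each factor becomes the probability that the subtree of a child satisfies the conjunction
of the negated twig branches without their roots, which was computed when that child was visited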
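Descendant Edges
• In the computation we described, we assumed that the
root has only child edges; it would not work otherwise!

The corresponding twig branches are replaced:
a negated branch that starts with a descendant edge is equivalent to the conjunction of
two negated branches, one that starts with a child edge instead, and one that starts with
a child edge to a new * node, followed by the original descendant edge

[Figure: ¬(branch from * to a node with children b and c) ≡ ¬(same branch) ⋀ ¬(branch with an additional * node above the node with children b and c)]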
Missing Details
• Creating the list of twigs that are evaluated over
the subtree rooted at each visited node
• Different evaluation methods, depending on the
type of the visited node
– Ordinary node (sketched in the previous slides)
– Distributional node
• Independent distribution
• Mutually-exclusive distribution

• Dealing with node predicates of the twig

All the details of the algorithm are in the paper

Efficiency
The algorithm computes Pr(B(P)=true) in time

O(c^|B| · |P|)

Is there an efficient algorithm under query-and-data
complexity (polynomial in the query also)?
No! Computing Pr(B(P)=true) is #P-complete
under query & data complexity!

Even if: no desc. edges, only independent distributions, ...

Standard Terminology
T0: a subtree of twig T that includes the root
A match m0 of T0 is a partial match of T
m2 subsumes m1 if m2 includes the mappings of m1
That is, m1 = m2 over domain(m1)

[Figure: a twig T (nodes *, a, e, f, b, c, d) with a subtree T0, and two partial matches m1 and m2
in the document P, where m2 subsumes m1]
A partial match m is maximal if:
Ordinary Data:
• ∄ m0, such that m0 ≠ m and m0 subsumes m
Probabilistic Data:
• Pr(m) ≥ threshold
• ∀ m0, if m0 ≠ m and m0 subsumes m, then Pr(m0) < threshold
In other words, m is maximal with a sufficient probability
The Computational Problem

Input: A probabilistic document P, a twig pattern T,
a threshold p≥0
Goal: Find all maximal matches of T in P w.r.t. p

Complexity of Finding Maximal Matches
• It is trivial to show that maximal matches can be found
efficiently under data complexity
• Unlike the case of complete matches (NP-complete),
maximal matches can be computed
efficiently under query-and-data complexity

Evaluation Algorithm
• The algorithm runs in incremental polynomial time
• All the details are in the paper …

Paper Summary
• Query evaluation over probabilistic XML is
investigated
– Known data model
– Twig patterns (node predicates, child & desc. edges)
– Complete & maximal semantics, projection
• Evaluation algorithm for Boolean queries
– Also used for evaluating queries with projection
– Efficient under data complexity
• An algorithm for finding the maximal matches
– Efficient under query-and-data complexity
• Analysis of the complexity of querying prob. XML

Complexity Results

Semantics   Query            Data Complexity   Query & Data Complexity
Complete    w/o projection   Poly.             NP-complete
Complete    Boolean          Poly.             #P-complete
Complete    w/ projection    Poly.             #P-complete
Maximal     w/o projection   Poly.             Inc. Poly.
Maximal     w/ projection    Poly.             Open
Other Models of Probabilistic XML
The complexity results in the different prob. XML
models are a part of our ongoing research
Query evaluation: Complete semantics w/ projection
• Fuzzy trees [Abiteboul & Senellart, 2006]: #P-Complete
• ProTDB [Nierman and Jagadish, 2002] (our model): Tractable
• Simple prob. trees [Abiteboul & Senellart, 2006]: Tractable
• PXML [Hung, Getoor & Subrahmanian, 2003]: Tree docs.: Tractable, DAG docs.: #P-hard
Ongoing and Future Work
Implementing a system for representing and
querying probabilistic XML
Optimization of the proposed algorithms
– We already obtained significant improvements, both
experimentally and analytically
Extending the expressiveness of the model of
probabilistic XML
– New types of distributional nodes
– Ongoing work: A combination of ProTDB [Nierman and
Jagadish, 2002] and PXML [Hung, Getoor &
Subrahmanian, 2003]
Combining incompleteness and projection
Thank you!

Questions?


```
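The slides on ProTDB instance generation and on possible matches can be illustrated with a small brute-force sketch. It is not the dynamic-programming algorithm of the talk: it simply enumerates every random instance of a tiny document, which is exponential in general. The document, the node names (`factory:1`, `factory:2`, ...), the probabilities, and the helper functions `ordinary`, `ind`, `mux` and `worlds` are all hypothetical, chosen only to echo the aerial-photo example.

```python
# Hypothetical ProTDB-style document; all names and probabilities are illustrative.
def ordinary(label, *children):
    return ("ord", label, children)

def ind(*edges):  # edges are (probability, child); children are chosen independently
    return ("ind", edges)

def mux(*edges):  # edges are (probability, child); at most one child is chosen
    return ("mux", edges)

def worlds(node):
    """Return [(probability, forest)] for every forest of ordinary trees the node can generate."""
    kind = node[0]
    if kind == "ord":
        combos = [(1.0, ())]
        for child in node[2]:
            combos = [(p * q, f + g) for p, f in combos for q, g in worlds(child)]
        # An ordinary node always survives and adopts whatever its subtree generated.
        return [(p, ((node[1], f),)) for p, f in combos]
    if kind == "ind":
        combos = [(1.0, ())]
        for prob, child in node[1]:
            skipped = [(p * (1 - prob), f) for p, f in combos]
            chosen = [(p * prob * q, f + g) for p, f in combos for q, g in worlds(child)]
            combos = skipped + chosen
        return combos
    # Mutually exclusive: no child with the leftover probability, else exactly one child.
    combos = [(1 - sum(prob for prob, _ in node[1]), ())]
    for prob, child in node[1]:
        combos += [(prob * q, g) for q, g in worlds(child)]
    return combos

doc = ordinary(
    "region",
    ind((0.6, ordinary("road"))),
    mux((0.4, ordinary("factory:1")), (0.5, ordinary("house"))),
    mux((0.5, ordinary("factory:2")), (0.5, ordinary("apt-building"))),
)

def prob(event):
    """Probability that a random instance satisfies `event` (a predicate on the root's child labels)."""
    total = 0.0
    for p, forest in worlds(doc):
        labels = {child[0] for child in forest[0][1]}
        if event(labels):
            total += p
    return total

# Boolean query: the region has a road and at least one factory building.
p_any = prob(lambda ls: "road" in ls and any(l.startswith("factory") for l in ls))
# Two specific matches of the same pattern.
p_road_factory1 = prob(lambda ls: {"road", "factory:1"} <= ls)
p_road_factory2 = prob(lambda ls: {"road", "factory:2"} <= ls)
print(p_any, p_road_factory1, p_road_factory2)
```

In this toy document the two specific matches have probabilities of about 0.24 and 0.30, while the probability that some match exists is about 0.42. The matches are neither disjoint nor independent, so the answer is neither their sum nor obtainable from either one alone, which is the point made on the "Possible Matches" slide.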
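The steps sketched on the slides "From Twig to Negated Branches" through "Using Previous Computations on Children" can be written out for a twig with three branches. This is a hedged summary, not the paper's full derivation: the symbols $T_1, T_2, T_3$ (the branches of the twig $T$), $T_i'$ (the branch $T_i$ with its root cut off), $v$ (an ordinary document node whose label satisfies the root predicate) and $u$ (a child of $v$) are notation assumed here, and the last line assumes the root of the twig has only child edges, as the slides do.

```latex
\begin{align*}
\Pr(T) &= 1 - \Pr(\lnot T_1 \lor \lnot T_2 \lor \lnot T_3) \\
\Pr(\lnot T_1 \lor \lnot T_2 \lor \lnot T_3)
  &= \sum_{i} \Pr(\lnot T_i)
   - \sum_{i<j} \Pr(\lnot T_i \land \lnot T_j)
   + \Pr(\lnot T_1 \land \lnot T_2 \land \lnot T_3) \\
\Pr\Bigl(\bigwedge_{i \in S} \lnot T_i \text{ holds at } v\Bigr)
  &= \prod_{u \text{ child of } v}
     \Pr\Bigl(\bigwedge_{i \in S} \lnot T_i' \text{ holds at } u\Bigr)
\end{align*}
```

The second line is the principle of inclusion and exclusion for three events. The third line uses the independence of the document branches, so each factor is a value that the bottom-up traversal has already computed for the child $u$.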
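The definition of a maximal match over probabilistic data can be sketched as a naive filter over an explicitly given list of partial matches. This is only an illustration of the definition, not the incremental-polynomial-time algorithm mentioned in the talk. The function names `subsumes` and `maximal_matches`, the node identifiers, and the probabilities are hypothetical (the values 0.36 and 0.072 echo the partial-match example in the slides).

```python
def subsumes(m2, m1):
    """m2 subsumes m1 if m2 agrees with m1 on every twig node in the domain of m1."""
    return all(m2.get(twig_node) == doc_node for twig_node, doc_node in m1.items())

def maximal_matches(matches, threshold):
    """Keep matches with Pr >= threshold that no other such match subsumes."""
    sufficient = [(m, p) for m, p in matches if p >= threshold]
    return [
        m for m, _ in sufficient
        if not any(m0 != m and subsumes(m0, m) for m0, _ in sufficient)
    ]

# Hypothetical partial matches (twig node -> document node) with their probabilities.
m_heliport = {"building": "b1", "heliport": "h1"}
m_road = {**m_heliport, "road": "rd1"}
m_track = {**m_road, "track": "t1"}
candidates = [(m_heliport, 0.40), (m_road, 0.36), (m_track, 0.072)]

print(maximal_matches(candidates, 0.3))   # the match with building, heliport and road
print(maximal_matches(candidates, 0.05))  # the match that also includes the track
```

With a threshold of 0.3, the match that includes the track is too improbable, so the largest match that remains is the one without it. Lowering the threshold to 0.05 makes the match with the track the only maximal one.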