Mining Tree-Query Associations in a Graph


   Bart Goethals
        University of Antwerp, Belgium
   Eveline Hoekx
   Jan Van den Bussche
        Hasselt University, Belgium
                           Graph Data

A (directed) graph over a set of nodes N is a set G of
   edges: ordered pairs $(i, j)$ with $i, j \in N$.
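
As a minimal sketch of this definition, a graph can be held as a Python set of ordered pairs. The node and edge values below are made up for illustration, and the helper names `is_graph_over` and `successors` are hypothetical, not taken from the talk.

```python
# A directed graph as a set of ordered pairs (i, j) over a node set N.
# The concrete nodes and edges are illustrative assumptions.
N = {0, 4, 5, 8}
G = {(0, 8), (5, 8), (5, 4), (8, 4), (8, 8)}

def is_graph_over(G, N):
    """True if every edge (i, j) in G has both endpoints in N."""
    return all(i in N and j in N for (i, j) in G)

def successors(G, i):
    """Nodes j such that (i, j) is an edge of G."""
    return {j for (a, j) in G if a == i}
```

Storing edges as a set makes the membership test `(i, j) in G`, which the later definitions rely on, a constant-time lookup.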




Snapshot of a graph representing the complete metabolic pathway of a
human.


                       Graph Mining

Transactional category
   – dataset: set of many small graphs (transactions)
   – frequency: transactions in which the pattern occurs (at least once)
   – ILP: Warmr
  [AGM, FSG, TreeMiner, gSpan, FFSM]


Single graph category
   – dataset: single large graph
   – frequency: copies of the pattern in the large graph
  [Subdue, Vanetik-Gudes-Shimony, SEuS, SiGraM, Jeh-Widom]


Focus so far is on pattern mining; little work on association rule mining!
                        Our work

• Single graph category
• Pattern + association rule mining
• Patterns with:
  – Existential nodes
  – Parameters
• An occurrence of the pattern in G is any
  homomorphism from the pattern into G.
• So far only considered in the ILP (transactional)
  setting


            Example of a pattern




$\text{frequency} = \#\{\, x \mid \exists z : (5, z) \in G \wedge (z, 8) \in G \wedge (z, x) \in G \,\}$

   Patterns are conjunctive queries.



                  -- G1 = (5, z), G2 = (z, 8), G3 = (z, x)
                  select distinct G3."to" as x
                  from G G1, G G2, G G3
                  where G1."from" = 5 and G1."to" = G2."from"
                          and G1."to" = G3."from" and G2."to" = 8
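
The conjunctive query can be tried out on a toy graph. The sketch below assumes an in-memory SQLite database and made-up edges; the column names are quoted because `from` and `to` are SQL keywords. The talk's implementation runs on IBM DB2, so this only illustrates the idea.

```python
import sqlite3

# Toy edge table G(from, to); the edges are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE G ("from" INTEGER, "to" INTEGER)')
edges = [(5, 1), (1, 8), (1, 2), (1, 3), (5, 4), (4, 2), (4, 8), (4, 9), (6, 7)]
conn.executemany("INSERT INTO G VALUES (?, ?)", edges)

# The pattern as a conjunctive query: G1 = (5, z), G2 = (z, 8), G3 = (z, x).
QUERY = '''
SELECT DISTINCT G3."to" AS x
FROM G G1, G G2, G G3
WHERE G1."from" = 5 AND G1."to" = G2."from"
  AND G1."to" = G3."from" AND G2."to" = 8
'''
answers = sorted(row[0] for row in conn.execute(QUERY))
frequency = len(answers)  # number of distinct answers for x
```

On this toy graph both z = 1 and z = 4 satisfy the two parameter constraints, so x ranges over the successors of either node.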




$\text{frequency} = \#\{\, x \mid \exists z : (5, z) \in G \wedge (z, 8) \in G \wedge (z, x) \in G \,\}$

Example of an Association Rule




    Features of the presented algorithms



•   Pattern mining phase + association mining phase
•   Restriction to trees => efficient algorithms
•   Equivalence checking
•   Apply theory of conjunctive database queries
•   Database oriented implementation




                 Outline rest of talk

•   Formal problem definition
•   Algorithms:
    1. Pattern Mining
      •   Overall approach
      •   Outer loop: incremental
      •   Inner loop: levelwise
      •   Equivalence checking
  2. Association Rule Mining
• Result management
• Experimental results
• Future work
      Formal definition of a tree pattern.

A tree pattern is a tree P whose nodes are called variables,
     and:
1. some variables are marked as existential ($\exists$)
2. some variables are parameters (labeled with a constant)
3. the remaining variables are called distinguished

       Formal definition of a tree query.

A tree query Q is a pair (H, P) where:
1. P is a tree pattern, the body of Q
2. H, the head of Q, is a tuple of distinguished variables and
     parameters of P. All distinguished variables of P must appear
     at least once in H.

        Formal definition of a matching

A matching of a pattern P in a graph G is a homomorphism
$h : P \to G$ with $h(z) = a$ for each parameter $z$ labeled $a$.
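
The definition can be illustrated with a brute-force enumeration of matchings. This is a sketch under assumptions: the pattern is given as a list of (parent, child) variable pairs, parameters as a dict from variable to constant, and the function name `matchings` and the example data are hypothetical. It tries every assignment, so it is only suitable for tiny inputs.

```python
from itertools import product

def matchings(pattern_edges, params, G):
    """Yield every homomorphism h from the pattern into G as a dict.

    pattern_edges: list of (parent, child) variable pairs forming a tree.
    params: dict mapping each parameter variable to its constant label.
    G: set of (i, j) edges.
    """
    variables = sorted({v for e in pattern_edges for v in e})
    nodes = sorted({n for e in G for n in e})
    for values in product(nodes, repeat=len(variables)):
        h = dict(zip(variables, values))
        # Parameters must be mapped to their labels.
        if any(h[z] != a for z, a in params.items()):
            continue
        # Every pattern edge must land on an edge of G.
        if all((h[u], h[v]) in G for u, v in pattern_edges):
            yield h

# Illustrative data: root w (parameter 5) with child z, which has children y (parameter 8) and x.
G_demo = {(5, 1), (1, 8), (1, 2), (5, 4), (4, 8)}
pattern = [("w", "z"), ("z", "y"), ("z", "x")]
params = {"w": 5, "y": 8}
found = list(matchings(pattern, params, G_demo))
```

Note that a homomorphism need not be injective: x and y may both be mapped to node 8.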




Example: Matching

                 z y    z x

              h1 0    8   4
              h2 0    8   8
              h3 0    8   4
              h4 0    8   5
              h5 0  8 8

         Formal definition of frequency

We define the answer set of Q in G as follows:
               $Q(G) := \{\, f(H) \mid f \text{ is a matching of } P \text{ in } G \,\}$



The frequency of Q in G is the number of answers in the answer set.
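
To make the difference between matchings and answers concrete, the following sketch computes the answer set by brute force. The function names, the query encoding (head tuple, edge list, parameter dict) and the toy graph are assumptions for illustration only.

```python
from itertools import product

def answer_set(head, pattern_edges, params, G):
    """Return { f(H) : f is a matching of the pattern in G } as a set of tuples."""
    variables = sorted({v for e in pattern_edges for v in e})
    nodes = sorted({n for e in G for n in e})
    answers = set()
    for values in product(nodes, repeat=len(variables)):
        f = dict(zip(variables, values))
        ok_params = all(f[z] == a for z, a in params.items())
        ok_edges = all((f[u], f[v]) in G for u, v in pattern_edges)
        if ok_params and ok_edges:
            answers.add(tuple(f[v] for v in head))  # project onto the head
    return answers

def frequency(head, pattern_edges, params, G):
    """Number of distinct answers, not number of matchings."""
    return len(answer_set(head, pattern_edges, params, G))

# Illustrative data: z is existential when it is left out of the head.
G_demo = {(5, 1), (1, 8), (1, 2), (5, 4), (4, 8)}
body = [("w", "z"), ("z", "y"), ("z", "x")]
params = {"w": 5, "y": 8}
```

With head `("x",)` the three matchings of this pattern collapse to two answers, because z = 1 and z = 4 both yield x = 8.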




Problem statement 1: Tree query mining


Given a graph G and a threshold k, find all tree queries that
have frequency at least k in G. Those queries are called
frequent.

     Formal definition of an association rule

An association rule (AR) is of the form $Q_1 \Rightarrow Q_2$ with Q1 and Q2
tree queries. The AR is legal if $Q_2 \subseteq Q_1$. The confidence of the
AR in a graph G is defined as the frequency of Q2 divided by the
frequency of Q1.
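
A small worked instance of the confidence computation, assuming two single-edge queries on a made-up graph: Q1 asks for all x that have some successor, and Q2 asks for all x that have 8 as a successor. Every answer of Q2 is also an answer of Q1, so the rule is legal.

```python
from fractions import Fraction

# Illustrative graph; the edges are assumptions.
G = {(5, 1), (1, 8), (1, 2), (5, 4), (4, 8)}

# Q1: x such that there exists y with (x, y) in G.
answers_q1 = {x for (x, y) in G}
# Q2: x such that (x, 8) in G (y is now a parameter labeled 8).
answers_q2 = {x for (x, y) in G if y == 8}

# Confidence of Q1 => Q2 is freq(Q2) / freq(Q1).
confidence = Fraction(len(answers_q2), len(answers_q1))
```

Here the subset check on one graph only illustrates legality; legality itself is a containment that must hold on all graphs.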




Problem statement 2: Association rule mining



 • Input: a graph G, minsup, a tree query Qleft frequent in
   G, minconf
 • Output: all tree queries Q such that Qleft $\Rightarrow$ Q is a legal
   and confident association rule in G.

            Pattern Mining Algorithm
Outer loop:
  Generate, incrementally, all possible trees of increasing
  sizes. Avoid generation of isomorphic trees.

Inner loop:
  For each newly generated tree, generate all queries based
  on that tree, and test their frequency.

[Figure: a generated tree with root x1, child x2, and leaves x3 and x4,
and several queries derived from one tree.]

                      Outer loop

• It is well known how to efficiently generate all trees
  uniquely up to isomorphism

• Based on canonical form of trees.

• [Scions, Li-Ruskey, Zaki, Chi-Young-Muntz]
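
The role of the canonical form can be sketched as follows. This is not one of the cited enumeration algorithms: it is a naive generate-and-deduplicate approach, assuming rooted unordered trees encoded as nested tuples of children. The cited methods avoid producing duplicates in the first place.

```python
def canon(tree):
    """Canonical form of a rooted unordered tree given as a nested tuple of children."""
    return tuple(sorted(canon(child) for child in tree))

def add_leaf(tree):
    """Yield every tree obtained by attaching one new leaf to some node."""
    yield tree + ((),)                      # attach the leaf to the root
    for i, child in enumerate(tree):
        for grown in add_leaf(child):       # or somewhere below child i
            yield tree[:i] + (grown,) + tree[i + 1:]

def trees_up_to(n):
    """Return {size: set of canonical rooted trees} for sizes 1..n."""
    levels = {1: {()}}
    for size in range(2, n + 1):
        levels[size] = {canon(t) for small in levels[size - 1]
                        for t in add_leaf(small)}
    return levels

levels = trees_up_to(6)
counts = [len(levels[k]) for k in range(1, 7)]
```

Two trees are isomorphic exactly when their canonical forms are equal, so storing canonical forms in a set keeps one representative per isomorphism class.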




       Inner loop: Levelwise approach

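• A query Q is characterized by
   – $\Pi_Q$: set of existential nodes
   – $\Sigma_Q$: set of parameters
   – $\lambda_Q$: labeling of the parameters by constants.

• $Q_1 = (\Pi_1, \Sigma_1, \lambda_1)$ specializes $Q_2 = (\Pi_2, \Sigma_2, \lambda_2)$ if $\Pi_2 \subseteq \Pi_1$, $\Sigma_2 \subseteq \Sigma_1$,
  and $\lambda_1$ agrees with $\lambda_2$ on $\Sigma_2$.

• If $Q_1$ specializes $Q_2$ then $\mathrm{freq}(Q_1) \le \mathrm{freq}(Q_2)$.

• Most general query: $T = (\emptyset, \emptyset, \emptyset)$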
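
The specialization relation is easy to state in code. The sketch below assumes a query on a fixed tree is encoded as a triple (set of existential nodes, set of parameters, dict labeling the parameters); the function name and the sample queries are hypothetical.

```python
def specializes(q1, q2):
    """True if q1 specializes q2.

    Each query is (existential, parameters, labeling): two sets of node
    names and a dict from parameter to constant.
    """
    ex1, par1, lab1 = q1
    ex2, par2, lab2 = q2
    return (ex2 <= ex1 and par2 <= par1
            and all(lab1[v] == lab2[v] for v in par2))

# Illustrative queries on a tree with nodes x1..x4.
top = (set(), set(), {})                               # most general query
q_a = ({"x2"}, {"x1"}, {"x1": 0})
q_b = ({"x2"}, {"x1", "x3"}, {"x1": 0, "x3": 8})
q_c = ({"x2"}, {"x1", "x3"}, {"x1": 5, "x3": 8})       # disagrees with q_a on x1
```

Because frequency can only drop under specialization, a levelwise search can prune every specialization of an infrequent query.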

      Inner loop: Candidate generation

•   $\mathrm{CanTab}_{\Pi,\Sigma} = \{\, \lambda \mid (\Pi, \Sigma, \lambda) \text{ is a candidate query} \,\}$
    $\mathrm{FreqTab}_{\Pi,\Sigma} = \{\, \lambda \mid (\Pi, \Sigma, \lambda) \text{ is a frequent query} \,\}$

•   $Q' = (\Pi', \Sigma')$ is a parent of $Q = (\Pi, \Sigma)$ if either:
       – $\Pi = \Pi'$ and $\Sigma$ has precisely one more node than $\Sigma'$, or
       – $\Sigma = \Sigma'$ and $\Pi$ has precisely one more node than $\Pi'$


•   Join Lemma:
    Each candidacy table can be computed by taking the
    natural join of its parent frequency tables.
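
A minimal sketch of the Join Lemma, assuming each frequency table is a list of rows and each row is a dict from parameter name to constant. The helper `natural_join` and the table contents are illustrative assumptions; in the talk the join is carried out in SQL.

```python
def natural_join(r, s):
    """Natural join of two relations given as lists of dict rows."""
    out = []
    for a in r:
        for b in s:
            # Rows join when they agree on all shared columns.
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                row = {**a, **b}
                if row not in out:
                    out.append(row)
    return out

# Hypothetical parent frequency tables for existential set {x2}, parameters {x1, x3}.
freq_x2_x1 = [{"x1": 0}, {"x1": 5}]                     # parameters {x1}, x2 existential
freq_x2_x3 = [{"x3": 8}, {"x3": 4}]                     # parameters {x3}, x2 existential
freq_none_x1x3 = [{"x1": 0, "x3": 8}, {"x1": 5, "x3": 8}, {"x1": 7, "x3": 4}]

can_tab = natural_join(natural_join(freq_x2_x1, freq_x2_x3), freq_none_x1x3)
```

A labeling survives the join only if each of its restrictions is frequent in the corresponding parent, which is the usual levelwise candidate condition.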


       Inner loop: Frequency counting

• Each candidacy table can be computed by a single SQL
  query (cf. Join Lemma).

• Suppose G(from, to) is a table in the database. Then each
  frequency table can be computed with a single SQL query:
   – Case $\Sigma = \emptyset$:
           » formulate the query in SQL and count
   – Case $\Sigma \neq \emptyset$:
           » formulate $(\Pi, \emptyset, \emptyset)$ in SQL as E
           » natural join of E with $\mathrm{CanTab}_{\Pi,\Sigma}$
           » group by $\Sigma$
           » count each group

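                     Inner loop: Example

[Figure: example query on the tree with root x1, child x2, and leaves x3 and x4.]

• Join expression:

$\mathrm{CanTab}_{\{x_2\},\{x_1,x_3\}} = \mathrm{FreqTab}_{\{x_2\},\{x_1\}} \bowtie \mathrm{FreqTab}_{\emptyset,\{x_1,x_3\}} \bowtie \mathrm{FreqTab}_{\{x_2\},\{x_3\}}$
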
                   Inner loop: Example

• SQL expression E for $(\{x_2\}, \emptyset, \emptyset)$:

      -- G1 = (x1, x2), G2 = (x2, x3), G3 = (x2, x4)
      select distinct G1."from" as x1, G2."to" as x3,
              G3."to" as x4
      from G G1, G G2, G G3
      where G1."to" = G2."from" and G3."from" = G2."from"

                    Inner loop: Example

• SQL expression for filling the frequency table:
        -- E has distinct rows, so count(E.x4) counts the answers per group
        select E.x1, E.x3, count(E.x4)
        from E, CanTab{x2}{x1,x3} as CT
        where E.x1 = CT.x1 and E.x3 = CT.x3
        group by E.x1, E.x3
        having count(E.x4) >= k
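
The two SQL steps of this example can be run end to end on a toy graph. The sketch assumes SQLite instead of DB2, writes E as a common table expression, and uses a plain table named `CanTab` for the candidacy table; the edges, candidate rows and threshold are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE G ("from" INTEGER, "to" INTEGER)')
conn.executemany("INSERT INTO G VALUES (?, ?)",
                 [(0, 1), (1, 8), (1, 4), (1, 5), (0, 2), (2, 8), (2, 6), (3, 1)])
# Hypothetical candidacy table for existential {x2}, parameters {x1, x3}.
conn.execute("CREATE TABLE CanTab (x1 INTEGER, x3 INTEGER)")
conn.executemany("INSERT INTO CanTab VALUES (?, ?)", [(0, 8), (3, 8), (0, 6)])

FILL_FREQTAB = '''
WITH E AS (
  -- x2 is existential, so it is joined on but not selected
  SELECT DISTINCT G1."from" AS x1, G2."to" AS x3, G3."to" AS x4
  FROM G G1, G G2, G G3
  WHERE G1."to" = G2."from" AND G3."from" = G2."from"
)
SELECT E.x1, E.x3, COUNT(E.x4) AS freq
FROM E, CanTab AS CT
WHERE E.x1 = CT.x1 AND E.x3 = CT.x3
GROUP BY E.x1, E.x3
HAVING COUNT(E.x4) >= ?
ORDER BY E.x1, E.x3
'''
k = 3  # frequency threshold
freq_tab = conn.execute(FILL_FREQTAB, (k,)).fetchall()
```

The `DISTINCT` in E matters: the answer (0, 8, 8) is reachable through two different values of x2 and must be counted once.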

                 Equivalent queries

Queries Q1 and Q2 are equivalent if they have the same answer sets on
all graphs G (up to renaming of the distinguished variables).

•   Two cases of equivalent queries:
       1. Q1 has fewer nodes than Q2
       2. Q1 and Q2 have the same number of nodes

                Equivalence theorem

Two queries are equivalent if and only if there are containment
mappings between them in both directions.

A containment mapping from Q1 to Q2 is a homomorphism $h : Q_1 \to Q_2$ that
maps distinguished variables of Q1 one-to-one to distinguished
variables of Q2, and maps parameters of Q1 to parameters of Q2,
preserving labels.
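
The theorem suggests a direct, if inefficient, equivalence test. The sketch below searches for containment mappings by brute force, assuming each query is a dict with an edge list, a set of distinguished variables and a dict of labeled parameters (all other variables are existential). The encoding and function names are hypothetical, and queries consisting of a single node are not handled.

```python
from itertools import product

def containment_mappings(q1, q2):
    """Yield containment mappings from q1 to q2 as dicts (brute force)."""
    vars1 = sorted({v for e in q1["edges"] for v in e})
    vars2 = sorted({v for e in q2["edges"] for v in e})
    edges2 = set(q2["edges"])
    for values in product(vars2, repeat=len(vars1)):
        h = dict(zip(vars1, values))
        # Homomorphism: every edge of q1 must map to an edge of q2.
        if not all((h[u], h[v]) in edges2 for u, v in q1["edges"]):
            continue
        # Distinguished variables map one-to-one to distinguished variables.
        images = [h[v] for v in sorted(q1["dist"])]
        if len(set(images)) != len(images) or not set(images) <= q2["dist"]:
            continue
        # Parameters map to parameters with the same label.
        if all(h[z] in q2["params"] and q2["params"][h[z]] == a
               for z, a in q1["params"].items()):
            yield h

def equivalent(q1, q2):
    """Equivalent iff containment mappings exist in both directions."""
    return (next(containment_mappings(q1, q2), None) is not None
            and next(containment_mappings(q2, q1), None) is not None)

# Illustrative queries: x is distinguished; a and b are existential.
q_small = {"edges": [("x", "a")], "dist": {"x"}, "params": {}}
q_big = {"edges": [("x", "a"), ("x", "b")], "dist": {"x"}, "params": {}}
q_other = {"edges": [("x", "y")], "dist": {"x", "y"}, "params": {}}
```

`q_big` has a second existential leaf that adds no constraint, so it is equivalent to the smaller `q_small`.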




       Case 1: Q1 has fewer nodes than Q2
Redundancy lemma:
Let Q be a tree query without selected nodes. Then Q has a
redundancy if and only if it contains a subtree C in the form of a
linear chain of $\exists$-nodes (possibly just a single node), such that the
parent of C has another subtree that is at least as deep as C.
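
The condition of the lemma can be checked directly on a tree. The sketch assumes the tree is given as a dict from node to its list of children together with the set of existential nodes; this encoding, the function names and the sample trees are illustrative assumptions.

```python
def depth(children, v):
    """Number of nodes on the longest downward path starting at v."""
    return 1 + max((depth(children, c) for c in children.get(v, [])), default=0)

def is_existential_chain(children, existential, v):
    """True if the subtree rooted at v is a linear chain of existential nodes."""
    while True:
        kids = children.get(v, [])
        if v not in existential or len(kids) > 1:
            return False
        if not kids:
            return True
        v = kids[0]

def has_redundancy(children, existential):
    """Check the lemma's condition at every parent node."""
    for parent, kids in children.items():
        for c in kids:
            if is_existential_chain(children, existential, c):
                d = depth(children, c)
                # Another subtree of the same parent at least as deep as the chain.
                if any(depth(children, o) >= d for o in kids if o != c):
                    return True
    return False

# Illustrative trees.
redundant = ({"x1": ["a", "b"], "b": ["c"]}, {"a"})
minimal = ({"x1": ["a"]}, {"a"})
too_deep = ({"x1": ["a", "b"], "a": ["a2"]}, {"a", "a2"})
```

In `redundant` the existential leaf a can be folded onto the sibling b, while in `too_deep` the existential chain is longer than any sibling subtree and therefore adds a real constraint.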


[Figure: query Q1 with its redundant subtree marked.]

 Case 2: Q1 and Q2 have the same number of nodes

• Q1 and Q2 must be isomorphic.

• Canonical form of queries: refine the canonical ordering of
  the underlying unlabeled tree, taking into account node
  labels.
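
One way to realize such a canonical form is to sort children recursively by their own canonical forms, with the node label as the primary key. The sketch assumes labels are tuples such as `("dist",)`, `("exists",)` and `("param", 8)`; the encoding and the function name are hypothetical, and this is a simple stand-in rather than the refinement used in the talk.

```python
def canonical(node, children, label):
    """Canonical form of the labeled subtree rooted at node.

    children: dict from node to list of child nodes.
    label: dict from node to a tuple describing its kind.
    """
    kids = sorted(canonical(c, children, label) for c in children.get(node, []))
    return (label[node], tuple(kids))

# Two drawings of the same query with the children listed in a different order.
t1_children = {"r": ["a", "b"]}
t1_label = {"r": ("dist",), "a": ("exists",), "b": ("param", 8)}
t2_children = {"r": ["u", "v"]}
t2_label = {"r": ("dist",), "u": ("param", 8), "v": ("exists",)}
# Same shape, but a different constant on the parameter.
t3_label = {"r": ("dist",), "u": ("param", 0), "v": ("exists",)}
```

Two queries with the same number of nodes are then isomorphic exactly when their canonical forms compare equal.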




     Association Mining Algorithm


• Input: a graph G, minsup, a tree query Qleft frequent in
  G, minconf
• Output: all tree queries Q such that Qleft $\Rightarrow$ Q is a legal
  and confident association rule in G.




         Containment mappings

• For each tree query, generate all containment mappings
from Qleft to Q, ignoring parameter assignments.




                 Instantiations

• For each containment mapping, generate all parameter
assignments such that Qleft $\Rightarrow$ Q is frequent and
confident.




    Equivalent Association rules

• Equivalence checking of association rules is as
  hard as general graph isomorphism testing.




             Outline rest of talk


• Result management
• Experimental results
• Future work




              Result management


• Output: frequency tables stored in a relational database.

• Browser




Experimental results: Real-life datasets


• Food web nodes54 edges0




              frequency = 176



Experimental results: Real-life datasets


• Food web nodes54 edges0
       (x1,x2,x3,x4,x5)       (x1,x2,x4,x2,x5)

             x1                    x1
                          
        x2        x4               x2

        x3        x5          101 x4 x5

                  confidence = 11%

   Experimental results: Performance


• Fully implemented on top of IBM DB2
• Preliminary performance results:
  – pattern mining algorithm:
     • adequate performance
     • huge number of patterns
     • constant overhead per discovered pattern
  – association mining algorithm:
     • very fast
     • constant overhead per discovered rule


                  Future work



• Applications: scientific data mining
• Loosen restriction to trees




                  References

• Bart Goethals, Eveline Hoekx and Jan Van den Bussche,
  Mining Tree Queries in a Graph, in Proceedings of the
  eleventh ACM SIGKDD International conference on
  Knowledge Discovery and Data Mining, p 61-69, ACM
  Press 2005
• Eveline Hoekx and Jan Van den Bussche, Mining for Tree-
  Query Associations in a Graph, to appear in Proceedings of
  the 2006 IEEE International Conference on Data Mining
  (ICDM 2006)



