Overview of Frequent Subgraph Mining


             Jianlin Feng
          School of Software
      SUN YAT-SEN UNIVERSITY
            June 12, 2010


  Modeling Data With Graphs…
  Going Beyond Transactions
Graphs are suitable for capturing arbitrary relations between the various elements.

  Data Instance                         Graph Instance
  Element                               Vertex
  Element’s Attributes                  Vertex Label
  Relation Between Two Elements         Edge
  Type Of Relation                      Edge Label
  Relation Between a Set of Elements    Hyper Edge

Graphs provide enormous flexibility for modeling the underlying data, as they allow the
modeler to decide what the elements should be and which types of relations are to be
modeled.
Graph, Graph, Everywhere

[Figure panels: Aspirin; Yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)); Internet; Co-author network]
Frequent Subgraph Discovery
- Proposed in ICDM 2001
Given
  D : a set of undirected, labeled graphs
  σ : support threshold; 0 < σ <= 1

Find all connected, undirected graphs that are
  subgraphs of at least σ · |D| of the input
  graphs
     Containment is tested by subgraph isomorphism
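
The support condition can be turned into a minimum graph count. The following is a minimal sketch, assuming σ is given as a fraction and that "at least σ · |D|" is rounded up to a whole number of graphs; the function names are illustrative and do not come from the FSG paper.

```python
import math

def min_support_count(sigma: float, num_graphs: int) -> int:
    """Smallest number of input graphs a pattern must occur in to be frequent."""
    if not 0 < sigma <= 1:
        raise ValueError("sigma must satisfy 0 < sigma <= 1")
    # "at least sigma * |D|" graphs: round up to the next whole graph
    return math.ceil(sigma * num_graphs)

def is_frequent(num_supporting: int, sigma: float, num_graphs: int) -> bool:
    """True if a pattern found in num_supporting graphs meets the threshold."""
    return num_supporting >= min_support_count(sigma, num_graphs)
```

With three input graphs and σ = 0.5, a pattern has to occur in at least two of them.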


Example: Frequent Subgraphs

[Figure: graph dataset with three graphs (A), (B), (C); frequent patterns (1) and (2) at minimum support 2]

May 16, 2012                            5
Example (II)

[Figure: graph dataset and its frequent patterns at minimum support 2]

Terminology-I

   A graph G(V, E) consists of two sets
       V: set of vertices
       E: set of edges
   Assume undirected, labeled graphs
       LV: set of vertex labels
       LE: set of edge labels
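
One possible in-memory form of such a graph is sketched below. The class name `LabeledGraph` and its methods are assumptions made for illustration; the slides do not prescribe a data structure.

```python
from dataclasses import dataclass, field

@dataclass
class LabeledGraph:
    """Undirected graph with a label on every vertex and every edge."""
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> vertex label
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> edge label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must be added before the edge")
        # a frozenset key makes (u, v) and (v, u) the same undirected edge
        self.edge_labels[frozenset((u, v))] = label

    def neighbors(self, v):
        """Set of vertices that share an edge with v."""
        return {w for e in self.edge_labels if v in e for w in e if w != v}

g = LabeledGraph()
for vertex, label in [(0, "C"), (1, "C"), (2, "O")]:
    g.add_vertex(vertex, label)
g.add_edge(0, 1, "single")
g.add_edge(1, 2, "double")
```

Keying edges by `frozenset` is one way to encode undirectedness; an adjacency list keyed by vertex would serve equally well.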




Terminology-II

   A graph is said to be connected if there is a
    path between every pair of vertices
   A graph Gs(Vs, Es) is a subgraph of another
    graph G(V, E) iff
       Vs is a subset of V and Es is a subset of E
   Two graphs G1(V1, E1) and G2(V2, E2) are
    isomorphic if they are topologically identical
       There is a one-to-one mapping from V1 onto V2 such
        that each edge in E1 is mapped to a single edge in
        E2 and vice versa
       For labeled graphs, the mapping must also preserve
        vertex and edge labels
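
The definitions above can be checked directly on small graphs. This is a brute-force sketch, assuming graphs are given as a dict of vertex labels plus a dict of edge labels keyed by `frozenset({u, v})`. It tries every vertex mapping, so it is only practical for a handful of vertices.

```python
from itertools import permutations

def is_connected(vertices, edges):
    """vertices: iterable of ids; edges: iterable of frozenset({u, v})."""
    vertices = set(vertices)
    if not vertices:
        return True
    seen, stack = set(), [next(iter(vertices))]
    while stack:                      # depth-first traversal from one vertex
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(w for e in edges if v in e for w in e if w not in seen)
    return seen == vertices

def is_isomorphic(vl1, el1, vl2, el2):
    """vl*: vertex id -> label; el*: frozenset({u, v}) -> label."""
    if len(vl1) != len(vl2) or len(el1) != len(el2):
        return False
    v1 = list(vl1)
    for image in permutations(vl2):   # every bijection V1 -> V2
        f = dict(zip(v1, image))
        if any(vl1[v] != vl2[f[v]] for v in v1):
            continue                  # vertex labels not preserved
        if all(el2.get(frozenset(f[x] for x in e)) == lab
               for e, lab in el1.items()):
            return True               # every edge maps to an equally labeled edge
    return False
```

Because both graphs have the same number of edges and the vertex mapping is one-to-one, mapping every edge of E1 onto an edge of E2 already covers all of E2, which gives the "vice versa" direction.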
Example of Graph Isomorphism
   ƒ(a) = 1
   ƒ(b) = 6
   ƒ(c) = 8
   ƒ(d) = 3
   ƒ(g) = 5
   ƒ(h) = 2
   ƒ(i) = 4
   ƒ(j) = 7
Terminology-III:
Subgraph isomorphism problem

   Given two graphs G1(V1, E1) and G2(V2, E2):
    find an isomorphism between G2 and a
    subgraph of G1
       There is a one-to-one mapping from V2 into V1 such
        that each edge in E2 is mapped to a single edge in E1
   NP-complete problem
       Reduction from the max-clique or Hamiltonian cycle
        problem
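
A brute-force search makes the definition concrete. This sketch assumes the same illustrative representation as a dict of vertex labels plus a dict of edge labels keyed by `frozenset({u, v})`. The function name `find_embedding` is hypothetical. The search enumerates every injective vertex mapping, so its running time grows exponentially with the pattern size.

```python
from itertools import permutations

def find_embedding(pattern_vl, pattern_el, target_vl, target_el):
    """Return one mapping pattern vertex -> target vertex, or None if none exists."""
    p_vertices = list(pattern_vl)
    # every injective assignment of pattern vertices to target vertices
    for image in permutations(target_vl, len(p_vertices)):
        f = dict(zip(p_vertices, image))
        if any(pattern_vl[v] != target_vl[f[v]] for v in p_vertices):
            continue                  # vertex labels must match
        if all(target_el.get(frozenset(f[x] for x in e)) == lab
               for e, lab in pattern_el.items()):
            return f                  # every pattern edge exists with the same label
    return None

target_vl = {1: "A", 2: "B", 3: "C"}
target_el = {frozenset((1, 2)): "s", frozenset((2, 3)): "s", frozenset((1, 3)): "s"}
pattern_vl = {"u": "A", "v": "C"}
pattern_el = {frozenset(("u", "v")): "s"}
embedding = find_embedding(pattern_vl, pattern_el, target_vl, target_el)
```

The target may contain extra edges between the mapped vertices; only the pattern's edges are required, which matches the subgraph definition based on subsets of V and E.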

FSG: Frequent Subgraph Discovery Algorithm

Follows an Apriori-style level-by-level approach
and grows the patterns one edge at a time.

[Figure: search levels, from top to bottom: single edges, double edges, 3-candidates, 3-frequent subgraphs, 4-candidates, 4-frequent subgraphs]
FSG: Frequent Subgraph Discovery Algorithm


   Key elements for FSG’s computational
    scalability
       Improved candidate generation scheme
       Use of TID-list approach for frequency counting
       Efficient canonical labeling algorithm




    FSG: Basic Flow of the Algorithm

   Enumerate all single- and double-edge
    subgraphs
   Repeat
     Generate all candidate subgraphs of size (k+1)
      from the frequent size-k subgraphs
     Count the frequency of each candidate
     Prune candidates that don’t satisfy the support
      constraint
    Until there are no frequent subgraphs of size (k+1)
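
The loop structure can be sketched on its own, with candidate generation and counting passed in as functions. This is not FSG itself: the toy data assumes every vertex label is unique, so each graph and each pattern is just a set of edge identifiers and containment reduces to a subset test. Joins on primary subgraphs, canonical labels, connectivity checks and subgraph isomorphism are all omitted, and the seed level holds single edges only. All names here are illustrative.

```python
from itertools import combinations

def level_wise_mine(seed_level, generate_candidates, support, min_count):
    """Return the list of frequent-pattern levels, starting from seed_level."""
    levels = []
    frequent = set(seed_level)
    while frequent:                   # stop when a level has no frequent patterns
        levels.append(frequent)
        candidates = generate_candidates(frequent)
        # prune candidates that miss the support constraint
        frequent = {c for c in candidates if support(c) >= min_count}
    return levels

# Toy stand-in: each graph is a set of edge identifiers (unique vertex labels assumed)
graphs = [{"ab", "bc", "cd"}, {"ab", "bc"}, {"bc", "cd"}]

def support(pattern):
    """Number of input graphs that contain every edge of the pattern."""
    return sum(pattern <= g for g in graphs)

def join(level):
    """Union two size-k patterns that differ in exactly one edge."""
    return {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}

singles = {frozenset([e]) for g in graphs for e in g}
seed = {p for p in singles if support(p) >= 2}
levels = level_wise_mine(seed, join, support, min_count=2)
```

On this toy input the loop yields two levels: the three frequent single edges and the two frequent edge pairs. The only size-3 candidate occurs in one graph and is pruned.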

    FSG: Candidate Generation - I
   Join two frequent size-k subgraphs to get a size-(k+1)
    candidate
       The two subgraphs must share a common connected subgraph of size (k-1)
   Problem
       A size-k graph can have up to k different size-(k-1) subgraphs
       If we consider all possible subgraphs, we will end up
           Generating the same candidates multiple times
           Generating candidates that are not downward closed
           Slowing the algorithm down significantly
       Apriori doesn’t suffer from this problem, due to the
        lexicographic ordering of itemsets
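
The count of size-(k-1) subgraphs can be seen by deleting one edge at a time and keeping the results that stay connected. This is a small sketch, assuming size is measured in edges and a graph is given as a collection of `frozenset({u, v})` edges. It does not merge isomorphic results, and the function names are illustrative.

```python
def is_connected(edges):
    """edges: collection of frozenset({u, v}); vertices are the edge endpoints."""
    edges = list(edges)
    if not edges:
        return True
    vertices = set().union(*edges)
    seen, stack = set(), [next(iter(vertices))]
    while stack:                      # depth-first traversal from one vertex
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(w for e in edges if v in e for w in e if w not in seen)
    return seen == vertices

def connected_one_edge_deletions(edges):
    """All connected subgraphs obtained by deleting exactly one edge."""
    edges = set(edges)
    result = []
    for e in edges:
        rest = edges - {e}            # an endpoint left without edges is dropped
        if is_connected(rest):
            result.append(rest)
    return result

path = [frozenset("ab"), frozenset("bc"), frozenset("cd")]
triangle = [frozenset("ab"), frozenset("bc"), frozenset("ac")]
```

A triangle has three connected two-edge subgraphs, while a three-edge path has only two, because deleting its middle edge disconnects it.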
FSG: Candidate Generation - II

   Joining two size-k subgraphs may produce multiple
    distinct size-(k+1) candidates
       CASE 1: The difference can be a vertex with the same label




FSG: Candidate Generation - III




   CASE 2: Primary subgraph itself may have multiple
    automorphisms
   CASE 3: In addition to joining two different k-graphs,
    FSG also needs to perform self-join
FSG: Candidate Generation Scheme

   For each frequent size-k subgraph Fi, define its
    primary subgraphs:    P(Fi) = {Hi,1, Hi,2}
   Hi,1, Hi,2: the two size-(k-1) subgraphs of Fi with the
    smallest and second smallest canonical labels
   FSG will join two frequent subgraphs Fi and Fj iff
                P(Fi) ∩ P(Fj) ≠ ∅

This approach (TKDE 2004) correctly generates all valid
  candidates and leads to a significant performance
  improvement over the ICDM 2001 paper
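
The join condition is a set intersection on canonical labels. This is a minimal sketch, assuming the canonical labels of each pattern's size-(k-1) subgraphs are already available as strings. The names `primary_subgraphs` and `joinable_pairs` and the sample labels are hypothetical. Self-pairs are included because FSG also performs self-joins.

```python
from itertools import combinations_with_replacement

def primary_subgraphs(sub_labels):
    """Two smallest distinct canonical labels among the size-(k-1) subgraphs."""
    return set(sorted(set(sub_labels))[:2])

def joinable_pairs(frequent):
    """frequent: pattern name -> canonical labels of its size-(k-1) subgraphs."""
    primary = {name: primary_subgraphs(labels) for name, labels in frequent.items()}
    return [(a, b)
            for a, b in combinations_with_replacement(sorted(frequent), 2)
            if primary[a] & primary[b]]   # join only if the primary sets overlap

frequent = {"F1": ["x", "y", "z"], "F2": ["y", "w"], "F3": ["z", "zz"]}
pairs = joinable_pairs(frequent)
```

Here F1 and F3 share the subgraph labeled "z", but "z" is not among the two smallest labels of F1, so the pair is not joined.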

FSG: Frequency Counting
   Naïve way
       Subgraph isomorphism check for each candidate against each graph
        transaction in the database
       Computationally expensive and prohibitive for large datasets
   FSG uses transaction identifier (TID) lists
       For each frequent subgraph, keep a list of the TIDs that support it
   To compute the frequency of the size-(k+1) candidate Gk+1
       Intersect the TID lists of its size-k subgraphs
       If size of intersection < min_support,
           prune Gk+1
       Else
           Subgraph isomorphism check only for graphs in the intersection
   Advantages
       FSG is able to prune candidates without a subgraph isomorphism check
       For large datasets, only those graphs which may potentially contain the
        candidate are checked
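
The counting procedure can be sketched as follows. The function name `count_support` and the callback `contains_candidate`, which stands in for the subgraph isomorphism test on one transaction, are assumptions for illustration. `min_support` is taken as an absolute graph count, and at least one TID list must be supplied.

```python
def count_support(subgraph_tid_lists, min_support, contains_candidate):
    """Return the set of supporting TIDs, or None if the candidate is pruned."""
    # only graphs that contain every size-k subgraph can contain the candidate
    possible = set.intersection(*(set(tids) for tids in subgraph_tid_lists))
    if len(possible) < min_support:
        return None                   # pruned without any isomorphism test
    supporting = {tid for tid in possible if contains_candidate(tid)}
    return supporting if len(supporting) >= min_support else None

tid_lists = [{1, 2, 3, 5}, {2, 3, 5, 8}, {2, 5, 9}]
checked = []

def contains_candidate(tid):
    checked.append(tid)               # record which graphs were actually tested
    return True                       # placeholder for a real isomorphism check

result = count_support(tid_lists, 2, contains_candidate)
```

Only the two transactions in the intersection are tested here, although the lists mention seven distinct transactions.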


Canonical label of graph
   Lexicographically largest (or smallest) string obtained by
    concatenating upper triangular entries of adjacency
    matrix (after symmetric permutation)
   Uniquely identifies a graph and its isomorphs
       Two isomorphic graphs will get same canonical label
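
A brute-force version of this definition is sketched below. It assumes the lexicographically smallest code is chosen, that the vertex labels in the chosen order are written in front of the upper-triangular entries, that absent edges are written as "0", and that labels contain no spaces. These encoding details are illustrative choices, not taken from the FSG paper.

```python
from itertools import permutations

def canonical_label(vertex_labels, edge_labels):
    """vertex_labels: id -> label; edge_labels: (u, v) -> label, undirected."""
    def edge(u, v):
        return edge_labels.get((u, v), edge_labels.get((v, u), "0"))

    best = None
    for order in permutations(vertex_labels):   # every symmetric permutation
        n = len(order)
        code = tuple(
            [vertex_labels[v] for v in order]
            # upper-triangular adjacency entries, row by row
            + [edge(order[i], order[j]) for i in range(n) for j in range(i + 1, n)]
        )
        if best is None or code < best:
            best = code
    return " ".join(best)

g1 = ({0: "A", 1: "B", 2: "A"}, {(0, 1): "x", (1, 2): "y"})
g2 = ({"p": "A", "q": "A", "r": "B"}, {("p", "r"): "y", ("r", "q"): "x"})
g3 = ({0: "A", 1: "B", 2: "A"}, {(0, 1): "x", (1, 2): "x"})
```

`g1` and `g2` differ only in vertex names and edge order, so they receive the same label; `g3` has a different edge label and receives a different one.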




Use of canonical label

   FSG uses canonical labeling to
       Eliminate duplicate candidates
       Check if a particular pattern satisfies monotonicity.
   The naïve approach to finding the canonical
    label is O(|V|!)
       Impractical even for moderate-size graphs




FSG: canonical labeling

   Vertex invariants
       Inherent properties of vertices that don’t change across
        isomorphic mappings
       E.g. degree or label of a vertex
   Use vertex invariants to partition the vertices of a graph into
    equivalence classes
   If vertex invariants partition V into m classes containing p1,
    p2, …, pm vertices respectively, then the number of different
    permutations for canonical labeling is
                   ∏ (pi!)      ; i = 1, 2, …, m
    which can be significantly smaller than the |V|! permutations

FSG canonical label: vertex invariant
   Partition based on vertex degrees and labels

Example: number of permutations = 1! × 2! × 1! = 2
instead of 4! = 24
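
The same arithmetic can be reproduced in code. The slide's figure is not available, so the four-vertex graph below is a hypothetical one chosen to give classes of sizes 1, 2 and 1 when vertices are grouped by (label, degree). The function names are illustrative.

```python
from collections import defaultdict
from math import factorial, prod

def invariant_partitions(vertex_labels, edges):
    """Group vertices by the invariant (label, degree)."""
    degree = defaultdict(int)
    for e in edges:
        for v in e:
            degree[v] += 1
    cells = defaultdict(list)
    for v, label in vertex_labels.items():
        cells[(label, degree[v])].append(v)
    return dict(cells)

def permutations_to_try(vertex_labels, edges):
    """Product of p_i! over the partition classes."""
    cells = invariant_partitions(vertex_labels, edges)
    return prod(factorial(len(cell)) for cell in cells.values())

# Hypothetical graph: a centre vertex "a" joined to three leaves
vertex_labels = {"a": "A", "b": "B", "c": "B", "d": "C"}
edges = [("a", "b"), ("a", "c"), ("a", "d")]
count = permutations_to_try(vertex_labels, edges)
```

Only the two equally labeled leaves can be swapped, so two orderings remain to be compared instead of all 24.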




Next steps

   What are possible applications that you can
    think of?
       Chemistry
       Biology
   We have only looked at “frequent subgraphs”
       What are other measures for similarity between two
        graphs?
       What graph properties do you think would be useful?
       Can we do better if we impose restrictions on
        subgraph?
           Frequent sub-trees
           Frequent sequences
           Frequent approximate sequences

References

   Jiawei Han. Graph mining: Part I Graph
    Pattern Mining.
   George Karypis. Mining Scientific Data Sets
    Using Graphs.
   Sangameshwar Patil. Introduction to Graph
    Mining.





				