
									VLDB Database School (China) 2010
   August 3-7, 2010, Shenyang




     Lecture Notes
                Part 1


Mining and Searching Complex
         Structures



       Anthony K.H. Tung(邓锦浩)
          School of Computing
     National University of Singapore
      www.comp.nus.edu.sg/~atung
 Mining and Searching Complex Structures



                         Contents


Chapter 1: Introduction ------------------------------------------ 1

Chapter 2: High Dimensional Data ------------------------- 34

Chapter 3: Similarity Search on Sequences ------------ 110

Chapter 4: Similarity Search on Trees ------------------- 156

Chapter 5: Graph Similarity Search ---------------------- 175

Chapter 6: Massive Graph Mining ------------------------ 234




                   Mining and Searching Complex
                            Structures
                                         Introduction
                              Anthony K. H. Tung(鄧锦浩)
                                  School of Computing
                             National University of Singapore
                              www.comp.nus.edu.sg/~atung




          Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
          Social Network Link: http://www.renren.com/profile.do?id=313870900




            What is data mining?
            Really nothing different from what scientists have been doing all
            along: observe the real world, generate and collect data, and verify
            or construct a model of the real world. A correct, useful model can
            win a Nobel Prize.
            What's new? Data mining feeds the data in and outputs the most
            likely model based on some statistical measure, systematically and
            efficiently testing many statistical models.







             Components of data mining
          Structure of the model
             geneA=high and geneB=low ===> cancer
             geneA, geneB and geneC exhibit strong correlation
          Statistical score for the model
             Accuracy of rule 1 is 90%
             Similarity function: is there a sufficiently similar group of records
             that supports a certain model or hypothesis?
          Search method for the correct model parameters
             Given 200 genes, there could be 2^200 rules. Which rule gives the
             best predictive power?
          Database access method
             Given 1 million records, how to quickly find the relevant records to
             compute the accuracy of a rule?




                  The Apriori Algorithm

          • Bottom-up, breadth-first search
          • Only reads are performed on the database
          • Store candidates in memory to simulate the lattice search
          • Iteratively follow two steps:
            – generate candidates
            – count and keep the actually frequent itemsets

          [Lattice over items {a, b, c, e}, searched bottom-up from the empty
           itemset {} through the 1-, 2- and 3-itemsets up to {a,b,c,e}]



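          A minimal Python sketch of the two Apriori steps above (candidate
          generation and support counting). The toy transactions and the
          minimum-support threshold below are made up for illustration; the
          items a, b, c, e follow the lattice shown on the slide.

             from itertools import combinations

             def apriori(transactions, min_sup):
                 """Bottom-up, breadth-first frequent-itemset mining (sketch)."""
                 transactions = [frozenset(t) for t in transactions]
                 current = {frozenset([i]) for t in transactions for i in t}
                 frequent, k = {}, 1
                 while current:
                     # count candidates and keep the actually frequent k-itemsets
                     counts = {c: sum(1 for t in transactions if c <= t) for c in current}
                     level = {c: n for c, n in counts.items() if n >= min_sup}
                     frequent.update(level)
                     # generate (k+1)-candidates by joining frequent k-itemsets,
                     # pruning any candidate with an infrequent k-subset
                     current = {a | b for a in level for b in level if len(a | b) == k + 1}
                     current = {c for c in current
                                if all(frozenset(s) in level for s in combinations(c, k))}
                     k += 1
                 return frequent

             # e.g. apriori([{'a','b','c','e'}, {'a','b','e'}, {'b','c','e'}], min_sup=2)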





                   The K-Means Clustering Method

             • Given k, the k-means algorithm is implemented in 4
               steps:
                –Partition objects into k nonempty subsets
                –Compute seed points as the centroids of the clusters of the
                current partition. The centroid is the center (mean point) of the
                cluster.
                –Assign each object to the cluster with the nearest seed point.
                –Go back to Step 2; stop when there are no more new assignments.



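             A small Python/NumPy sketch of the four steps above. The random
             initialization, the data and the convergence test are illustrative
             choices, not prescribed by the slide (empty clusters are not handled).

                import numpy as np

                def k_means(points, k, seed=0):
                    """Assign to the nearest centroid, recompute centroids, repeat."""
                    rng = np.random.default_rng(seed)
                    # Step 1: an initial partition, here seeded with k random points
                    centroids = points[rng.choice(len(points), k, replace=False)]
                    while True:
                        # Step 3: assign each object to the cluster with the nearest seed point
                        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
                        labels = dists.argmin(axis=1)
                        # Step 2: recompute seed points as centroids of the current clusters
                        new_centroids = np.array([points[labels == j].mean(axis=0)
                                                  for j in range(k)])
                        # Step 4: stop when the assignment no longer changes the centroids
                        if np.allclose(new_centroids, centroids):
                            return labels, centroids
                        centroids = new_centroids

                # e.g. labels, centers = k_means(np.random.rand(100, 2), k=3)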





                        The K-Means Clustering Method
                  • Example
                    [Four scatter plots (axes 0–10) illustrating successive
                     iterations of k-means on a small 2-D example]








             Training Dataset (Decision Tree)
                      Outlook     Temp    Humid     Wind      PlayTennis
                       Sunny      Hot     High      Weak      No
                       Sunny      Hot     High      Strong    No
                      Overcast    Hot     High      Weak      Yes
                       Rain       Mild    High      Weak      Yes
                       Rain       Cool    Normal    Weak      Yes
                       Rain       Cool    Normal    Strong    No
                      Overcast    Cool    Normal    Strong    Yes
                       Sunny      Mild    High      Weak      No
                       Sunny      Cool    Normal    Weak      Yes
                       Rain       Mild    Normal    Weak      Yes
                       Sunny      Mild    Normal    Strong    Yes
                      Overcast    Mild    High      Strong    Yes
                      Overcast    Hot     Normal    Weak      Yes
                       Rain       Mild    High      Strong    No




             Selecting the Next Attribute
                          S=[9+,5-]                               S=[9+,5-]
                          E=0.940                                 E=0.940
                          Humidity                                Wind


                      High       Normal                       Weak        Strong

                  [3+, 4-]           [6+, 1-]             [6+, 2-]            [3+, 3-]
                E=0.985           E=0.592                 E=0.811       E=1.0
               Gain(S,Humidity)                           Gain(S,Wind)
               =0.940-(7/14)*0.985                        =0.940-(8/14)*0.811
                – (7/14)*0.592                             – (6/14)*1.0
               =0.151                                     =0.048


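             The entropy and gain values on this slide can be reproduced with a
             few lines of Python. The (value, label) pairs below encode the
             Humidity split of S = [9+, 5-] shown above; only the counts matter.

                from math import log2
                from collections import Counter

                def entropy(labels):
                    """E(S) = -sum p_i * log2(p_i) over the class proportions."""
                    n = len(labels)
                    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

                def gain(pairs):
                    """Information gain of an attribute, given (value, label) pairs."""
                    labels = [y for _, y in pairs]
                    n = len(pairs)
                    by_value = {}
                    for v, y in pairs:
                        by_value.setdefault(v, []).append(y)
                    return entropy(labels) - sum(len(ys) / n * entropy(ys)
                                                 for ys in by_value.values())

                # Humidity: High has [3+, 4-], Normal has [6+, 1-]
                humidity = [('High', '+')] * 3 + [('High', '-')] * 4 + \
                           [('Normal', '+')] * 6 + [('Normal', '-')] * 1
                print(round(gain(humidity), 3))   # 0.152, i.e. Gain(S,Humidity) ≈ 0.151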






              Selecting the Next Attribute
                                      S=[9+,5-]
                                      E=0.940
                                      Outlook

                                           Over
                              Sunny                    Rain
                                           cast

                        [2+, 3-]        [4+, 0]          [3+, 2-]
                         E=0.971       E=0.0             E=0.971
                            Gain(S,Outlook)
                            =0.940-(5/14)*0.971
                             -(4/14)*0.0 – (5/14)*0.971
                            =0.247




            ID3 Algorithm
                       [D1,D2,…,D14]           Outlook
                         [9+,5-]

                                   Sunny      Overcast          Rain


          Ssunny=[D1,D2,D8,D9,D11]         [D3,D7,D12,D13] [D4,D5,D6,D10,D14]
                  [2+,3-]                    [4+,0-]         [3+,2-]
                        ?                         Yes                  ?
          Gain(Ssunny , Humidity)=0.970-(3/5)0.0 – 2/5(0.0) = 0.970
          Gain(Ssunny , Temp.)=0.970-(2/5)0.0 –2/5(1.0)-(1/5)0.0 = 0.570
          Gain(Ssunny , Wind)=0.970 -(2/5)1.0 – 3/5(0.918) = 0.019









             Decision Tree for PlayTennis

                                                 Outlook


                                  Sunny         Overcast           Rain


                    Humidity                          Yes                      Wind


              High           Normal                                 Strong            Weak

             No                    Yes                              No                     Yes


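             The learned tree corresponds directly to nested conditionals; a
             minimal Python transcription of the tree above (attribute values as
             plain strings):

                def play_tennis(outlook, humidity, wind):
                    """Decision tree for PlayTennis, read straight off the slide."""
                    if outlook == 'Overcast':
                        return 'Yes'
                    if outlook == 'Sunny':
                        return 'Yes' if humidity == 'Normal' else 'No'
                    if outlook == 'Rain':
                        return 'Yes' if wind == 'Weak' else 'No'

                # e.g. play_tennis('Sunny', 'High', 'Weak') -> 'No',
                # matching the first row of the training dataset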


                  Can we fit what we learn into the framework?

                                      Apriori                   K-means                ID3
           task                       rule/pattern discovery    clustering             classification
           structure of the model     association rules         clusters               decision tree
           or pattern
           search space               lattice of all possible   choice of any k        all possible
                                      combinations of items;    points as centers;     decision trees;
                                      size = 2^m                size = infinity        size = potentially
                                                                                       infinite
           score function             support, confidence       square error           accuracy,
                                                                                       information gain
           search/optimization        breadth-first with        gradient descent       greedy
           method                     pruning
           data management            TBD                       TBD                    TBD
           technique








             Components of data mining(II)



                                     Models Enumeration
                                          Algorithm


                                  Statistical Score Function
                                  Similarity/Search Function
                                   Database Access Method



                                          Database




             Background knowledge
              • We assume you have some basic knowledge about data
                 mining; the slides linked below will be very useful for
                 brushing up on it
              • Association Rule Mining
              http://www.comp.nus.edu.sg/~atung/Renmin56.pdf
              • Classification and Regression
              http://www.comp.nus.edu.sg/~atung/Renmin67.pdf
              • Clustering
              http://www.comp.nus.edu.sg/~atung/Renmin78.pdf








            IT Trend
          Processors are cheap and will become cheaper (multi-core processors,
            graphics cards)
          Storage will be cheap but might not be fast
          Bandwidth will keep growing
          What can we do with this?
             Play more realistic games!
                    Not exactly a joke, since any technology that speeds up games can speed up
                      scientific simulation
             Smarter (more intensive) computation
             Can store more personal semantics/ontologies
             People can collaborate more over the Internet (Flickr, Wikipedia) to make
               things more intelligent
          The AI dream now has the support of much better hardware
          Essentially, data mining can be made much simpler for the man on the
            street
          Data mining should be human-centered, not machine-centered




          What is complex data?
            What is "simple" data? A regular tabular table with a small number of
            attributes (of the same type) and no missing values.
            What are complex data?
            High dimensional data: lots of attributes of different data types, with
            missing values, e.g.

              Test1   Gene1   Progress        comments
              Pos     2.0     Fever           ……
              Neg     -0.3    Unconscious
              N.A     5.7

            Sequences / time series          Trees          Graphs








              Why complex data?
            They come naturally in many applications, bringing research nearer to the real
            world
            Lots of challenges, which means more fun!
            Some fundamental challenges:
                 How do you compare complex objects effectively and efficiently?
                 How do you find special subsets of the data that are interesting?
                 What new types of models and score functions must you use?
                 How do you handle noise and error?

            [Examples: a table with mixed and missing values (Test1/Gene1/Progress), and
             two labeled trees T1 and T2 over nodes a, b, c, d, e]




          Personalized Semantics for Personal Data Management
            Everyone will own terabytes of data soon
            Improve the query/search interface by mining and extracting personalized
            semantics (entities and their relationships, etc.), comparing personal data
            against high quality tagged databases

            [Diagram: high quality data sources such as Wikipedia supply entities (authors,
             singers, actors/actresses, songs, papers, movies, places) that form a semantic
             layer over personal data (documents, audio/music, video, photographs/images,
             webpages/blogs/bookmarks), supporting query by documents, by audio/music,
             by video, and by photographs/images]








          Integrated Approach to Mining Software Engineering Data
            Software engineering data: code base, change history, bug reports, runtime traces
            Integrated into a data warehouse to support decision making and mining
            Example: Which code module should I modify to create a new function? Which
            module needs maintenance?

            [Diagram: software engineering data (code bases, change history, program states,
             structural entities, bug reports/natural language, …) are integrated into a data
             warehouse; mining techniques (classification, association/patterns, clustering, …)
             then help software engineering tasks (programming, defect detection, testing,
             debugging, maintenance, …)]




          WikiScience
            Web 2.0: Facebook for scientists
            A collaborative platform for scientists to build scientific models/hypotheses and
            to share data and applications

            [Diagram: one scientist builds Model A with a supporting dataset tagged to it
             ("This is my model of the solar system based on my supporting dataset");
             another, based on some articles, makes changes to Model A to create Model B,
             with supporting articles tagged to Model B; the system constructs a centralized
             hybrid Model C from both]








          Hey, why not Cloud Computing, Map/Reduce?
              • These are platforms for scaling up services to a large
                number of users on large amounts of data
              • But what exactly do you want to scale up?
              • Services that provide useful and semantically
                correct information to the users
              • We have too many scalable data mining
                algorithms that find nothing, or find too many things
              • Let's focus on finding useful things first
                (assuming we have lots of processing power) and
                then try to scale up




             Schedule of the Course
             Date/Time   Content
             Lesson 1    Introduction
             Lesson 2    Mining and Search High Dimensional Data I
             Lesson 3    Mining and Search High Dimensional Data II
             Lesson 4    Mining and Search High Dimensional Data III
             Lesson 5    Similarity Search for Sequences and Trees I
             Lesson 6    Similarity Search for Sequences and Trees II
             Lesson 7    Similarity Search for Graph I
             Lesson 8    Similarity Search for Graph II
             Lesson 9    Similarity Search for Graph III
             Lesson 10   Mining Massive Graph I
             Lesson 11   Mining Massive Graph II
             Lesson 12   Mining Massive Graph III








              Focus of the course
              • Techniques that can handle high dimensional, complex
                structures
                –Providing semantics to similarity search
                –Shotgun and Assembly: Column/Feature Wise Processing using
                Inverted Index
                –Row-wise Enumeration
                –Using local properties to infer global properties
              • Throughout the course, please try to think of how these
                techniques are applicable across different types of complex
                structures




             Database Queries

                  To start off, we will consider something very basic called
                  ranking queries, since any similarity search needs ranking
                  (usually from most similar to most dissimilar)
                  In a relational database, SQL returns all results in one go
                  How many tuples can fit on one screen?
                  How many tuples can you remember?
                  Options:
                    Summarize the results
                    Display representative tuples
                  How do we select representative tuples?








             Retrieve Relevant Information

                  Search for videos related to the Shanghai Expo
                  Too many results: every time you click "next", there are 20
                  more new results
                  Are we interested in all the results?
                  No, only the most relevant ones
                  Search engines have to rank the results, which is how they
                  make their money




             Question: How to Select a Small Result Set

                 Selecting the most representative or most interesting results
                 is not trivial
                 Find an apartment with rental cheaper than 1000, the
                 cheaper the better
                   The result tuples can be sorted in the ascending order of rental prices,
                   those in front are more favorable
                 Find an apartment with rental cheaper than 1000 near NEU,
                 the lower the better, the nearer the better
                   Apartment with lower rent may not be near, nearer one may not be
                   cheap
                   Order by prices? Order by distances?








             Top-k Queries
                 Define a scoring function, which maps a tuple to a real
                 number, as a score
                 The higher the score is, the more favorable the tuple is
                 Define an integer k
                 Answer: k objects with highest scores
                  Different scoring functions may give different top-k results
                                             Price        Distance to NEU
                        Apartment A          $800            500 meters
                        Apartment B         $1200            200 meters

                  Given k = 1, and noting that price and distance are costs here
                  (so a lower combined value is more favorable): if the scoring
                  function is the sum of price and distance, the first tuple is
                  better (1300 vs 1400); if it is the product, the second tuple
                  is better (400,000 vs 240,000)




             Brute Force Top-k

                 Compute scores for each result tuple
                 Sort the tuples according to the descending order of the
                 scores
                 Select the first k tuples
                  What if the number of tuples is unlimited? Search engines
                  can return an unlimited number of results
                  Even if the number of tuples is limited, it is too slow to
                  compute the score of every tuple
                  We have to do it efficiently


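             A brute-force top-k in Python, exactly as described above. The
             tuples and the scoring function are illustrative; since price and
             distance are costs, the example negates their sum so that higher
             scores are more favorable.

                def brute_force_topk(tuples, score, k):
                    """Score every tuple, sort by score descending, keep the first k."""
                    return sorted(tuples, key=score, reverse=True)[:k]

                apartments = [('A', 800, 500), ('B', 1200, 200)]   # (name, price, distance)
                print(brute_force_topk(apartments, lambda t: -(t[1] + t[2]), k=1))
                # -> [('A', 800, 500)]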






             Outline

                  Two well-known top-k algorithms
                    Fagin's Algorithm (FA)
                    The Threshold Algorithm (TA)
                  Take random access into consideration
                    No Random Access Algorithm (NRA)
                    The Combined Algorithm (CA)




             Monotonicity

                  A score function f is monotone if f(x1,x2,...,xm)≤f(y1,y2,...,ym)
                  whenever xi≤yi for every i
                  Select top-3 students with highest total score in mathematics,
                  physics and computer science:
                     select name, math+phys+comp as score
                     from student
                     order by score desc limit 3

                  sum(x.math,x.phys,x.comp)≤sum(y.math,y.phys,y.comp) if
                  x.math≤y.math and x.phys≤y.phys and x.comp≤y.comp








             Sorted Lists

                 We shall think of a database consisting of m sorted lists L1,
                 L2, … Lm

                        Lmath                    Lphys           Lcomp
                     Ann    98            Hugh         97     Kurt    96
                     Ben    96            Ryan         94     Ann     95
                     Kurt   93            Ann          92     Jane    95
                    Hugh    91            Kurt         91     Ben     93
                     Carl   90            Jane         89     Hugh    92
                      ...       ...        ...         ...     ...    ...
                      ...       ...        ...         ...     ...    ...
                      ...       ...        ...         ...     ...    ...




             Outline

                 Two well-known top-k algorithms
                   Fagin's Algorithm (FA)
                   The Threshold Algorithm (TA)
                 Take random access into consideration
                   No Random Access Algorithm (NRA)
                   The Combined Algorithm (CA)








             Fagin's Algorithm (I)

                 Do sequential access until there are at least k matches

                     Ann    98            Hugh        97     Kurt    96
                     Ben    96            Ryan        94     Ann     95
                     Kurt   93            Ann         92     Jane    95
                    Hugh    91            Kurt        91     Ben     93
                     Carl   90            Jane        89     Hugh    92
                      ...    ...           ...        ...     ...    ...
                      ...    ...           ...        ...     ...    ...
                      ...    ...           ...        ...     ...    ...


                  Sequential accesses are stopped once 3 students have been seen
                  in every list, i.e. Ann, Hugh and Kurt




             Fagin's Algorithm (II)

                 For each object that has been seen, do random accesses on
                 other lists to compute its score

                     Ann    98            Hugh        97     Kurt    96
                     Ben    96            Ryan        94     Ann     95
                     Kurt   93            Ann         92     Jane    95
                    Hugh    91            Kurt        91     Ben     93
                     Carl   90            Jane        89     Hugh    92
                      ...    ...           ...        ...     ...    ...
                      ...    ...           ...        ...     ...    ...
                      ...    ...           ...        ...     ...    ...


                Random accesses need to be done for Ben, Carl, Jane and
                Ryan








             Fagin's Algorithm (III)

                 Select the k objects with highest score as top-k result


                     Ann     98           Hugh        97      Kurt    96
                     Ben     96           Ryan        94      Ann     95
                     Kurt    93           Ann         92      Jane    95
                    Hugh     91           Kurt        91      Ben     93
                     Carl    90           Jane        89      Hugh    92
                      ...    ...           ...        ...      ...    ...
                      ...    ...           ...        ...      ...    ...
                      ...    ...           ...        ...      ...    ...




             Why is FA correct? (I)

                 There are at least k objects seen on all attributes when
                 sequential access is stopped
                 By monotonicity, those objects that are not seen do not have
                 higher score than the above k objects

                     Ann     98           Hugh        97      Kurt    96
                     Ben     96           Ryan        94      Ann     95
                     Kurt    93           Ann         92      Jane    95
                    Hugh     91           Kurt        91      Ben     93
                     Carl    90           Jane        89      Hugh    92
                      ...    ...           ...        ...      ...     ...
                      ...    ...           ...        ...      ...     ...
                      ...    ...           ...        ...      ...     ...








             Why is FA correct? (II)

                  For each object that has been seen, either all of its attributes
                  have been seen, or random accesses are performed to obtain the
                  remaining attributes
                 The k objects with highest scores are therefore the top-k
                 result

                     Ann     98           Hugh        97       Kurt    96
                     Ben     96           Ryan        94       Ann     95
                     Kurt    93           Ann         92       Jane    95
                    Hugh     91           Kurt        91       Ben     93
                     Carl    90           Jane        89       Hugh    92
                      ...    ...           ...        ...       ...    ...
                      ...    ...           ...        ...       ...    ...
                      ...    ...           ...        ...       ...    ...


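             A compact Python sketch of FA. Each sorted list is assumed to be a
             list of (object, value) pairs in descending value order, and random
             access is simulated by a dict lookup; the five-row lists are the
             truncated example from the slides, so .get(..., 0) stands in for
             values that fall below the cut.

                def fagin(sorted_lists, k, agg=sum):
                    """Sequential access until k objects are seen in every list, then
                    random-access the missing attributes and return the best k."""
                    m = len(sorted_lists)
                    lookup = [dict(lst) for lst in sorted_lists]    # for random access
                    seen = {}                 # object -> set of lists it was seen in
                    depth = 0
                    while sum(1 for s in seen.values() if len(s) == m) < k:
                        for i, lst in enumerate(sorted_lists):
                            obj, _ = lst[depth]
                            seen.setdefault(obj, set()).add(i)
                        depth += 1
                    scores = {o: agg(lookup[i].get(o, 0) for i in range(m)) for o in seen}
                    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

                math = [('Ann', 98), ('Ben', 96), ('Kurt', 93), ('Hugh', 91), ('Carl', 90)]
                phys = [('Hugh', 97), ('Ryan', 94), ('Ann', 92), ('Kurt', 91), ('Jane', 89)]
                comp = [('Kurt', 96), ('Ann', 95), ('Jane', 95), ('Ben', 93), ('Hugh', 92)]
                print(fagin([math, phys, comp], k=3))
                # -> [('Ann', 285), ('Hugh', 280), ('Kurt', 280)]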


             Outline

                 Two well-known top-k algorithms
                   Fagin's Algorithm (FA)
                   The Threshold Algorithm (TA)
                 Take random access into consideration
                   No Random Access Algorithm (NRA)
                   The Combined Algorithm (CA)








             The Threshold Algorithm (I)

                 Do sequential access on all lists. If an object is seen, do
                 random access to the other lists to compute its score

                     Ann     98           Hugh         97       Kurt    96
                     Ben     96           Ryan         94       Ann     95
                     Kurt    93            Ann         92       Jane    95
                    Hugh     91            Kurt        91       Ben     93
                     Carl    90           Jane         89      Hugh     92
                      ...    ...            ...        ...       ...    ...
                      ...    ...            ...        ...       ...    ...
                      ...    ...            ...        ...       ...    ...


                 Random accesses on Ann, Hugh and Kurt first, then on Ben
                 and Ryan




             The Threshold Algorithm (II)

                 Remember the k objects with highest scores, together with
                 their scores
                     Ann     98           Hugh         97       Kurt    96
                     Ben     96           Ryan         94       Ann     95
                     Kurt    93            Ann         92       Jane    95
                    Hugh     91            Kurt        91       Ben     93
                     Carl    90           Jane         89      Hugh     92
                      ...    ...            ...        ...       ...    ...
                      ...    ...            ...        ...       ...    ...
                      ...    ...            ...        ...       ...    ...

                 Score (Ann) = 285
                 Score (Hugh) = 280
                 Score (Kurt) = 280








             The Threshold Algorithm (III)

             • Let the threshold value τ be the aggregation function applied to
               the last seen values of all sorted lists
             • As soon as there are at least k objects with score at least τ, halt


                   Ann     98             Hugh    97      Kurt    96
                                                                         τ(1) = 291
                   Ben     96             Ryan    94      Ann     95     τ(2) = 285
                   Kurt    93             Ann     92      Jane    95     τ(3) = 280
                   Hugh    91             Kurt    91      Ben     93
                   Carl    90             Jane    89      Hugh    92
                    ...    ...             ...    ...      ...    ...
                    ...    ...             ...    ...      ...    ...
                    ...    ...             ...    ...      ...    ...




             Why is TA correct?
             • By monotonicity, unseen objects cannot have a score higher
               than τ
             • For those that have been seen, random accesses are
               performed, the k objects with highest scores are therefore the
               top-k result

                   Ann     98             Hugh    97      Kurt    96
                                                                         τ(1) = 291
                   Ben     96             Ryan    94      Ann     95     τ(2) = 285
                   Kurt    93             Ann     92      Jane    95     τ(3) = 280
                   Hugh    91             Kurt    91      Ben     93
                   Carl    90             Jane    89      Hugh    92
                    ...    ...             ...    ...      ...    ...
                    ...    ...             ...    ...      ...    ...
                    ...    ...             ...    ...      ...    ...


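             A sketch of TA in Python on the same truncated lists (random access
             again simulated with a dict; .get(..., 0) covers values below the
             cut). On this data it halts after the third round of sequential
             accesses, when τ drops to 280, matching τ(3) = 280 on the slide.

                import heapq

                def threshold_algorithm(sorted_lists, k, agg=sum):
                    """Round-robin sequential access; random-access each newly seen
                    object; halt once k objects have score at least the threshold τ."""
                    m = len(sorted_lists)
                    lookup = [dict(lst) for lst in sorted_lists]
                    scores = {}                               # object -> exact score
                    for depth in range(len(sorted_lists[0])):
                        last = []                             # last seen value per list
                        for i, lst in enumerate(sorted_lists):
                            obj, val = lst[depth]
                            last.append(val)
                            if obj not in scores:             # random accesses for obj
                                scores[obj] = agg(lookup[j].get(obj, 0) for j in range(m))
                        tau = agg(last)                       # threshold value τ
                        top = heapq.nlargest(k, scores.values())
                        if len(top) == k and top[-1] >= tau:
                            break
                    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

                math = [('Ann', 98), ('Ben', 96), ('Kurt', 93), ('Hugh', 91), ('Carl', 90)]
                phys = [('Hugh', 97), ('Ryan', 94), ('Ann', 92), ('Kurt', 91), ('Jane', 89)]
                comp = [('Kurt', 96), ('Ann', 95), ('Jane', 95), ('Ben', 93), ('Hugh', 92)]
                print(threshold_algorithm([math, phys, comp], k=3))
                # -> [('Ann', 285), ('Hugh', 280), ('Kurt', 280)]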






             Comparing TA with FA

             • Number of sequential accesses
                    At the time FA stops doing sequential accesses, τ is guaranteed
                    to be no higher than the scores of the k objects seen on all
                    sorted lists, so TA stops no later than FA
             • Number of random accesses
                   TA requires m-1 random accesses for each object
                   But FA is expected to random access more objects
             • Size of buffers used
                   Buffer used by FA can be unbounded
                   TA only needs to remember k objects with k scores, and the
                   threshold value τ




             Outline

                 Two well-known top-k algorithms
                   Fagin's Algorithm (FA)
                   The Threshold Algorithm (TA)
                 Take random access into consideration
                   No Random Access Algorithm (NRA)
                   The Combined Algorithm (CA)








             Random Access

                  Random accesses may be impossible
                     Text retrieval: the sorted lists are results returned by search engines
                  Random accesses are expensive
                     Sequential accesses on disk are orders of magnitude faster
                     than random accesses
                  We need to consider not using random accesses at all, or using
                  as few of them as possible




             No Random Access
                Without random access, all we know are the upper bounds

                        Lmath                    Lphys            Lcomp
                     Ann    98            Hugh         97      Kurt   96
                     Ben    96            Ryan         94      Ann    95
                     Kurt   93            Ann          92     Jane    95
                    Hugh    91            Kurt         91      Ben    93
                     Carl   90            Jane         89     Hugh    92
                      ...       ...        ...         ...      ...    ...
                      ...       ...        ...         ...      ...    ...
                      ...       ...        ...         ...      ...    ...


                Carl’s scores on physics and computer science are not higher
                than 89 and 92 respectively








             Lower and Upper Bounds
                If an object has not been seen on one attribute
                    Lower bound is 0
                    Upper bound is the last seen value
                     Ann    98             Hugh        97    Kurt   96
                     Ben    96             Ryan        94    Ann    95
                     Kurt   93             Ann         92    Jane   95
                    Hugh    91             Kurt        91    Ben    93
                     Carl   90             Jane        89    Hugh   92
                      ...    ...            ...        ...    ...   ...
                      ...    ...            ...        ...    ...   ...
                      ...    ...            ...        ...    ...   ...


                 The lower bound of Carl’s score on physics is 0
                 The upper bound of Carl’s score on physics is 89




             Worst and Best Scores (I)
                W (R): The worst possible score of tuple R
                B (R): The best possible score of tuple R

                     Ann    98             Hugh        97    Kurt   96
                     Ben    96             Ryan        94    Ann    95
                     Kurt   93             Ann         92    Jane   95
                    Hugh    91             Kurt        91    Ben    93
                     Carl   90             Jane        89    Hugh   92
                      ...    ...            ...        ...    ...   ...
                      ...    ...            ...        ...    ...   ...
                      ...    ...            ...        ...    ...   ...


                 W (Carl) = 90
                 B (Carl) = 90 + 89 + 92








             Worst and Best Scores (II)
                W (R) ≤ Score of R ≤ B (R)
                W (R) and B (R) get updated as the values of R are sequentially
                accessed
                     Ann        98            Hugh         97            Kurt            96
                     Ben        96            Ryan         94            Ann             95
                     Kurt       93             Ann         92           Jane             95
                    Hugh        91             Kurt        91            Ben             93
                     Carl       90            Jane         89           Hugh             92
                      ...        ...            ...        ...              ...           ...
                      ...        ...            ...        ...              ...           ...
                      ...        ...            ...        ...              ...           ...

                                       Ann                   Hugh                 Kurt
                            W           98                       97                96
                            B           291                      291              291




             Worst and Best Scores (II)
                W (R) ≤ Score of R ≤ B (R)
                W (R) and B (R) get updated as the values of R are sequentially
                accessed
                     Ann        98            Hugh         97            Kurt            96
                     Ben        96            Ryan         94            Ann             95
                     Kurt       93             Ann         92           Jane             95
                    Hugh        91             Kurt        91            Ben             93
                     Carl       90            Jane         89           Hugh             92
                      ...        ...            ...        ...              ...           ...
                      ...        ...            ...        ...              ...           ...
                      ...        ...            ...        ...              ...           ...

                                Ann     Hugh               Kurt        Ben          Ryan
                     W      98→193        97                 96        96               94
                     B      291→287    291→288         291→286         285              285








             Worst and Best Scores (II)
                W (R) ≤ Score of R ≤ B (R)
                W (R) and B (R) get updated as the values of R are sequentially
                accessed
                     Ann          98           Hugh        97          Kurt    96
                     Ben          96           Ryan        94          Ann     95
                     Kurt         93           Ann         92          Jane    95
                    Hugh          91           Kurt        91          Ben     93
                     Carl         90           Jane        89          Hugh    92
                      ...         ...           ...        ...          ...    ...
                      ...         ...           ...        ...          ...    ...
                      ...         ...           ...        ...          ...    ...

                              Ann       Hugh       Kurt       Ben       Ryan      Jane
                     W      193→285       97      96→189        96        94        95
                     B      287→285    288→285   286→281    285→283   285→282      280




             Outline

                 Two well-known top-k algorithms
                   Fagin's Algorithm (FA)
                   The Threshold Algorithm (TA)
                 Take random access into consideration
                   No Random Access Algorithm (NRA)
                   The Combined Algorithm (CA)








             No Random Access Algorithm (I)
                Maintain the last-seen values x1,x2,…,xm
                For every seen object, maintain its worst possible score, its
                known attributes and their values

                     Ann          98           Hugh        97          Kurt    96
                     Ben          96           Ryan        94          Ann     95
                     Kurt         93           Ann         92          Jane    95
                    Hugh          91           Kurt        91          Ben     93
                     Carl         90           Jane        89          Hugh    92
                      ...         ...           ...        ...          ...    ...
                      ...         ...           ...        ...          ...    ...
                      ...         ...           ...        ...          ...    ...

                 xmath = 96; xphys = 94; xcomp = 95
                 Ann:193:{<Math:98>;<Comp:95>}




             No Random Access Algorithm (II)
                Why not maintain the best possible score for each object?

                     Ann          98           Hugh        97          Kurt    96
                     Ben          96           Ryan        94          Ann     95
                     Kurt         93           Ann         92          Jane    95
                    Hugh          91           Kurt        91          Ben     93
                     Carl         90           Jane        89          Hugh    92
                      ...         ...           ...        ...          ...    ...
                      ...         ...           ...        ...          ...    ...
                      ...         ...           ...        ...          ...    ...

                              Ann       Hugh       Kurt       Ben       Ryan      Jane
                     W      193→285       97      96→189        96        94        95
                     B      287→285    288→285   286→281    285→283   285→282      280

                            Too Frequently Updated!







             No Random Access Algorithm (III)
                Let M be the kth largest W value
                An object R is viable if B (R) ≥ M
                     Ann             98              Hugh        97           Kurt      96
                     Ben             96              Ryan        94           Ann       95
                     Kurt            93              Ann         92          Jane       95
                    Hugh             91              Kurt        91           Ben       93
                     Carl            90              Jane        89          Hugh       92
                        ...          ...              ...        ...           ...          ...
                        ...          ...              ...        ...           ...          ...
                        ...          ...              ...        ...           ...          ...

                              Ann       Hugh       Kurt       Ben       Ryan      Jane
                     W        285      97→188    189→280    96→189        94        95        M = 189
                     B        285     285→280    281→280    283→280   282→278   280→277




             No Random Access Algorithm (III)
                Let M be the kth largest W value
                An object R is viable if B (R) ≥ M

                     Ann             98              Hugh        97           Kurt      96
                     Ben             96              Ryan        94           Ann       95
                     Kurt            93              Ann         92          Jane       95
                    Hugh             91              Kurt        91           Ben       93
                     Carl            90              Jane        89          Hugh       92
                        ...          ...              ...        ...           ...          ...
                        ...          ...              ...        ...           ...          ...
                        ...          ...              ...        ...           ...          ...

                              Ann       Hugh       Kurt       Ben       Ryan      Jane
                     W        285     188→280       280       188         94     95→184      M = 280
                     B        285     285→280       280     280→278   278→276   277→274








             No Random Access Algorithm (IV)
                Let set T contain objects with W (R) ≥ M
                Halt when
                  There are at least k objects seen on all sorted lists
                  No viable objects left outside set T


                           Ann       Hugh       Kurt       Ben       Ryan      Jane
                     W     285     188→280       280       188         94     95→184      M = 280
                     B     285     285→280       280     280→278   278→276   277→274


                 T = {Ann, Hugh, Kurt}




             Why is NRA correct?

                W (R) ≤ Score of R ≤ B (R) always holds
                 If an object R is not viable, Score of R ≤ B (R) < M, so
                 there are at least k objects with scores not lower than R's
                Therefore, if there is no viable object outside T and T
                contains at least k objects, T is the set of top-k result


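             A sketch of NRA in Python on the same truncated lists: sequential
             accesses only, maintaining worst (W) and best (B) scores for every
             seen object. On this data it halts with T = {Ann, Hugh, Kurt}, as on
             the slides; with full lists the loop always reaches its halting
             condition, while here the five-row prefix happens to be enough.

                def nra(sorted_lists, k, agg=sum):
                    """No Random Access: track W and B; stop once k objects are fully
                    seen and no object outside T = {R : W(R) >= M} is still viable."""
                    m = len(sorted_lists)
                    seen = {}                           # object -> {list index: value}
                    for depth in range(len(sorted_lists[0])):
                        for i, lst in enumerate(sorted_lists):
                            obj, val = lst[depth]
                            seen.setdefault(obj, {})[i] = val
                        last = [lst[depth][1] for lst in sorted_lists]       # x1 .. xm
                        # worst score: unseen attributes count as 0; best: as last-seen values
                        W = {o: agg(v.values()) for o, v in seen.items()}
                        B = {o: agg(v.get(i, last[i]) for i in range(m)) for o, v in seen.items()}
                        top = sorted(W, key=W.get, reverse=True)[:k]
                        M = W[top[-1]]                                       # k-th largest W
                        k_fully_seen = sum(1 for v in seen.values() if len(v) == m) >= k
                        T = {o for o in W if W[o] >= M}
                        if k_fully_seen and all(B[o] < M for o in W if o not in T):
                            return [(o, W[o]) for o in top]

                math = [('Ann', 98), ('Ben', 96), ('Kurt', 93), ('Hugh', 91), ('Carl', 90)]
                phys = [('Hugh', 97), ('Ryan', 94), ('Ann', 92), ('Kurt', 91), ('Jane', 89)]
                comp = [('Kurt', 96), ('Ann', 95), ('Jane', 95), ('Ben', 93), ('Hugh', 92)]
                print(nra([math, phys, comp], k=3))
                # -> [('Ann', 285), ('Hugh', 280), ('Kurt', 280)]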






             Comparing NRA with TA

             • Number of sequential accesses
                   NRA must do sequential accesses at least down to the deepest
                   position at which a top-k object appears in any list
             • Number of random accesses
                   NRA obviously makes 0 random accesses
             • Size of buffers used
                   TA remembers k objects with their k scores, and the threshold
                   value τ
                   NRA remembers all viable objects with their scores on all seen
                   attributes, plus the last-seen value of every attribute




             How deep can NRA go?
                     Ann    98            Hugh        97      Kurt    96
                    Hugh    97            Kurt        96      Ann     95
                     Ben    60            Ryan        60      Jane    60
                    Ryan    60            Ben         60      Ben     60
                     Carl   60            Jane        60      Carl    60
                      ...    ...           ...        ...      ...    ...
                     Jane   60            Carl        60     Ryan     60
                     Kurt    0            Ann         0      Hugh     0


                The set T can be identified quickly, but the scores of its members only
                become certain at the very end of the lists
                If a relatively small number of random accesses is allowed, scanning the
                entire lists can be avoided








             Outline

                 Two well-known top-k algorithms
                   Fagin's Algorithm (FA)
                   The Threshold Algorithm (TA)
                 Take random access into consideration
                   No Random Access Algorithm (NRA)
                   The Combined Algorithm (CA)




             The Combined Algorithm (I)

                CA combines TA and NRA
                cR: the cost of a random access
                cS: the cost of a sequential access
                h = ⌊cR / cS⌋ (the floor of cR/cS)
                Run NRA, but every h steps perform random accesses, as in TA
                h = ∞ → random accesses are never done, and CA degenerates to NRA








             The Combined Algorithm (II)
                     Ann    98            Hugh        97     Kurt   96
                    Hugh    97            Kurt        96     Ann    95
                     Ben    60            Ryan        60    Jane    60
                    Ryan    60            Ben         60     Ben    60
                     Carl   60            Jane        60     Carl   60
                      ...    ...           ...        ...     ...   ...
                     Jane   60            Carl        60    Ryan    60
                     Kurt    0            Ann         0     Hugh     0



                Random accesses for Ann, Hugh and Kurt quickly determine their exact scores




             The Combined Algorithm (III)

                In CA, by doing random accesses, we wish to either
                   Confirm that an object is a top-k result, or
                   Prune a viable object
                As the number of random accesses in CA is limited, various heuristics
                can be used to optimize CA in terms of total cost (one such heuristic is
                sketched below)
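
              A compact sketch of the CA schedule, under the same assumptions as the NRA sketch earlier;
              lookup(obj, i) stands for one random access returning obj's score on list i, h = ⌊cR/cS⌋ as
              above, and "resolve the incomplete object with the best upper bound" is just one possible
              heuristic (all names are illustrative).

                  import math

                  def ca(lists, k, lookup, c_r, c_s):
                      m = len(lists)
                      h = max(1, math.floor(c_r / c_s))          # one batch of random accesses every h rounds
                      seen = {}                                  # object_id -> {list_index: score}
                      for depth in range(max(len(lst) for lst in lists)):
                          for i, lst in enumerate(lists):        # sequential accesses, exactly as in NRA
                              if depth < len(lst):
                                  obj, score = lst[depth]
                                  seen.setdefault(obj, {})[i] = score
                          last = [lst[min(depth, len(lst) - 1)][1] for lst in lists]

                          if (depth + 1) % h == 0:               # every h rounds: spend some random accesses
                              # heuristic: fully resolve the incomplete object with the best upper bound
                              incomplete = [o for o, s in seen.items() if len(s) < m]
                              if incomplete:
                                  best = max(incomplete,
                                             key=lambda o: sum(seen[o].get(i, last[i]) for i in range(m)))
                                  for i in range(m):
                                      if i not in seen[best]:
                                          seen[best][i] = lookup(best, i)

                          W = {o: sum(s.values()) for o, s in seen.items()}
                          B = {o: sum(s.get(i, last[i]) for i in range(m)) for o, s in seen.items()}
                          if len(W) >= k:
                              ranked = sorted(W, key=W.get, reverse=True)
                              T, M = ranked[:k], W[ranked[k - 1]]
                              if all(B[o] < M for o in ranked[k:]) and sum(last) < M:
                                  return T
                      return sorted(seen, key=lambda o: sum(seen[o].values()), reverse=True)[:k]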








             Reference
              • Ronald Fagin, Amnon Lotem, Moni Naor: Optimal
                aggregation algorithms for middleware. J. Comput. Syst.
                Sci. 66(4): 614-656 (2003)








               Mining and Searching Complex
                        Structures
                          High Dimensional Data
                            Anthony K. H. Tung(鄧锦浩)
                                School of Computing
                           National University of Singapore
                            www.comp.nus.edu.sg/~atung




      Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
      Social Network Link: http://www.renren.com/profile.do?id=313870900




         Outline

        • Sources of HDD
        • Challenges of HDD
        • Searching and Mining Mixed-Type Data
          –Similarity Function on k-n-match
          –ItCompress
        • Bregman Divergence: Towards Similarity Search on Non-metric
          Distance
        • Earth Mover Distance: Similarity Search on Probabilistic Data
        • Finding Patterns in High Dimensional Data








          Sources of High Dimensional Data

            •   Microarray gene expression
            •   Text documents
            •   Images
            •   Features of Sequences, Trees and Graphs
            •   Audio, Video, Human Motion Database (spatio-
                temporal as well!)








          Challenges of High Dimensional Data
           • Indistinguishability
             –The distance between the two nearest points and the two furthest
             points could be almost the same
           • Sparsity
             –As a result of the above, the data distribution is very sparse, giving
             no obvious indication of where the interesting knowledge is
           • Large number of combinations
             –Efficiency: How do we test the huge number of dimension combinations?
             –Effectiveness: How do we understand and interpret so many
             combinations?







                Outline

        • Sources of HDD
        • Challenges of HDD
        • Searching and Mining Mixed-Type Data
          –Similarity Function on k-n-match
          –ItCompress
        • Bregman Divergence: Towards Similarity Search on Non-metric
          Distance
        • Earth Mover Distance: Similarity Search on Probabilistic Data
        • Finding Patterns in High Dimensional Data








                  Similarity Search : Traditional Approach

            •   Objects represented by multidimensional vectors




                Elevation         Aspect       Slope   Hillshade (9am)       Hillshade (noon)         Hillshade (3pm)    …

                  2596             51           3            221                    232                    148
                   …


            •   The traditional approach to similarity search: kNN query
                       Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
                 ID         d1          d2       d3    d4       d5          d6       d7         d8         d9      d10   Dist

                 P1         1.1            1    1.2    1.6     1.1          1.6      1.2        1.2        1        1    0.93

                 P2         1.4         1.4     1.4    1.5     1.4          1        1.2        1.2        1        1    0.98
                 P3         1              1     1      1          1        1        2          1          2        2    1.73
                 P4         20          20       21    20       22          20       20         19         20      20    57.7

                 P5         19          21       20    20       20          21       18         20         22      20    60.5

                 P6         21          21       18    19       20          19       21         20         20      20    59.8
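
              For reference, a minimal sketch of this traditional kNN query, assuming points are plain
              tuples of numbers (the function name knn is illustrative):

                  import math

                  def knn(db, q, k):
                      """The k points of db closest to q under Euclidean distance."""
                      dist = lambda p: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
                      return sorted(db, key=dist)[:k]

                  # With Q = (1, ..., 1), P1 from the table above is at distance about 0.93.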





             Deficiencies of the Traditional Approach

             •     Deficiencies
                   –Distance is affected by a few dimensions with high dissimilarity
                   –Partial similarities can not be discovered

             •     The traditional approach to similarity search: kNN query
                            Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)


              ID    d1    d2       d3    d4    d5    d6       d7    d8       d9   d10   Dist

              P1    1.1   1→100    1.2   1.6   1.1   1.6      1.2   1.2      1    1     0.93→99.0
              P2    1.4   1.4      1.4   1.5   1.4   1→100    1.2   1.2      1    1     0.98→99.0
              P3    1     1        1     1     1     1        2     1→100    2    2     1.73→99.0
              P4    20    20       21    20    22    20       20    19       20   20    57.7
              P5    19    21       20    20    20    21       18    20       22   20    60.5
              P6    21    21       18    19    20    19       21    20       20   20    59.8








            Thoughts

         • Aggregating so many per-dimension differences into a single value results in
           too much information loss. Can we reduce that loss?
         • While high dimensional data typically causes problems for similarity search,
           can we turn what is against us into an advantage?
         • Our approach: since we have so many dimensions, we can compute more complex
           statistics over these dimensions to overcome some of the “noise” introduced
           by the scaling of dimensions, outliers, etc.








                        The N-Match Query : Warm-Up

               •    Description
                    –Matches between two objects in n dimensions. (n ≤ d)
                    –The n dimensions are chosen dynamically to make the two objects match best.

               •    How to define a “match”
                    –Exact match
                    –Match with tolerance δ

               •    The similarity search example
                            Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
                                                                           n=6

                       ID    d1    d2       d3    d4    d5    d6       d7    d8       d9   d10   Dist (n = 6)

                       P1    1.1   1→100    1.2   1.6   1.1   1.6      1.2   1.2      1    1     0.2
                       P2    1.4   1.4      1.4   1.5   1.4   1→100    1.2   1.2      1    1     0.4
                       P3    1     1        1     1     1     1        2     1→100    2    2     0
                       P4    20    20       21    20    22    20       20    19       20   20    19
                       P5    19    21       20    20    20    21       18    20       22   20    19
                       P6    21    21       18    19    20    19       21    20       20   20    19





                   The N-Match Query : The Definition

            •       The n-match difference
                    Given two d-dimensional points P(p1, p2, …, pd) and Q(q1, q2, …, qd), let δi
                    = |pi - qi|, i=1,…,d. Sort the array {δ1 , …, δd} in increasing order and let
                    the sorted array be {δ1’, …, δd’}. Then δn’ is the n-match difference
                    between P and Q.

            •       The n-match query
                    Given a d-dimensional database DB, a query point Q and an
                    integer n (n≤d), find the point P ∈ DB that has the smallest
                    n-match difference to Q. P is called the n-match of Q.

                    [Figure: a 2-D example with points A–E around the query Q;
                    the 1-match of Q is A and the 2-match of Q is B.]

            •       The similarity search example
                            Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),  n = 6

                       ID    d1    d2       d3    d4    d5    d6       d7    d8       d9   d10   Dist (n = 6)

                       P1    1.1   1→100    1.2   1.6   1.1   1.6      1.2   1.2      1    1     0.2
                       P2    1.4   1.4      1.4   1.5   1.4   1→100    1.2   1.2      1    1     0.4
                       P3    1     1        1     1     1     1        2     1→100    2    2     0
                       P4    20    20       21    20    22    20       20    19       20   20    19
                       P5    19    21       20    20    20    21       18    20       22   20    19
                       P6    21    21       18    19    20    19       21    20       20   20    19
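
              A small sketch of the n-match difference and a linear-scan n-match query that follows the
              definition above (function names are illustrative):

                  def n_match_difference(p, q, n):
                      """δn’: the n-th smallest coordinate-wise absolute difference (n is 1-based)."""
                      deltas = sorted(abs(pi - qi) for pi, qi in zip(p, q))
                      return deltas[n - 1]

                  def n_match(db, q, n):
                      """The point of db with the smallest n-match difference to q."""
                      return min(db, key=lambda p: n_match_difference(p, q, n))

                  # Example from the table above: with Q = (1, ..., 1) and n = 6,
                  # P3 = (1, 1, 1, 1, 1, 1, 2, 100, 2, 2) has 6-match difference 0.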





          The N-Match Query : Extensions
          •       The k-n-match query
                  Given a d-dimensional database DB, a query point Q, an integer k, and an
                  integer n, find a set S consisting of k points from DB so that for any
                  point P1 ∈ S and any point P2 ∈ DB-S, P1’s n-match difference is smaller
                  than P2’s n-match difference. S is called the k-n-match of Q.

          •       The frequent k-n-match query
                  Given a d-dimensional database DB, a query point Q, an integer k, and an
                  integer range [n0, n1] within [1,d], let S0, …, Si be the answer sets of
                  the k-n0-match, …, k-n1-match queries, respectively. Find a set T of k
                  points so that for any point P1 ∈ T and any point P2 ∈ DB-T, P1’s number
                  of appearances in S0, …, Si is larger than or equal to P2’s number of
                  appearances in S0, …, Si.

                  [Figure: the same 2-D example; the 2-1-match of Q is {A, D} and the
                  2-2-match of Q is {A, B}.]

          •       The similarity search example
                          Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ),  n = 6

                       ID    d1    d2       d3    d4    d5    d6       d7    d8       d9   d10   Dist (n = 6)

                       P1    1.1   1→100    1.2   1.6   1.1   1.6      1.2   1.2      1    1     0.2
                       P2    1.4   1.4      1.4   1.5   1.4   1→100    1.2   1.2      1    1     0.4
                       P3    1     1        1     1     1     1        2     1→100    2    2     0
                       P4    20    20       21    20    22    20       20    19       20   20    19
                       P5    19    21       20    20    20    21       18    20       22   20    19
                       P6    21    21       18    19    20    19       21    20       20   20    19
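
              A sketch of both query types by linear scan, under the same assumptions as the previous
              sketch; the frequent variant simply counts appearances across the answer sets for n in
              [n0, n1] (all names are illustrative):

                  from collections import Counter

                  def n_match_difference(p, q, n):
                      return sorted(abs(pi - qi) for pi, qi in zip(p, q))[n - 1]

                  def k_n_match(db, q, k, n):
                      """The k points of db with the smallest n-match difference to q."""
                      return sorted(db, key=lambda p: n_match_difference(p, q, n))[:k]

                  def frequent_k_n_match(db, q, k, n0, n1):
                      """The k points appearing most often in the k-n-match answers for n = n0..n1."""
                      counts = Counter()
                      for n in range(n0, n1 + 1):
                          counts.update(tuple(p) for p in k_n_match(db, q, k, n))
                      return [list(p) for p, _ in counts.most_common(k)]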




                 Cost Model

             •      The multiple system information retrieval model
                   –Objects are stored in different systems and scored by each system
                   –Each system can sort the objects according to their scores
                  –A query retrieves the scores of objects from the different systems and then combines them
                  using some aggregation function
                                         Q : color=“red” & shape=“round” & texture=“cloud”




                     System 1: Color              System 2: Shape              System 3: Texture

                     Object ID    Score           Object ID    Score           Object ID    Score
                         1         0.4                1         1.0                1         1.0
                         2         2.8                5         1.5                2         2.0
                         5         3.5                2         5.5                3         5.0
                         3         6.5                3         7.8                5         8.0
                         4         9.0                4         9.0                4         9.0


            •      The cost
                  –Retrieving scores: proportional to the number of scores retrieved

            •      The goal
                  –To minimize the number of scores retrieved





                  The AD Algorithm

         •       The AD algorithm for the k-n-match query
                 –Locate the query’s attribute value in every dimension
                 –Retrieve the objects’ attribute values outward from the query’s value in both directions
                 –The objects’ attributes are retrieved in Ascending order of their Differences to the
                 query’s attributes; an object becomes an n-match as soon as it has appeared n times
                 (a sketch follows the worked example below)

                                      Q = ( 3.0 , 7.0 , 4.0 )        find the 2-2-match of Q

                        d1 (Color)                  d2 (Shape)                  d3 (Texture)
                   Object ID    Attr           Object ID    Attr           Object ID    Attr
                       1         0.4               1         1.0               1         1.0
                       2         2.8               5         1.5               2         2.0
                       5         3.5               2         5.5               3         5.0
                       3         6.5               3         7.8               5         8.0
                       4         9.0               4         9.0               4         9.0
                   (the query values 3.0, 7.0 and 4.0 fall between 2.8/3.5, 5.5/7.8 and 2.0/5.0
                   on d1, d2 and d3 respectively)

                 Auxiliary structures
                       Next attribute to retrieve g[2d], as (object, difference) cursors moving away
                       from the query value in both directions:
                           d1:  downwards (2, 0.2) → (1, 2.6)      upwards (5, 0.5) → (3, 3.5)
                           d2:  downwards (2, 1.5)                 upwards (3, 0.8) → (4, 2.0)
                           d3:  downwards (2, 2.0)                 upwards (3, 1.0) → (5, 4.0)

                       Number of appearances appear[c]
                           object:    1      2      3      4      5
                           count:     0     0→2    0→2     0     0→1

                       Answer set S = { 3 } → { 2 , 3 }
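
              A sketch of the AD algorithm for the in-memory case, following the steps above; a heap plays
              the role of the g[2d] cursor array, and the names below are illustrative assumptions rather
              than the paper's exact data structures.

                  import bisect, heapq

                  def ad_k_n_match(points, q, k, n):
                      """points: dict object_id -> tuple of attribute values; q: query tuple."""
                      d = len(q)
                      # one ascending list of (attribute value, object_id) per dimension
                      dims = [sorted((p[i], oid) for oid, p in points.items()) for i in range(d)]
                      heap = []                                    # the 2d cursors, ordered by difference
                      for i, col in enumerate(dims):
                          values = [v for v, _ in col]
                          pos = bisect.bisect_right(values, q[i])  # first attribute value > q[i]
                          for direction, idx in ((-1, pos - 1), (+1, pos)):
                              if 0 <= idx < len(col):
                                  heapq.heappush(heap, (abs(col[idx][0] - q[i]), i, direction, idx))
                      appear = {oid: 0 for oid in points}          # appear[c]
                      answer = []
                      while heap and len(answer) < k:
                          diff, i, direction, idx = heapq.heappop(heap)   # smallest difference first
                          oid = dims[i][idx][1]
                          appear[oid] += 1
                          if appear[oid] == n:                     # n-th appearance => next n-match
                              answer.append(oid)
                          nxt = idx + direction
                          if 0 <= nxt < len(dims[i]):              # advance this cursor away from q[i]
                              heapq.heappush(heap, (abs(dims[i][nxt][0] - q[i]), i, direction, nxt))
                      return answer

                  # For the worked example above:
                  # ad_k_n_match({1: (0.4, 1.0, 1.0), 2: (2.8, 5.5, 2.0), 3: (6.5, 7.8, 5.0),
                  #               4: (9.0, 9.0, 9.0), 5: (3.5, 1.5, 8.0)}, (3.0, 7.0, 4.0), k=2, n=2)
                  # returns [3, 2], matching the answer set S.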





                   The AD Algorithm : Extensions

              •     The AD algorithm for the frequent k-n-match query
                    –The frequent k-n-match query
                       • Given an integer range [n0, n1], find k-n0-match, k-(n0+1)-match, ... , k-n1-
                       match of the query, S0, S1, ... , Si.
                       • Find k objects that appear most frequently in S0, S1, ... , Si.

                    –Retrieve the same number of attributes as processing a k-n1-match query.

              •     Disk based solutions for the (frequent) k-n-match query

                    –Disk based AD algorithm
                        • Sort each dimension and store them sequentially on the disk
                        • When reaching the end of a disk page, read the next page from disk

                    –Existing indexing techniques
                       • Tree-like structures: R-trees, k-d-trees
                       • Mapping based indexing: space-filling curves, iDistance
                       • Sequential scan
                       • Compression based approach (VA-file)








                     Experiments : Effectiveness
           •   Searching by k-n-match
               –COIL-100 database
               –54 features extracted, such as color histograms, area moments

                                k-n-match query, k=4                     kNN query
                                n      Images returned                   k      Images returned
                                5       36, 42, 78, 94                   10     13, 35, 36, 40, 42,
                                10      27, 35, 42, 78                          64, 85, 88, 94, 96
                                15       3, 38, 42, 78
                                20      27, 38, 42, 78
                                25      35, 40, 42, 94
                                30      10, 35, 42, 94
                                35      35, 42, 94, 96
                                40      35, 42, 94, 96
                                45      35, 42, 94, 96
                                50      35, 42, 94, 96

                 Searching by frequent k-n-match
                       UCI Machine Learning Repository
                       Competitors: IGrid; Human-Computer Interactive NN search (HCINN)

                       Data set (d)         IGrid    HCINN    Freq. k-n-match
                       Ionosphere (34)      80.1%    86%      87.5%
                       Segmentation (19)    79.9%    83%      87.3%
                       Wdbc (30)            87.1%    N.A.     92.5%
                       Glass (9)            58.6%    N.A.     67.8%
                       Iris (4)             88.9%    N.A.     89.6%




                     Experiments : Efficiency
            •   Disk-based algorithms for the frequent k-n-match query
               –Texture dataset (68,040 records); uniform dataset (100,000 records)
               –Competitors:
                   • The AD algorithm
                   • VA-file
                   • Sequential scan








                      Experiments : Efficiency (continued)
           •   Comparison with other similarity search techniques
               –Texture dataset ; synthetic dataset
               –Competitors:
                   • Frequent k-n-match query using the AD algorithm
                   • IGrid
                    • Sequential scan








                 Future Work(I)
       • We now have a natural way to handle similarity search for data with both
         categorical and numerical attributes. Investigating the performance of
         k-n-match on such mixed-type data is currently under way
       • Likewise, applying k-n-match to data with missing or uncertain attributes
         will be interesting
      • Query={1,1,1,1,1,1,1,M,No,R}
                 ID      d1      d2      d3      d4     d5        d6       d7   d8    d9   d10

                 P1      1.1     1      1.2     1.6     1.1       1.6     1.2   M    Yes   R

                 P2      1.4     1.4    1.4     1.5     1.4       1       1.2   F    No    B

                 P3      1       1       1       1       1        1        2    M    No    B

                 P4      20      20      21      20     22        20       20   M    Yes   G

                 P5      19      21      20      20     20        21       18   F    Yes   R

                 P6      21      21      18      19     20        19       21   F    Yes   Y





                Future Work(I)

       • We now have a natural way to handle similarity search for data with both
         categorical and numerical attributes. Investigating the performance of
         k-n-match on such mixed-type data is currently under way
       • Likewise, applying k-n-match to data with missing or uncertain attributes
         will be interesting
       • Query={1,1,1,1,1,1,1,M,No,R} (below, the same example table with some
         attribute values missing)
                ID   d1    d2    d3     d4     d5        d6       d7   d8    d9   d10

                P1         1     1.2    1.6    1.1       1.6     1.2   M          R

                P2   1.4   1.4          1.5              1       1.2   F    No    B

                P3    1    1      1      1      1                 2    M    No    B

                P4   20    20           20     22        20       20   M          G

                P5   19    21    20     20     20                 18        Yes   R

                P6   21          18            20                 21   F    Yes   Y





               Future Work(II)
         • In general, three things affect the result of a similarity search: noise,
           scaling and axes orientation. k-n-match reduces the effect of noise; the
           ultimate aim is a similarity function that is robust to all three
         • Eventually we will look at building mining algorithms on top of k-n-match








               Outline

        • Sources of HDD
        • Challenges of HDD
        • Searching and Mining Mixed-Type Data
          –Similarity Function on k-n-match
          –ItCompress
        • Bregman Divergence: Towards Similarity Search on Non-metric
          Distance
        • Earth Mover Distance: Similarity Search on Probabilistic Data
        • Finding Patterns in High Dimensional Data








              Motivation


                                                 query
               Large
                                                  results
              Data Sets



                    Ever-increasing data collection rates of modern
                    enterprises and the need for effective, guaranteed-
                    quality approximate answers to queries
                    Concern: compress as much as possible.






        Conventional Compression Method
           • Try to find the optimal encoding of arbitrary strings for
             the input data:
             –Huffman Coding
             –Lempel-Ziv Coding (gzip)
           • View the whole table as a large byte string
           • Statistical or dictionary based
           • Operate at the byte level








              Why not just “syntactic”?

         • Do not exploit the complex dependency patterns in the table
         • Individual retrieval of tuples is difficult
         • Do not utilize lossy compression








           Semantic compression methods

         • Derive a descriptive model M
         • Identify which data values can be derived from M (within some error
           tolerance), which values are essential for the derivation, and which are
           outliers
         • Derived values need not be stored; only the outliers (together with the
           model) need to be








            Advantages
        • More Complex Analysis
           –Example: detect correlation among columns
        • Fast Retrieval
           –Tuple-wise access
        • Query Enhancement
            –Possible to answer queries directly from the discovered semantics
            –Compress in a way that enhances the answering of some complex queries,
            e.g. “Go Green: Recycle and Reuse Frequent Patterns”, C. Gao, B. C. Ooi,
            K. L. Tan and A. K. H. Tung, ICDE 2004

            Choose a combination of compression methods
            based on semantic and syntactic information




          Fascicles
          • Key observation
            –Often, numerous subsets of records in T have similar values for
            many attributes


             Protocol   Duration   Bytes   Packets
               http        12        20K      3
               http        16        24K      5
               http        15        20K      8
               http        19        40K     11
               http        26        58K     18
               ftp         27       100K     24
               ftp         32       300K     35
               ftp         18        80K     15

             • Compress data by storing representative values (e.g., a “centroid”)
               only once for each attribute cluster
             • Lossy compression: the information loss is controlled by the notion of
               “similar values” for attributes (user-defined)






          ItCompress: Compression Format
              Original Table                      Representative Rows (Patterns)
              age   salary   credit   sex          RRid   age   salary   credit   sex
              20    30k      poor     M              1     30    90k      good     F
              25    76k      good     F              2     70    35k      poor     M
              30    90k      good     F
              40    100k     poor     M             Compressed Table
              50    110k     good     F              RRid   bitmap   Outlying value
              60    50k      good     M                2     0111     20
              70    35k      poor     F                1     1111
              75    15k      poor     M                1     1111
                                                       1     0100     40, poor, M
           Error Tolerance:                            1     0111     50
              age   salary   credit   sex              1     0010     60, 50k, M
               5    25k      0        0                2     1110     F
                                                       2     1111
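
           A tiny sketch of how a compressed row is reconstructed from its representative row, bitmap and
           outlying values, following the layout above (decompress_row is an illustrative name):

               def decompress_row(rr, bitmap, outliers):
                   """bitmap[i] == '1' means attribute i is taken from the representative row rr."""
                   out = iter(outliers)
                   return [rr[i] if bit == '1' else next(out) for i, bit in enumerate(bitmap)]

               rr1 = [30, '90k', 'good', 'F']
               rr2 = [70, '35k', 'poor', 'M']
               print(decompress_row(rr2, '0111', [20]))               # [20, '35k', 'poor', 'M']
               print(decompress_row(rr1, '0100', [40, 'poor', 'M']))  # [40, '90k', 'poor', 'M']
               # The reconstructed values may differ from the originals (30k vs 35k, 100k vs 90k),
               # but only within the stated error tolerance -- this is the lossy part.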




          Some definitions

          • Error tolerance
            –Numeric attributes
               • The upper bound on how much x’ can differ from x
                • x ∈ [ x’-ei, x’+ei ]
             –Categorical attributes
                • The upper bound on the probability that the compressed
                value differs from actual value
                • Given an actual value x and its error tolerance ei, the
                compressed value x’ should satisfy: Prob( x=x’ ) ≥ 1 - ei








          Some definitions

         • Coverage
            –Let R be a row in the table T, and Pi be a pattern
            –The coverage of Pi on R :
                 cov( Pi , R ) = the number of attributes Xi for which
                 R[ Xi ] is matched by Pi[ Xi ]
         • Total coverage
            –Let P be a set of patterns P1,…,Pk; and the table T
            contains n rows R1,…,Rn
             –totalcov( P, T ) = ∑_{i=1..n} cov( Pmax(Ri), Ri )
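
           A direct transcription of these definitions, assuming numeric attributes match within their
           error tolerance and categorical attributes match on equality (function names are illustrative):

               def matches(value, pattern_value, tol):
                   # numeric: within the error tolerance; categorical: exact equality
                   if isinstance(value, (int, float)):
                       return abs(value - pattern_value) <= tol
                   return value == pattern_value

               def cov(P_i, R, tols):
                   """Number of attributes of row R matched by pattern P_i."""
                   return sum(matches(r, p, t) for r, p, t in zip(R, P_i, tols))

               def totalcov(P, T, tols):
                   """Sum over all rows of the coverage by the best-covering pattern Pmax(R)."""
                   return sum(max(cov(P_i, R, tols) for P_i in P) for R in T)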






          ItCompress: basic algorithm

       • First, randomly choose k rows as the initial patterns
       • Phase 1 — scan the table T:
          –For each row R, compute the coverage of each pattern on it and find Pmax(R)
          –Allocate R to the pattern that covers it most
       • Phase 2 — after each scan, re-compute every pattern’s attributes, always
         using the most frequent (best-covering) values within its group
       • Iterate until the total coverage no longer increases (a sketch of the loop
         follows below)
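
         A compact sketch of this iteration; the match rule from the earlier coverage sketch is repeated
         here so the snippet is self-contained, and Phase 2 below simply takes the most frequent value
         per attribute, a simplification of the domain value/interval search mentioned on the complexity
         slide (all names are illustrative):

             import random
             from collections import Counter

             def matches(v, p, tol):                 # same rule as the coverage sketch
                 return abs(v - p) <= tol if isinstance(v, (int, float)) else v == p

             def cov(pattern, row, tols):
                 return sum(matches(r, p, t) for r, p, t in zip(row, pattern, tols))

             def itcompress(table, k, tols, max_iters=20):
                 patterns = [list(r) for r in random.sample(table, k)]   # k random rows as initial patterns
                 best_total = -1
                 for _ in range(max_iters):
                     # Phase 1: allocate every row to the pattern that covers it most
                     groups, total = [[] for _ in range(k)], 0
                     for row in table:
                         covs = [cov(p, row, tols) for p in patterns]
                         j = max(range(k), key=covs.__getitem__)
                         groups[j].append(row)
                         total += covs[j]
                     if total <= best_total:          # total coverage no longer increases
                         break
                     best_total = total
                     # Phase 2: re-compute each pattern attribute as the most frequent value in its group
                     for j, rows in enumerate(groups):
                         if rows:
                             patterns[j] = [Counter(col).most_common(1)[0][0] for col in zip(*rows)]
                 return patterns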







            Example: the 1st iteration begins


            age salary credit       sex                 RRid age salary      credit   sex
            20    30k       poor    M                    1      20    30k    poor     M
            25    76k       good     F                   2      25    76k    good     F
            30    90k       good     F
            40   100k       poor    M
            50   110k       good     F
            60    50k       good    M
            70    35k       poor     F
            75    15k       poor    M
         Error Tolerance:
            age salary credit       sex
              5    25k        0       0





              Example: Phase 1
            Original Table                      Representative Rows (Patterns)
            age   salary   credit   sex          RRid   age   salary   credit   sex
            20    30k      poor     M              1     20    30k      poor     M
            25    76k      good     F              2     25    76k      good     F
            30    90k      good     F
            40    100k     poor     M             Rows allocated to pattern 1
            50    110k     good     F              age   salary   credit   sex
            60    50k      good     M              20    30k      poor     M
            70    35k      poor     F              40    100k     poor     M
            75    15k      poor     M              60    50k      good     M
                                                   70    35k      poor     F
         Error Tolerance:                          75    15k      poor     M
            age   salary   credit   sex
             5    25k      0        0             Rows allocated to pattern 2
                                                   age   salary   credit   sex
                                                   25    76k      good     F
                                                   30    90k      good     F
                                                   50    110k     good     F




                 Example: Phase 2
            Original Table                      Representative Rows (updated in Phase 2)
            age   salary   credit   sex          RRid   age      salary     credit   sex
            20    30k      poor     M              1    20→70    30k        poor     M
            25    76k      good     F              2    25       76k→90k    good     F
            30    90k      good     F
            40    100k     poor     M             Rows allocated to pattern 1
            50    110k     good     F              age   salary   credit   sex
            60    50k      good     M              20    30k      poor     M
            70    35k      poor     F              40    100k     poor     M
            75    15k      poor     M              60    50k      good     M
                                                   70    35k      poor     F
         Error Tolerance:                          75    15k      poor     M
            age   salary   credit   sex
             5    25k      0        0             Rows allocated to pattern 2
                                                   age   salary   credit   sex
                                                   25    76k      good     F
                                                   30    90k      good     F
                                                   50    110k     good     F





          Convergence(I)
           • Phase 1:
             –When we assign each row to the pattern that covers it most:
                 • For each row, the coverage increases or stays the same,
                   so the total coverage also increases or stays the same
           • Phase 2:
             –When we re-compute the attribute values of the patterns:
                 • For each pattern, the coverage increases or stays the same,
                   so the total coverage also increases or stays the same








          Convergence(II)
            • In both Phase 1 and Phase 2, the total coverage either increases or
              stays the same, and it has an obvious upper bound (covering the
              whole table)

               Therefore the algorithm will eventually converge








          Complexity
          • Phase 1:
            –In l iterations, we go through the n rows of the table and match each
            row against the k patterns (at most 2m comparisons per pattern)
               The running time is O(kmnl), where m is the number of attributes
          • Phase 2:
            –Computing each new pattern Pi requires going through all the domain
            values/intervals of each attribute
               Assuming the total number of domain values/intervals is d, the
            running time is O(kdl)

             The total time complexity is O(kmnl + kdl)







          Advantages of ItCompress
          • Simplicity and directness
            –Fascicles and SPARTAN use a two-phase process:
                • Find rules/patterns
                • Compress the database using the discovered rules/patterns
            –ItCompress optimizes the compression directly, without finding
            rules/patterns that may not be useful (a.k.a. the microeconomic approach)
          • Fewer constraints
            –Does not need patterns to be matched completely or rules that apply
            globally
          • Easily tuned parameters








          Performance Comparison
      • Algorithms
         –ItCompress, ItCompress+gzip
         –Fascicles, Fascicles+gzip
         –SPARTAN+gzip
      • Platform
          –ItCompress, Fascicles: AMD Duron 700MHz, 256MB memory
          –SPARTAN: four 700MHz Pentium CPUs, 1GB memory
      • Datasets
         –Corel: 32 numeric attributes, 35000 rows, 10.5MB
         –Census: 7 numeric, 7 categorical, 676000 rows, 28.6MB
         –Forest-cover: 10 numeric, 44 categorical, 581000 rows, 75.2MB





          Effectiveness (Corel)








          Effectiveness (Census)








          Effectiveness (Forest Cover)








          Efficiency








               Varying k








          Varying Sample Ratio








           Adding Noise (Census)








             Effect of Corruption
             [Chart over attributes A1–A12; annotation: “20% Corruption?”]










          Findings
           •    ItCompress is
               –More efficient than SPARTAN
               –More effective than Fascicles
               –Insensitive to parameter setting
                –Robust to noise








          Future work

               • Can we perform mining on the compressed datasets using
                 only the patterns and the bitmap ?
                 –Example: Building Bayesian Belief Network
               • Is ItCompress a good “bootstrap” semantic compression
                 algorithm ?


                   [Diagram: database → ItCompress → compressed database; other semantic
                   compression algorithms can then be applied on top (the “bootstrap” idea above).]








         Outline

        • Sources of HDD
        • Challenges of HDD
        • Searching and Mining Mixed-Type Data
          –Similarity Function on k-n-match
          –ItCompress
        • Bregman Divergence: Towards Similarity Search on Non-metric
          Distance
        • Earth Mover Distance: Similarity Search on Probabilistic Data
        • Finding Patterns in High Dimensional Data








           Metric vs. Non-Metric
           • Euclidean distance dominates DB queries
           • Similarity in human perception




           • Metric distance is not enough!




           2010-7-31           Mining and Searching Complex Structures                52




                                               59
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Bregman Divergence

                      Figure: a convex function f(x) with the points (q,f(q)) and (p,f(p)) and a
                      reference line h. The Bregman divergence Df(p,q) is the vertical gap at p
                      between f(p) and the tangent of f at q, while the horizontal gap between
                      q and p is the Euclidean distance.




           2010-7-31                       Mining and Searching Complex Structures                  53




          Bregman Divergence
           • Mathematical Interpretation
             –The distance between p and q is defined as the difference
             between f(p) and the first order Taylor expansion at q




                      Figure: f(x) at p versus the first-order Taylor expansion of f at q.
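           In symbols (the standard definition this picture illustrates, with f strictly convex and
           differentiable):

              D_f(p, q) \;=\; f(p) \;-\; \big( f(q) + \langle \nabla f(q),\; p - q \rangle \big)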




           2010-7-31                       Mining and Searching Complex Structures                  54




                                                            60
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Bregman Divergence
           • General Properties
             –Non-Negativity
                        • Df(p,q)≥0 for any p, q
                –Identity of Indiscernibles
                        • Df(p,p)=0 for any p
                –Symmetry and Triangle Inequality
                        • Do NOT hold any more




           2010-7-31                    Mining and Searching Complex Structures                      55




          Examples

                         Distance              f(x)                 Df(p,q)                Usage

                      KL-Divergence           x log x              p log(p/q)          distributions,
                                                                                       color histograms
                      Itakura-Saito           -log x          p/q - log(p/q) - 1       signals, speech
                      Distance
                      Squared                   x2                  (p-q)2              Euclidean space
                      Euclidean
                      Von Neumann        tr(X log X - X)      tr(X log X - X log Y     symmetric matrices
                      Entropy                                      - X + Y)
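           As a quick check of the table (a minimal Python sketch added for illustration, not part of
           the original slides), a one-dimensional Bregman divergence can be computed directly from
           its generator f and reproduces the KL and Itakura-Saito rows:

              import math

              def bregman_1d(f, grad_f, p, q):
                  # D_f(p, q) = f(p) - f(q) - f'(q) * (p - q), applied per coordinate
                  return f(p) - f(q) - grad_f(q) * (p - q)

              # KL-divergence row: f(x) = x log x, f'(x) = log x + 1
              kl = lambda p, q: bregman_1d(lambda x: x * math.log(x),
                                           lambda x: math.log(x) + 1.0, p, q)

              # Itakura-Saito row: f(x) = -log x, f'(x) = -1/x
              itakura_saito = lambda p, q: bregman_1d(lambda x: -math.log(x),
                                                      lambda x: -1.0 / x, p, q)

              p, q = 0.4, 0.6
              print(kl(p, q))             # = p*log(p/q) - p + q; the extra -p + q cancels
                                          #   for normalized histograms, leaving p*log(p/q)
              print(itakura_saito(p, q))  # = p/q - log(p/q) - 1, as in the table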




           2010-7-31                    Mining and Searching Complex Structures                      56




                                                          61
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Why in DB system?
           • Database application
             –Retrieval of similar images, speech signals, or time series
             –Optimization on matrices in machine learning
             –Efficiency is important!
           • Query Types
             –Nearest Neighbor Query
             –Range Query




           2010-7-31           Mining and Searching Complex Structures                57




          Euclidean Space
           • How to answer the queries
             –R-Tree




           2010-7-31           Mining and Searching Complex Structures                58




                                               62
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Euclidean Space
           • How to answer the queries
             –VA File




           2010-7-31           Mining and Searching Complex Structures                59




          Our goal
           • Re-use the infrastructure of existing DB system to support
             Bregman divergence
             –Storage management
             –Indexing structures
             –Query processing algorithms




           2010-7-31           Mining and Searching Complex Structures                60




                                               63
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Basic Solution
           • Extended Space
             –Convex function f(x) = x2



           point       D1      D2                     point        D1     D2      D3

               p       0       1                        p+          0     1        1

               q       0.5     0.5                      q+         0.5    0.5     0.5

               r       1       0.8                       r+         1     0.8    1.64

               t       1.5     0.3                       t+        1.5    0.3    3.15
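           A minimal sketch of this mapping (assuming, as the f(x) = x2 example suggests, that the
           appended coordinate stores the sum of squared coordinates of each point; illustrative
           only, not the original implementation):

              def extend(point):
                  # Append f(x) = sum of squared coordinates as an extra dimension
                  return tuple(point) + (sum(v * v for v in point),)

              db = {"p": (0.0, 1.0), "q": (0.5, 0.5), "r": (1.0, 0.8), "t": (1.5, 0.3)}
              extended = {name: extend(x) for name, x in db.items()}
              print(extended["r"])   # approximately (1.0, 0.8, 1.64); the extended points can
                                     # then be indexed by an ordinary R-Tree or VA File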




           2010-7-31            Mining and Searching Complex Structures                 61




          Basic Solution
           • After the extension
             –Index extended points with R-Tree or VA File
             –Re-use existing algorithms with lower and upper bounds on
             the rectangles




           2010-7-31            Mining and Searching Complex Structures                 62




                                                64
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          How to improve?
           • Reformulation of Bregman divergence
           • Tighter bounds are derived
           • No change to the index construction or the query processing
             algorithms




           2010-7-31           Mining and Searching Complex Structures                63




          A New Formulation

                      Figure: the reformulation replaces the reference line h with a line h'
                      determined by the query vector vq; the new measure D*f(p,q) differs from
                      Df(p,q) by an offset Δ.




           2010-7-31           Mining and Searching Complex Structures                64




                                               65
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Math. Interpretation
           • Reformulation of similarity search queries
             –k-NN query: query q, data set P, divergence Df
                       • Find the point p in P that minimizes Df(p,q)

                –Range query: query q, threshold θ, data set P
                       • Return every point p in P with Df(p,q) ≤ θ




           2010-7-31                  Mining and Searching Complex Structures                65




          Naïve Bounds
           • Check the corners of the bounding rectangles




           2010-7-31                  Mining and Searching Complex Structures                66




                                                      66
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Tighter Bounds
           • Take the curve f(x) into consideration




           2010-7-31           Mining and Searching Complex Structures                67




          Query distribution
           • Distortion of rectangles
             –The difference between maximum and minimum distances
             from inside the rectangle to the query




           2010-7-31           Mining and Searching Complex Structures                68




                                               67
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Can we improve it more?
           • When Building R-Tree in Euclidean space
             –Minimize the volume/edge length of MBRs
             –Does it remain valid?




           2010-7-31           Mining and Searching Complex Structures                69




          Query distribution
           • Distortion of bounding rectangles
             –Invariant in Euclidean space (triangle inequality)
                –Query-dependent for Bregman Divergence




           2010-7-31           Mining and Searching Complex Structures                70




                                               68
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Utilize Query Distribution
           • Summarize the query distribution with O(d) real numbers
           • Estimate the expected distortion of any bounding rectangle in
             O(d) time
           • Allows better indexes to be constructed for both the R-Tree and
             the VA File




           2010-7-31                  Mining and Searching Complex Structures                71




          Experiments
           • Data Sets
             –KDD’99 data
                       • Network data: the proportion of packets over 72 different
                       TCP/IP connection types
                –DBLP data
                       • The co-authorship graph is used to generate, for each author, the
                       probabilities of being related to 8 different areas




           2010-7-31                  Mining and Searching Complex Structures                72




                                                      69
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Experiment
           • Data Sets
             –Uniform Synthetic data
                       • Generate synthetic data with uniform distribution
                –Clustered Synthetic data
                       • Generate synthetic data with Gaussian Mixture Model




           2010-7-31                  Mining and Searching Complex Structures                   73




          Experiments
           • Methods to compare


                                    Basic              Improved              Query
                                                        Bounds            Distribution
               R-Tree                 R                     R-B                 R-BQ

               VA File                V                     V-B                 V-BQ

           Linear Scan                                     LS

              BB-Tree                                      BBT




           2010-7-31                  Mining and Searching Complex Structures                   74




                                                      70
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Existing Solution
           • BB-Tree (L. Cayton, ICML 2008)
             –Memory-based indexing tree
             –Constructed with k-means clustering
             –Hard to update
             –Ineffective in high-dimensional space




           2010-7-31           Mining and Searching Complex Structures                75




          Experiments
           • Index Construction Time




           2010-7-31           Mining and Searching Complex Structures                76




                                               71
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Experiments
           • Varying dimensionality




           2010-7-31           Mining and Searching Complex Structures                77




          Experiments
           • Varying dimensionality (cont.)




           2010-7-31           Mining and Searching Complex Structures                78




                                               72
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Experiments
           • Varying data cardinality




           2010-7-31           Mining and Searching Complex Structures                79




          Conclusion
           • A general technique for similarity search under Bregman divergence
           • All techniques build on the existing infrastructure of commercial
             database systems
           • Extensive experiments comparing the R-Tree and VA File under the
             different optimizations




           2010-7-31           Mining and Searching Complex Structures                80




                                               73
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




         Outline

        • Sources of HDD
        • Challenges of HDD
        • Searching and Mining Mixed Typed Data
          –Similarity Function on k-n-match
          –ItCompress
        • Bregman Divergence: Towards Similarity Search on Non-metric
          Distance
        • Earth Mover Distance: Similarity Search on Probabilistic Data
        • Finding Patterns in High Dimensional Data




                                Mining and Searching Complex Structures




          Motivation
           • Probabilistic data is ubiquitous
             –To represent the data uncertainty (WSN, RFID, moving
             object monitoring)
             –To compress data (image processing)
           • Histograms are a good way to represent probabilistic data
             –Easy to capture
             –Very useful in image representation
                 •   Colors
                 •   Textures
                 •   Gradient
                 •   Depth




                                Mining and Searching Complex Structures




                                                74
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Motivation
           • Similarity search is important for managing probabilistic data
             –Given a threshold θ, we can answer which sensors’ readings
             are similar to sensor A’s (range query)
             –We can answer which k pictures are the most similar (top-k query)
           • The similarity function for probabilistic data should be carefully
             chosen
             –Bin by bin methods
                 • L1 and L2 norms
                 • χ2 distance
               –Cross-bin methods
                 • Earth Mover’s Distance (EMD)
                 • Quadratic form



                               Mining and Searching Complex Structures




          Outline
           •   Motivation
           •   Introduction to Earth Mover’s Distance (EMD)
           •   Related works
           •   Indexing the probabilistic data based on EMD
           •   Experimental results
           •   Conclusion and future work




                               Mining and Searching Complex Structures




                                               75
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Introduction to Earth Mover’s Dist
           • Bin by bin vs. cross bin




          Bin-by-bin
                                                                Not good!




          Cross bin
                                                                 Good!
                                                                 Can handle
                                                                 distribution shift
                               Mining and Searching Complex Structures




          Introduction to Earth Mover’s Dist
           • What is EMD?
              –Earth (soil)
              –Mover (moving)
              –Distance (cost)
              –Can be understood as the cost of moving earth from one place to another
           • See an example…




                               Mining and Searching Complex Structures




                                               76
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Moving Earth




                                          ≠

                               Mining and Searching Complex Structures




          Moving Earth




                                          ≠

                               Mining and Searching Complex Structures




                                               77
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Moving Earth




                                          =

                               Mining and Searching Complex Structures




          The Difference?



                                 (amount moved)




                                          =

                               Mining and Searching Complex Structures




                                               78
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          The Difference?



          Difference             (amount moved) * (distance moved)




                                          =

                               Mining and Searching Complex Structures




          Linear programming




      P

            m bins
                                        (distance moved) * (amount moved)


      Q                  All movements


            n bins




                               Mining and Searching Complex Structures




                                               79
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Linear programming




      P

          m clusters
                                        (distance moved) * (amount moved)


      Q

          n clusters




                               Mining and Searching Complex Structures




          Linear programming




      P

          m clusters
                                              * (amount moved)


      Q

          n clusters




                               Mining and Searching Complex Structures




                                               80
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Linear programming




      P

          m clusters



      Q

          n clusters




                               Mining and Searching Complex Structures




          Constraints


                         1. Move “earth” only from P to Q
      P

          m clusters
                                   P’

      Q

          n clusters               Q’


                               Mining and Searching Complex Structures




                                               81
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Constraints


                         2. Cannot send more “earth” than
      P                  there is

          m clusters
                                   P’

      Q

          n clusters               Q’


                               Mining and Searching Complex Structures




          Constraints


                         3. Q cannot receive more “earth”
      P                  than it can hold

          m clusters
                                   P’

      Q

          n clusters               Q’


                               Mining and Searching Complex Structures




                                               82
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Constraints


                         4. As much “earth” as possible
      P                  must be moved

          m clusters
                        P’

      Q

          n clusters    Q’


                               Mining and Searching Complex Structures




          The Formal Definition of EMD
           • Earth Mover’s Distance (EMD)
             –the minimum amount of work needed to change one
             histogram into another
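           For reference, the linear program behind the four constraints of the preceding slides
           (the standard formulation; d_ij is the ground distance between bin i of P and bin j of Q,
           and f_ij the amount of earth moved; for normalized histograms the total mass is 1 and
           the objective needs no further normalization):

              \mathrm{EMD}(P, Q) \;=\; \min_{f_{ij} \ge 0} \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}\, d_{ij}
              \text{s.t.}\quad \sum_{j} f_{ij} \le p_i, \qquad \sum_{i} f_{ij} \le q_j, \qquad
              \sum_{i} \sum_{j} f_{ij} \;=\; \min\Big( \sum_{i} p_i,\; \sum_{j} q_j \Big)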




            • Challenge of EMD
              –Computing it requires solving a linear program, costing
              O(N^3 log N) time in the number of bins N




                               Mining and Searching Complex Structures




                                               83
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




           Related Works
           •   Filter-and-refine framework
               –[1] Approximation Techniques for
               Indexing the Earth Mover's Distance in
               Multimedia Databases. ICDE 2006
                   • Cannot handle high-dimensional
                   histograms

               –[2] Efficient EMD-based Similarity
               Search in Multimedia Databases via
               Flexible Dimensionality Reduction.
               SIGMOD 2008
                   • Based on a scan framework, which
                   limits scalability
           •   Both use a scanning scheme to
               process queries
               –Merit: a good access order can be obtained
               for k-NN queries, which minimizes the number
               of candidates
               –Demerit: the whole dataset must be scanned
               to obtain that order, so the algorithms
               scale poorly



                                        Mining and Searching Complex Structures




          Related Works
           •   Related works
               –Based on the filter-and-refine framework
               –Based on scanning, and hence have low scalability
           •   Our work
               –Also based on the filter-and-refine method
               –But avoids scanning the whole data set
                   • Uses B+ trees
                   • And thus achieves high scalability
           •   Our contributions
               –To the best of our knowledge, the first work to index high-dimensional
               probabilistic data based on the EMD
               –Algorithms for processing similarity queries with a B+ tree based filter
               –Improved efficiency and scalability of EMD-based similarity search




                                        Mining and Searching Complex Structures




                                                           84
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




      Indexing the probabilistic data
      based on EMD
      •   Our intuition:
          –primal-dual theory in linear programming


      •   Primal problem (EMD)




      •   Dual problem
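       Spelled out (a sketch assuming normalized histograms, so the flow constraints are
       equalities; this is standard transportation-problem duality in the (π, Ф) notation of
       the slides):

              \text{Primal:}\quad \mathrm{EMD}(p, q) \;=\; \min_{f_{ij} \ge 0} \sum_{i,j} f_{ij}\, d_{ij}
              \quad \text{s.t.} \quad \sum_{j} f_{ij} = p_i, \qquad \sum_{i} f_{ij} = q_j

              \text{Dual:}\quad \max_{\pi, \phi} \; \sum_{i} \pi_i\, p_i \;+\; \sum_{j} \phi_j\, q_j
              \quad \text{s.t.} \quad \pi_i + \phi_j \le d_{ij} \;\; \text{for all } i, j

       By weak duality, every feasible (π, Ф) gives Σ_i π_i p_i + Σ_j Ф_j q_j ≤ EMD(p, q),
       which is exactly the lower bound exploited on the next slide.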




                                       Mining and Searching Complex Structures




                Indexing the probabilistic data based on
                EMD




             •    Good properties of the dual space
                  –The constraints of the dual space are independent of the data points (i.e., p and
                  q in this example)
                     • Thus, any feasible solution (π, Ф) in the dual space yields a
                     lower bound for EMD(p, q)
                     • The lower bound helps filter out histograms that cannot qualify
                  –Given any feasible solution (π, Ф) in the dual space, a histogram p can be
                  mapped to a single value (its key)
                     • So histograms can be indexed using a B+ tree



                                       Mining and Searching Complex Structures




                                                       85
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




      Indexing the probabilistic data based on EMD
           • 1. Mapping Construction
             –Key and counter key




                                  Key                    Counter key
               –Assuming p is a histogram in the database, given a feasible solution
               (π, Ф), we calculate the key for each record in the database
               –We can index those keys using a B+ tree
               –For each feasible solution (π, Ф), one B+ tree is
               constructed



                               Mining and Searching Complex Structures




          Answering Range Query
           • Range query based on the B+ tree index
              –Given any feasible solution (π, Ф), we construct a B+ tree
              using the keys of the histograms
              –Given a query histogram, we calculate its counter key
              –Given a similarity search threshold θ, we have proved that every
              candidate histogram's key falls within an interval determined by
              the counter key and θ


              –To further filter the candidates, we use L B+ trees and intersect
              their candidate results
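           A minimal filter-and-refine sketch of this range query (illustrative Python only; the
           tree interface, key_bounds and exact_emd are hypothetical placeholders standing in for
           the structures and formulas of the actual system):

              def emd_range_query(q, trees, duals, theta, key_bounds, exact_emd, histograms):
                  # Filter with L B+ tree range searches, then refine with the exact EMD.
                  # trees[l].range_search(lo, hi) yields ids whose key lies in [lo, hi];
                  # duals[l] is a feasible dual solution (pi_l, phi_l);
                  # key_bounds(q, pi, phi, theta) returns the proved key interval.
                  candidate_sets = []
                  for (pi, phi), tree in zip(duals, trees):
                      lo, hi = key_bounds(q, pi, phi, theta)
                      candidate_sets.append(set(tree.range_search(lo, hi)))
                  candidates = set.intersection(*candidate_sets) if candidate_sets else set()
                  # Refinement: compute the exact EMD only for records surviving every filter
                  return [r for r in candidates if exact_emd(histograms[r], q) <= theta]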



                               Mining and Searching Complex Structures




                                               86
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Answering KNN Query

                                                     •   k-NN query based on the B+ tree index
                                                         –Given a query q, we issue a search on
                                                         each B+ tree Tl with key(q, Фl)
                                                         –We create two cursors for each tree and
                                                         let them fetch records in opposite
                                                         directions (one leftward, one rightward)
                                                         –Whenever a record r has been accessed
                                                         in every B+ tree, it is output as a
                                                         candidate for the k-NN query
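           A small sketch of the two-cursor scan (illustrative Python; sorted (key, rid) lists stand
           in for the B+ tree leaf levels, and a real implementation would advance whichever cursor
           is currently closer to the query key and stop once enough candidates have been refined):

              import bisect
              from collections import defaultdict

              def knn_candidates(query_keys, trees):
                  # trees[l] is a sorted list of (key, rid); query_keys[l] is key(q, phi_l).
                  # A record id becomes a candidate once it has been fetched in every tree.
                  L, seen, out, cursors = len(trees), defaultdict(int), [], []
                  for qk, tree in zip(query_keys, trees):
                      pos = bisect.bisect_left(tree, (qk,))
                      cursors.append([pos - 1, pos])          # left and right cursor
                  active = True
                  while active:
                      active = False
                      for l, tree in enumerate(trees):
                          for side, step in ((0, -1), (1, +1)):
                              idx = cursors[l][side]
                              if 0 <= idx < len(tree):
                                  active = True
                                  rid = tree[idx][1]
                                  seen[rid] += 1
                                  if seen[rid] == L:          # accessed in all B+ trees
                                      out.append(rid)
                                  cursors[l][side] = idx + step
                  return out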




                                Mining and Searching Complex Structures




          Experimental Setup
           • 3 real data sets
             –RETINA1
                 • an image data set consisting of 3932 feline retina scans labeled
                 with various antibodies
              –IRMA
                 • contains 10000 radiography images from the Image Retrieval
                 in Medical Applications (IRMA) project
             –DBLP
           • With parameter settings




                                Mining and Searching Complex Structures




                                                87
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




        Experimental Results on
        Query CPU Time




                               Mining and Searching Complex Structures




          Experimental Results on
          Scalability


                      Figure: scalability comparison between the SIGMOD 2008 approach and our method.




                               Mining and Searching Complex Structures




                                               88
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Conclusions
           • We present a new indexing scheme for general-purpose similarity
             search on the Earth Mover's Distance
           • Our index method relies on primal-dual theory to construct mapping
             functions from the original probabilistic space to a
             one-dimensional domain
           • Our B+ tree based index framework has
             –High scalability
             –High efficiency
             –The ability to handle high-dimensional data




                               Mining and Searching Complex Structures




        Outline

        • Sources of HDD
        • Challenges of HDD
        • Searching and Mining Mixed Typed Data
          –Similarity Function on k-n-match
          –ItCompress
        • Bregman Divergence: Towards Similarity Search on Non-metric
          Distance
        • Earth Mover Distance: Similarity Search on Probabilistic Data
        • Finding Patterns in High Dimensional Data




                               Mining and Searching Complex Structures




                                               89
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




                    A Microarray Dataset

         Figure: a table with 100-500 rows (samples) and 1,000-100,000 columns (Gene1, Gene2, ...);
         each sample is labeled with a class, Cancer or ~Cancer.

        • Find closed patterns which occur frequently among genes.
        • Find rules which associate certain combination of the
          columns that affect the class of the rows
           –Gene1,Gene10,Gene1001 -> Cancer
                                      Mining and Searching Complex Structures




                    Challenge I
       • Large number of patterns/rules
         –the number of possible column combinations is extremely high
       • Solution: the concept of a closed pattern
         –Patterns that are found in exactly the same set of rows are grouped together
         and represented by their upper bound
       • Example: the patterns {a,e,h}, {a,e}, {a,h}, {e,h}, {e} and {h} below are all found
         in exactly rows 2, 3 and 4; {a,e,h} is their upper bound (the closed pattern) and
         {e}, {h} are the lower bounds. "a", however, is not part of the group, since it
         also occurs in row 1.

                                 i    ri                  Class
                                 1    a,b,c,l,o,s         C
                                 2    a,d,e,h,p,l,r       C
                                 3    a,c,e,h,o,q,t       C
                                 4    a,e,f,h,p,r         ~C
                                 5    b,d,f,g,l,q,s,t     ~C
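       A small Python illustration of the grouping, using the example table above (not code from
       the original work):

          rows = {
              1: set("abclos"), 2: set("adehplr"), 3: set("acehoqt"),
              4: set("aefhpr"), 5: set("bdfglqst"),
          }

          def support(pattern):
              # Rows that contain every item of the pattern
              return {i for i, items in rows.items() if set(pattern) <= items}

          closed = rows[2] & rows[3] & rows[4]       # upper bound of the group
          print(sorted(closed))                       # ['a', 'e', 'h']
          print(support("eh"), support("aeh"))        # both {2, 3, 4}: same group
          print(support("a"))                         # {1, 2, 3, 4}: not in the group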
                                      Mining and Searching Complex Structures




                                                      90
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Challenge II
        • Most existing frequent pattern discovery algorithms search the
          column/item enumeration space, i.e. they systematically test various
          combinations of columns/items
        • For datasets with 1,000-100,000 columns, this search space is enormous
        • Instead we adopt a novel row/sample enumeration algorithm for this
          purpose. CARPENTER (SIGKDD'03) is the FIRST algorithm to adopt this
          approach




                                   Mining and Searching Complex Structures




            Column/Item Enumeration Lattice

        • Each node in the lattice represents a combination of columns/items
        • An edge exists from node A to B if A is a subset of B and A differs
          from B by only 1 column/item
        • Search can be done breadth first, starting from {}

          Figure: the column/item enumeration lattice over {a, b, c, e}, from {} at the
          bottom up to {a, b, c, e}, shown next to the example table of rows 1-5 and
          their classes.
                                   Mining and Searching Complex Structures




                                                   91
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




               Column/Item Enumeration Lattice
       • Each node in the lattice represents a combination of columns/items
       • An edge exists from node A to B if A is a subset of B and A differs
         from B by only 1 column/item
       • Search can be done depth first
       • Keep edges from parent to child only if the child is the prefix of the
         parent

         Figure: the same lattice over {a, b, c, e}, with only the prefix edges kept,
         shown next to the example table of rows 1-5.
                                    Mining and Searching Complex Structures




        General Framework for Column/Item Enumeration

                               Read-based               Write-based                    Point-based

      Association Mining    Apriori [AgSr94],        Eclat, MaxClique [Zaki01],         H-Mine
                            DIC                      FPGrowth [HaPe00]

      Sequential Pattern    GSP [AgSr96]             SPADE [Zaki98, Zaki01],
      Discovery                                      PrefixSpan [PHPC01]

      Iceberg Cube          Apriori [AgSr94]                                            BUC [BeRa99],
                                                                                        H-Cubing [HPDW01]




                                    Mining and Searching Complex Structures




                                                    92
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          A Multidimensional View

      Figure: a multidimensional view of pattern mining. One axis lists the types of data
      or knowledge (associative pattern, sequential pattern, iceberg cube, others); another
      lists the lattice traversal / main operations (read, write, point); further dimensions
      cover the compression method (closed/max pattern), the pruning method (constraints),
      and other interestingness measures.
                                  Mining and Searching Complex Structures




               Sample/Row Enumeration Algorithms

         • To avoid searching the large column/item enumeration space, our
           mining algorithms search for patterns/rules in the sample/row
           enumeration space
         • Our algorithms do not fit into the column/item enumeration
           framework
         • They are not YAARMA (Yet Another Association Rules Mining
           Algorithm)
         • Column/item enumeration algorithms simply do not scale for
           microarray datasets




                                  Mining and Searching Complex Structures




                                                  93
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




          Existing Row/Sample Enumeration Algorithms

        • CARPENTER(SIGKDD'03)
          –Find closed patterns using row enumeration
        • FARMER(SIGMOD’04)
           –Find interesting rule groups and build classifiers based on them
        • COBBLER(SSDBM'04)
          –Combined row and column enumeration for tables with large
          number of rows and columns
        • Topk-IRG(SIGMOD’05)
          –Find top-k covering rules for each sample and build classifier
          directly
        • Efficiently Finding Lower Bound Rules(TKDE’2010)
          –Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao.
          What is Unequal among the Equals? Ranking Equivalent Rules from
          Gene Expression Data. Accepted in TKDE
                                       Mining and Searching Complex Structures




           Concepts of CARPENTER
                           ij R (ij )
                                                     C       ~C
          i              ri         Class       a    1,2,3   4
                                                b    1       5
              1   a,b,c,l,o,s       C                                                 C         ~C
                                                c    1,3
              2   a,d,e,h,p,l,r     C           d    2       5                   a    1,2,3     4
              3   a,c,e,h,o,q,t     C           e    2,3     4                   e    2,3       4
              4   a,e,f,h,p,r       ~C          f            4,5                 h    2,3       4
              5   b,d,f,g,l,q,s,t   ~C          g            5
                                                h    2,3     4                       TT|{2,3}
              Example Table                     l    1,2     5
                                                o    1,3
                                                p    2       4
                                                q    3       5
                                                r    2       4
                                                s    1       5
                                                t    3       5

                                            Transposed Table,TT
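           A minimal Python sketch of the transposed table and of a conditional table such as
           TT|{2,3} (illustrative only, built from the example table above):

              from collections import defaultdict

              rows = {
                  1: set("abclos"), 2: set("adehplr"), 3: set("acehoqt"),
                  4: set("aefhpr"), 5: set("bdfglqst"),
              }

              # Transposed table TT: item -> set of row ids containing it
              tt = defaultdict(set)
              for rid, items in rows.items():
                  for item in items:
                      tt[item].add(rid)

              def conditional(tt, row_set):
                  # TT|X: keep only the items that occur in every row of X
                  return {item: rids for item, rids in tt.items() if row_set <= rids}

              print(sorted(conditional(tt, {2, 3})))   # ['a', 'e', 'h'], matching TT|{2,3}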
                                       Mining and Searching Complex Structures




                                                       94
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




                    Row Enumeration

         Figure: the row enumeration tree over rows {1,...,5}, starting from {}. Each node is a
         set of rows labeled with the closed pattern obtained by intersecting those rows, e.g.
         node 1 -> {abclos}, node 12 -> {al}, node 13 -> {aco}, node 23 -> {aeh}, node 123 -> {a},
         node 45 -> {f}. The conditional transposed tables TT|{1}, TT|{12} and TT|{123} are shown
         alongside the full transposed table TT.
                                          Mining and Searching Complex Structures




            Pruning Method 1

        •   Removing rows that appear in all tuples
            of the transposed table will not affect the results
                                                                                                      C        ~C
                                                                                              a       1,2,3    4
                                                                                              e       2,3      4
                                                                                              h       2,3      4
                                      r2 r3                    r2 r3 r4
                                                                                                  TT|{2,3}
                                     {aeh}                      {aeh}



                         r4 has 100% support in the conditional table of
                         "r2 r3", therefore the branch "r2 r3 r4" will be
                         pruned.
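           A small sketch of this pruning test on the TT|{2,3} example (illustrative Python, not
           the CARPENTER implementation):

              def full_support_rows(cond_tt, current_rows):
                  # Rows beyond the current node X that occur in every tuple of TT|X;
                  # the branches that add such rows can be pruned (Pruning Method 1).
                  if not cond_tt:
                      return set()
                  return set.intersection(*cond_tt.values()) - set(current_rows)

              tt_23 = {"a": {1, 2, 3, 4}, "e": {2, 3, 4}, "h": {2, 3, 4}}
              print(full_support_rows(tt_23, {2, 3}))   # {4}: branch "r2 r3 r4" is pruned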




                                          Mining and Searching Complex Structures




                                                          95
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




                  Pruning method 2

                    • If a rule (closed pattern) has been discovered before, we can
                      prune the enumeration below this node
                      –Because all rules below this node have been discovered before
                      –For example, at node 34 we find that {aeh} has already been
                      discovered (at node 23), so all branches below node 34 can be
                      pruned off

                                                                     C       ~C
                                                            a        1,2,3   4
                                                            e        2,3     4
                                                            h        2,3     4

                                                                TT|{3,4}

                      Figure: the row enumeration tree, with the subtree below node 34 pruned.
                                    Mining and Searching Complex Structures




            Pruning Method 3: Minimum Support


        • Example: From TT|{1} (below), we can see that the support of any
          pattern below node {1} will be at most 5 rows.

                                                                       TT|{1}:   ij    C        ~C
                                                                                  a     1,2,3    4
                                                                                  b     1        5
                                                                                  c     1,3
                                                                                  l     1,2      5
                                                                                  o     1,3
                                                                                  s     1        5
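          The same bound as a small sketch on TT|{1} (illustrative Python):

              def max_support(cond_tt, current_rows):
                  # Upper bound on the support of any pattern below node X: the rows of X
                  # plus every row id still present in TT|X (Pruning Method 3).
                  future = set().union(*cond_tt.values()) if cond_tt else set()
                  return len(set(current_rows) | future)

              tt_1 = {"a": {1, 2, 3, 4}, "b": {1, 5}, "c": {1, 3},
                      "l": {1, 2, 5}, "o": {1, 3}, "s": {1, 5}}
              print(max_support(tt_1, {1}))   # 5; if 5 < minsup, the branch below {1} is pruned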


                                    Mining and Searching Complex Structures




                                                      96
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




           From CARPENTER to FARMER
          • What if classes exist? What more can we
            do?
         • Pruning with Interestingness Measure
            –Minimum confidence
            –Minimum chi-square
         • Generate lower bounds for classification/
           prediction




                               Mining and Searching Complex Structures




              Interesting Rule Groups
      • Concept of a rule group/equivalence class
        –rules supported by exactly the same set of rows are grouped together
      • Example: the following rules are all derived from rows 2, 3 and 4 with
        66% confidence

              aeh --> C (66%)                          (upper bound)
              ae --> C (66%),  ah --> C (66%),  eh --> C (66%)
              e --> C (66%),   h --> C (66%)           (lower bounds)

              a --> C, however, is not in the group (it is also supported by row 1).

                                 i    ri                  Class
                                 1    a,b,c,l,o,s         C
                                 2    a,d,e,h,p,l,r       C
                                 3    a,c,e,h,o,q,t       C
                                 4    a,e,f,h,p,r         ~C
                                 5    b,d,f,g,l,q,s,t     ~C

                               Mining and Searching Complex Structures




                                               97
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




            Pruning by Interestingness Measure
            • In addition, find only interesting rule groups (IRGs) based
              on some measures:
              –minconf: the rules in the rule group can predict the class on
              the RHS with high confidence
              –minchi: there is high correlation between LHS and RHS of
              the rules based on chi-square test
             • Other measures like lift, entropy gain, conviction etc. can
               be handled similarly




                                    Mining and Searching Complex Structures




       Ordering of Rows: All Class C Rows Before ~C Rows

         Figure: the same row enumeration tree and transposed tables as before, with the rows
         ordered so that all class C rows come before the ~C rows.
                                    Mining and Searching Complex Structures




                                                    98
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




                Pruning Method: Minimum Confidence



        • Example: In TT|{2,3} on the right,                              C          ~C
          the maximum confidence of all rules                      a      1,2,3,6    4,5
          below node {2,3} is at most 4/5                          e      2,3,7      4,9
                                                                   h      2,3        4

                                                                          TT|{2,3}




                                Mining and Searching Complex Structures




                Pruning method: Minimum chi-square


              Same as in computing                                        C          ~C
              maximum confidence                                   a      1,2,3,6    4,5
                                                                   e      2,3,7      4,9
                                                                   h      2,3        4

                                                                          TT|{2,3}

                        C              ~C             Total
          A             max=5          min=1          computed
          ~A            computed       computed       computed
          Total         constant       constant       constant



                                Mining and Searching Complex Structures




                                                99
Mining and Searching Complex Structures                                                Chapter 2 High Dimensional Data




           Finding Lower Bound, MineLB

           – Example: An upper bound rule with antecedent A = abcde and two
             non-supporting rows (r1 : abcf) and (r2 : cdeg)
           – Initialize lower bounds {a, b, c, d, e}
           – Add “abcf” --- new lower bounds {d, e}
           – Add “cdeg” --- new lower bounds {ad, bd, ae, be}

           [Figure: the lattice below A = a,b,c,d,e, with the rows abc and cde and
           the candidate lower bounds ad, ae, bd, be.]

           Candidate lower bounds: ad, ae, bd, be, cd, ce
           – cd, ce are removed since they are still contained in the new row cdeg
           – ad, ae, bd, be are kept since no remaining lower bound overrides them
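        A minimal sketch of this incremental lower-bound maintenance, assuming the
        lower bounds of a rule group are the minimal item sets contained in the
        antecedent A but in none of the non-supporting rows (one of several
        equivalent ways to generate candidates, not the authors' exact pseudocode);
        it reproduces the example above.

            def minelb_update(bounds, A, new_row):
                # One update step: a new non-supporting row invalidates every lower
                # bound it contains; each invalidated bound is extended with an item
                # the row lacks, and a candidate is kept only if no kept bound is
                # contained in it. (A full implementation would also prune
                # candidates against one another.)
                A, new_row = set(A), set(new_row)
                valid = [b for b in bounds if not b <= new_row]
                invalid = [b for b in bounds if b <= new_row]
                result = list(valid)
                for b in invalid:
                    for item in A - new_row:
                        cand = b | {item}
                        if not any(r <= cand for r in result):
                            result.append(cand)
                return result

            A = set("abcde")                                   # antecedent of the upper bound rule
            bounds = [{x} for x in A]                          # initial lower bounds {a}, ..., {e}
            for row in ["abcf", "cdeg"]:                       # the two non-supporting rows
                bounds = minelb_update(bounds, A, row)
            print(sorted("".join(sorted(b)) for b in bounds))  # ['ad', 'ae', 'bd', 'be']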

                                        Mining and Searching Complex Structures




           Implementation

           • In general, CARPENTER/FARMER can be implemented in many ways:
             – FP-tree
             – Vertical format (a toy sketch follows the table below)
           • For our case, we assume the dataset fits into main memory and use a
             pointer-based algorithm similar to BUC

             Transposed table, item ij -> R(ij):
                          C        ~C
                    a     1,2,3    4
                    b     1        5
                    c     1,3
                    d     2        5
                    e     2,3      4
                    f              4,5
                    g              5
                    h     2,3      4
                    l     1,2      5
                    o     1,3
                    p     2        4
                    q     3        5
                    r     2        4
                    s     1        5
                    t     3        5
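        As a toy illustration of the vertical format, the sketch below rebuilds the
        transposed table shown above from the five example rows (rows 1-3 are class
        C, rows 4-5 are ~C); the actual implementation uses a pointer-based
        in-memory structure in the style of BUC rather than Python dictionaries.

            # Example rows reconstructed from the transposed table above
            # (rows 1-3 belong to class C, rows 4-5 to ~C).
            rows = {1: set("abclos"), 2: set("adehlpr"), 3: set("acehoqt"),
                    4: set("aefhpr"), 5: set("bdfglqst")}
            C_rows = {1, 2, 3}

            def transpose(rows):
                # vertical format: item -> set of row ids containing it
                tt = {}
                for rid, items in rows.items():
                    for item in items:
                        tt.setdefault(item, set()).add(rid)
                return tt

            tt = transpose(rows)
            print(sorted(tt["a"] & C_rows), sorted(tt["a"] - C_rows))   # [1, 2, 3] [4]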



                                        Mining and Searching Complex Structures




                                                       100
Mining and Searching Complex Structures                                            Chapter 2 High Dimensional Data




          Experimental studies

         • Efficiency of FARMER
            –On five real-life datasets
               • lung cancer (LC), breast cancer (BC) , prostate cancer (PC), ALL-
               AML leukemia (ALL), Colon Tumor(CT)
            –Varying minsup, minconf, minchi
            –Benchmark against
               • CHARM [ZaHs02] ICDM'02
                • Bayardo’s algorithm (ColumnE) [BaAg99] SIGKDD'99
         • Usefulness of IRGs
           –Classification




                                Mining and Searching Complex Structures




          Example results--Prostate

                   [Chart: runtime (log scale) of FARMER, ColumnE and CHARM on the
                   prostate data as minimum support varies from 3 to 9.]




                                Mining and Searching Complex Structures




                                                 101
Mining and Searching Complex Structures                                            Chapter 2 High Dimensional Data




          Example results--Prostate

                  [Chart: runtime of FARMER (minsup=1, minchi=10) and FARMER
                  (minsup=1) on the prostate data as minimum confidence varies
                  from 0% to 99%.]


                               Mining and Searching Complex Structures




         Top k Covering Rule Groups

        • Rank rule groups (upper bound) according to
           – Confidence
           – Support
        • Top k Covering Rule Groups for row r
            – the k highest-ranking rule groups that have row r in their support and
              support > minimum support
        • Top k Covering Rule Groups =
          TopKRGS for each row




                               Mining and Searching Complex Structures




                                              102
Mining and Searching Complex Structures                                            Chapter 2 High Dimensional Data




            Usefulness of Rule Groups

        •   Rules for every row
        •   Top-1 covering rule groups sufficient to build CBA classifier
        •   No min confidence threshold, only min support
        •   #TopKRGS = k x #rows




                                          Mining and Searching Complex Structures




             Top-k covering rule groups

          • For each row, we find the most significant k rule groups:
            – based on confidence first
            – then support
          • Given minsup=1, Top-1:
            – row 1: abc -> C1 (sup = 2, conf = 100%)
            – row 2: abc -> C1
                • abcd -> C1 (sup = 1, conf = 100%)
            – row 3: cd -> C1 (sup = 2, conf = 66.7%)
                • If minconf = 80%, ?
            – row 4: cde -> C2 (sup = 1, conf = 50%)

                              class          Items
                              C1             a,b,c
                              C1             a,b,c,d
                              C1             c,d,e
                              C2             c,d,e




                                          Mining and Searching Complex Structures




                                                         103
Mining and Searching Complex Structures                                            Chapter 2 High Dimensional Data




            Main advantages of Top-k covering rule groups

          • The number of rule groups is bounded by the product of k and the
            number of samples
          • Treats each sample equally and provides a (small) complete
            description for each row
          • No minimum confidence parameter is needed -- k is specified instead
          • Sufficient to build classifiers while avoiding excessive computation




                                  Mining and Searching Complex Structures




          Top-k pruning
         • At node X, the maximal set of rows covered by rules to be discovered
           below X -- rows containing X and rows ordered after X.
            – minconf = the minimum confidence of the discovered TopKRGs over all
              rows in the above set
            – minsup = the corresponding minimum support
         • Pruning
            – If the estimated upper bound of confidence below X < minconf, prune
            – If the confidence is the same and the support is smaller, prune
         • Optimizations




                                  Mining and Searching Complex Structures




                                                  104
Mining and Searching Complex Structures                                            Chapter 2 High Dimensional Data




           Classification based on association rules
          • Step 1: Generate the complete set of association rules for each class
            (satisfying minimum support and minimum confidence)
             – The CBA algorithm adopts an Apriori-like algorithm, which fails at
               this step on microarray data
          • Step 2: Sort the set of generated rules
          • Step 3: Select a subset of rules from the sorted rule sets to form the
            classifier




                                   Mining and Searching Complex Structures




           Features of RCBT classifiers

                      Problems                             RCBT

         To discover, store, retrieve and       Mine only those rules to be used for
         sort a large number of rules           classification, e.g. the Top-1 rule
                                                group is sufficient to build a CBA
                                                classifier

         Default class not convincing for       Main classifier + some back-up
         biologists                             classifiers

         Rules with the same discriminating     A subset of lower bound rules --
         ability, how to integrate?             integrated using a score considering
         (Upper bound rules: specific;          both confidence and support
         lower bound rules: general)


                                   Mining and Searching Complex Structures




                                                  105
Mining and Searching Complex Structures                                            Chapter 2 High Dimensional Data




          Experimental studies
            • Datasets: 4 real-life datasets
           • Efficiency of Top-k Rule mining
             –Benchmark: Farmer, Charm, Closet+
           • Classification Methods:
             –CBA (build using top-1 rule group)
             –RCBT (our proposed method)
             –IRG Classifier
             –Decision trees (single, bagging, boosting)
             –SVM




                                          Mining and Searching Complex Structures




      Runtime v.s. Minimum support on ALL-AML dataset

                    [Chart: runtime (seconds, log scale) vs minimum support
                    (17 to 25) on the ALL-AML dataset for FARMER,
                    FARMER(minconf=0.9), FARMER+prefix(minconf=0.9), TOP1 and
                    TOP100.]


                                          Mining and Searching Complex Structures




                                                         106
Mining and Searching Complex Structures                                            Chapter 2 High Dimensional Data




          Scalability with k

                    [Chart: runtime (seconds, log scale) vs k (100 to 1000) on
                    the PC and ALL datasets.]




                                             Mining and Searching Complex Structures




        Biological meaning –Prostate Cancer Data
                    [Chart: frequency of occurrence vs gene rank on the prostate
                    cancer data; frequently occurring genes include W72186,
                    AF017418, AI635895, X14487, AB014519, M61916 and Y13323.]


                                             Mining and Searching Complex Structures




                                                              107
Mining and Searching Complex Structures                                            Chapter 2 High Dimensional Data




          Classification results




                               Mining and Searching Complex Structures




          Classification results




                               Mining and Searching Complex Structures




                                              108
Mining and Searching Complex Structures                                            Chapter 2 High Dimensional Data




            References
        •       Anthony K. H. Tung, Rui Zhang, Nick Koudas, Beng Chin Ooi. "Similarity Search:
                A Matching Based Approach", VLDB'06
        •       H. V. Jagadish, Raymond T. Ng, Beng Chin Ooi, Anthony K. H. Tung, "ItCompress:
                An Iterative Semantic Compression Algorithm". International Conference on Data
                Engineering (ICDE'2004), Boston, 2004.
        •       Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, Anthony K. H. Tung.
                Similarity Search on Bregman Divergence: Towards Non-Metric Indexing. In the
                Proceedings of the 35th International Conference on Very Large Data Bases(VLDB),
                Lyon, France August 24-28, 2009.
        •       Jia Xu, Zhenjie Zhang, Anthony K. H. Tung, and Ge Yu. "Efficient and Effective
                Similarity Search over Probabilistic Data Based on Earth Mover's
                Distance". To appear in VLDB 2010; a preliminary version appears as
                Technical Report TRA5-10, National University of Singapore.
        •       Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering
                Rule Groups for Gene Expression Data". In Proceedings of SIGMOD'05,
                Baltimore, Maryland, 2005.
        •       Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao. What is Unequal
                among the Equals? Ranking Equivalent Rules from Gene Expression Data.
                Accepted in TKDE




                                      Mining and Searching Complex Structures




            Optional References:
            •     Feng Pan, Gao Cong, Anthony K. H. Tung, Jiong Yang, Mohammed Zaki,
                  "CARPENTER: Finding Closed Patterns in Long Biological Datasets",
                  In Proceedings KDD'03, Washington, DC, USA, August 24-27, 2003.
            •     Gao Cong, Anthony K. H. Tung, Xin Xu, Feng Pan, Jiong Yang.
                  "FARMER: Finding Interesting Rule Groups in Microarray Datasets".
                  In SIGMOD'04, June 13-18, 2004, Maison de la Chimie, Paris, France.
            •     Feng Pan, Anthony K. H. Tung, Gao Cong, Xin Xu. "COBBLER:
                  Combining Column and Row Enumeration for Closed Pattern
                  Discovery". SSDBM 2004, Santorini Island, Greece.
            •     Gao Cong, Kian-Lee Tan, Anthony K.H. Tung, Feng Pan. “Mining
                  Frequent Closed Patterns in Microarray Data”. In IEEE International
                  Conference on Data Mining, (ICDM). 2004
            •     Xin Xu, Ying Lu, Anthony K.H. Tung, Wei Wang. "Mining Shifting-and-
                  Scaling Co-Regulation Patterns on Gene Expression Profiles". ICDE
                  2006.




                                      Mining and Searching Complex Structures




                                                     109
Mining and Searching Complex Structures                         Chapter 3 Similarity Search on Sequences




                      Searching and Mining Complex
                                Structures
                          Similarity Search on Sequences
                                     Anthony K. H. Tung(鄧锦浩)
                                         School of Computing
                                    National University of Singapore
                                     www.comp.nus.edu.sg/~atung




          Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
          Social Network Link: http://www.renren.com/profile.do?id=313870900




             Types of sequences
             Symbolic vs Numeric
                 We only touch discrete symbols here. Sequences of numbers are called
                 time series, which is a huge topic by itself!
             Single dimension vs multi-dimensional
                Example: Yueguo Chen, Shouxu Jiang, Beng Chin Ooi, Anthony K. H. Tung.
                "Querying Complex Spatial-Temporal Sequences in Human Motion Databases"
                accepted and to appear in 24th IEEE International Conference on Data Engineering
                (ICDE) 2008
             Single long sequence vs multiple sequences




              2010-7-31




                                                     110
Mining and Searching Complex Structures                Chapter 3 Similarity Search on Sequences




              Outline
              • Searching based on a disk based suffix tree
              • Approximate Matching Using Inverted List (Vgrams)
              • Approximate Matching Based on B+ Tree (BED Tree)




              2010-7-31




              Suffix
                          Suffixes of acacag$:

                             1.   acacag$
                             2.   cacag$
                             3.   acag$
                             4.   cag$
                             5.   ag$
                             6.   g$
                             7.   $


              2010-7-31




                                                 111
Mining and Searching Complex Structures                                               Chapter 3 Similarity Search on Sequences




              Suffix Trie

               E.g. consider the string S = acacag$
               Suffix trie: a trie of all possible suffixes of S

                           Suffix
                     1     acacag$
                     2     cacag$
                     3     acag$
                     4     cag$
                     5     ag$
                     6     g$
                     7     $

               [Figure: the suffix trie of acacag$, with one root-to-leaf path per
               suffix and each leaf labelled by the starting position of its suffix.]
              2010-7-31




              Suffix Tree (I)
               Suffix tree for S=acacag$: merge nodes with only one child

                                                          1 2 3 4 5 6 7
                                                     S =  a c a c a g $

               [Figure: the suffix tree of acacag$. The path label of a node v,
               e.g. "aca", is denoted α(v); "ca" is an example of an edge label;
               an edge ending at a leaf is a leaf edge.]


              2010-7-31




                                                                  112
Mining and Searching Complex Structures                                             Chapter 3 Similarity Search on Sequences




              Suffix Tree (II)
               A suffix tree has exactly n leaves and at most n-1 internal
                 (branching) nodes, so O(n) edges
               The label of each edge can be represented using 2 indices
               Thus, the suffix tree can be represented using O(n log n) bits

                                                          1 2 3 4 5 6 7
                                                     S =  a c a c a g $

               [Figure: the suffix tree of acacag$ with each edge label replaced by
               an index pair into S, e.g. (2,3) for "ca" and (6,7) for "g$".]

               Note: The end index of every leaf edge should be 7, the last index
               of S. Thus, for leaf edges, we only need to store the start index.
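               A tiny sketch of this index-pair representation (positions are
               1-based, as on the slide):

                   S = "acacag$"                       # positions 1..7

                   def edge_label(start, end):
                       # an edge stores only the pair (start, end); its label is read off S
                       return S[start - 1:end]

                   print(edge_label(2, 3))             # 'ca'
                   print(edge_label(6, 7))             # 'g$'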

              2010-7-31




              Generalized suffix tree
              Build a suffix tree for two or more strings
              E.g. S1 = acgat#, S2 = cgt$



               [Figure: the generalized suffix tree of S1 = acgat# and S2 = cgt$;
               each leaf is labelled with the starting position of a suffix of S1
               or S2.]


              2010-7-31




                                                                        113
Mining and Searching Complex Structures                            Chapter 3 Similarity Search on Sequences




              Straightforward construction of suffix tree

              Consider S = s1s2…sn where sn=$

               Algorithm:
                 Initialize the tree with only a root
                 For i = n down to 1
                           Insert S[i..n] into the tree

               Time: O(n^2)
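               A minimal Python sketch of this quadratic-time idea, using an
               uncompressed suffix trie for simplicity; a real suffix tree would
               additionally merge chains of single-child nodes into labelled edges,
               as in the earlier slides.

                   def build_suffix_trie(s):
                       # insert every suffix s[i:] into a trie; a node is a dict of child edges
                       root = {}
                       for i in range(len(s)):
                           node = root
                           for ch in s[i:]:
                               node = node.setdefault(ch, {})
                       return root

                   trie = build_suffix_trie("acacag$")
                   print(sorted(trie.keys()))          # ['$', 'a', 'c', 'g']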




              2010-7-31




              Example of construction
              S=acca$



              [Figure: Init builds I5 (the tree containing just $); the for-loop
              then inserts the remaining suffixes of acca$ from shortest to longest,
              producing I4, I3, I2 and finally I1.]
              2010-7-31




                                                      114
Mining and Searching Complex Structures                   Chapter 3 Similarity Search on Sequences




                          Construction of generalized suffix tree
              S’= c#



                      [Figure: starting from I1, the suffix tree of acca$, the
                      for-loop inserts the suffixes of S' = c#, producing J2 and
                      then the generalized suffix tree J1.]
              2010-7-31




              Property of suffix tree
               Fact: For any internal node v in the suffix tree, if the path label
                 of v is α(v) = ap (a character a followed by a string p), then
                     there exists another node w in the suffix tree such that
                     α(w) = p.

               Proof: Skip the proof.

               Definition of Suffix Link:
                     For any internal node v, define its suffix link sl(v) = w.




              2010-7-31




                                                 115
Mining and Searching Complex Structures                                 Chapter 3 Similarity Search on Sequences




              Suffix Link example
              S=acacag$


               [Figure: the suffix tree of acacag$ with its suffix links, e.g. from
               the node with path label "aca" to the node with path label "ca", and
               from "ca" to "a".]

              2010-7-31




           Can we construct a suffix tree in O(n)
           time?
               Yes. We can construct it in O(n) time and O(n) space
                     Weiner's algorithm [1973]
                           Linear time for constant-size alphabets, but much space
                     McCreight's algorithm [JACM 1976]
                           Linear time for constant-size alphabets, quadratic space
                     Ukkonen's algorithm [Algorithmica, 1995]
                           Online algorithm, linear time for constant-size alphabets, less space
                     Farach's algorithm [FOCS 1997]
                           Linear time for general alphabets
                     Hon, Sadakane, and Sung's algorithm [FOCS 2003]
                           O(n)-bit space, O(n log^e n) time for 0 < e < 1
                           O(n)-bit space, O(n) time for suffix array construction
               But they are all in-memory algorithms that do not guarantee locality
                 of processing

              2010-7-31




                                                          116
Mining and Searching Complex Structures                              Chapter 3 Similarity Search on Sequences




                               Trellis Algorithm
                      A novel disk-based suffix tree construction
                          algorithm designed specifically for DNA
                          sequences
                      Scales gracefully for very large genome sequences
                          (i.e. human genome)
                      Unlike existing algorithms,
                               Trellis exhibits no data skew problem
                               Trellis recovers suffix links quickly
                               Trellis has fast construction and query time
                      Trellis is a 4-step algorithm
                   2010-7-31




          Trellis: Algorithm Overview
             Trellis is a 4-step pipeline (the figure on the slide shows the data flow):
               1. Variable-length prefix creation: choose variable-length prefixes,
                  e.g. AA, ACA, ACC, ...
               2. Prefixed suffix sub-trees: split the input S into partitions
                  R0 ... Rr-1, build a suffix tree TRi for each partition and split
                  it into prefixed sub-trees TRi,Pj stored on disk
               3. Tree merging: for each prefix Pi, merge TR0,Pi ... TRr-1,Pi into TPi
               4. Suffix link recovery (optional)
                   2010-7-31




                                                             117
Mining and Searching Complex Structures                                                                                                 Chapter 3 Similarity Search on Sequences




                         1. Variable-length Prefix Creation

                         Goal: Separate the complete suffix tree by prefixes of
                         suffixes, such that each subtree can reside entirely in the
                         available memory
                          Main Idea: Expand prefixes only as needed

                          [Chart: frequency of each length-2 prefix (AA, AC, ..., TT)
                          over the human genome; counts reach roughly 300,000,000 and
                          are far from uniform, e.g. AA and TT are the most frequent
                          while CG is the rarest.]




             2010-7-31




                  2. Suffix Tree Partitioning

              [Diagram: as in the overview, each partition Ri of S yields a suffix
              tree TRi, which is split into prefixed suffix sub-trees TRi,Pj that
              are written to disk.]

              • Use Ukkonen's method because of its efficiency: O(n) time & space
              • Discard suffix links when storing the subtrees on disk
              • Store enough information so that a subtree can be rebuilt quickly,
                e.g. edge starting index, edge length, node parent, etc.

                   2010-7-31




                                                                                                             118
Mining and Searching Complex Structures                             Chapter 3 Similarity Search on Sequences




               3. Suffix Tree Merging
              [Diagram: as in the overview, the prefixed sub-trees TR0,Pi ... TRr-1,Pi
              stored on disk are merged into a single tree TPi for each prefix Pi.]




                   2010-7-31




               Merge Algorithm

                    [Figure: T1 has child edges labelled A and C; T2 has child edges
                    labelled G and T.]

                    Case 1: No common prefix




                   2010-7-31




                                                            119
Mining and Searching Complex Structures                   Chapter 3 Similarity Search on Sequences




              Merge Algorithm

               [Figure: T1 now has child edges A, C and G; T2 has the remaining
               child edge T.]

               Case 1: No common prefix




              2010-7-31




              Merge Algorithm

               [Figure: left -- Case 1 as before; right -- T1 has child edges A and
               CAAT, while T2 has a child edge CAGGC.]

               Case 1: No common prefix          Case 2: Has common prefix




              2010-7-31




                                               120
Mining and Searching Complex Structures                   Chapter 3 Similarity Search on Sequences




              Merge Algorithm

               [Figure: left -- Case 1 as before; right -- the edges CAAT and CAGGC
               share the common prefix CA, so the merged tree has an edge CA with
               children AT and GGC.]

               Case 1: No common prefix          Case 2: Has common prefix
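               A toy sketch of these two merge cases, assuming a sub-tree is just a
               dictionary mapping edge labels to child sub-trees; the real Trellis
               merge operates on disk-resident node records, so this only
               illustrates the idea.

                   import os

                   def merge_children(t1, t2):
                       # sub-trees are dicts mapping edge labels to child sub-trees (leaves = {})
                       merged = dict(t1)
                       for label2, child2 in t2.items():
                           placed = False
                           for label1, child1 in list(merged.items()):
                               p = os.path.commonprefix([label1, label2])
                               if not p:
                                   continue                          # Case 1 w.r.t. this edge
                               # Case 2: common prefix -> split both edges at p and merge below it
                               del merged[label1]
                               merged[p] = merge_children(
                                   {label1[len(p):]: child1} if label1 != p else child1,
                                   {label2[len(p):]: child2} if label2 != p else child2)
                               placed = True
                               break   # sibling edges start with distinct characters
                           if not placed:
                               merged[label2] = child2               # Case 1: simply attach the edge
                       return merged

                   # The example from the slides: CAAT and CAGGC share the prefix CA.
                   print(merge_children({"CAAT": {}}, {"CAGGC": {}}))
                   # {'CA': {'AT': {}, 'GGC': {}}}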




              2010-7-31




                          4. Suffix Link Recovery
               Some internal nodes have suffix links from Ukkonen's algorithm in
                 Step #1
               Some internal nodes are created in the merging step and do not have
                 suffix links
               All suffix link information from Step #1 is discarded when the suffix
                 trees are stored on disk (keeping it does not speed this step up,
                 so it is discarded to simplify)
               Should suffix links be required, use the suffix link recovery
                 algorithm to rebuild them


              2010-7-31




                                               121
Mining and Searching Complex Structures                                                                             Chapter 3 Similarity Search on Sequences




                                        4. Suffix Link Recovery (cont.)
                           For each prefixed suffix tree, recursively call this function
                              from the tree’s root.
                            x: an internal node
                            L: the edge label between x and parent(x)

                                     RECOVER(x, L)
                                        if (x == root) sl(x) <- x;
                                        else {
                                             1. p = parent(x);
                                             2. q = sl(p);  // get suffix link of p, and load the prefix tree
                                                            // for q from disk if not in memory
                                             3. Skip/count using L to locate sl(x) under q;
                                        }
                                        for (each internal child y of x)
                                             RECOVER(y, edge-label(x, y));
                           2010-7-31




                           Experimental Results

                      [Charts: (left) construction time of Trellis vs TOP-Q and
                      DynaCluster for sequences up to ~120 Mbp; (right) construction
                      and link recovery time of Trellis vs TDD for sequences of
                      200-1000 Mbp. Memory: 512 MB; TOP-Q and DynaCluster parameters
                      were set as recommended in their papers.]

                      Human genome suffix tree (size ~3 Gbp, using 2 GB of memory):
                        • Trellis without links: 4.2 hr
                        • Trellis with links:    5.9 hr
                        • TDD:                   12.8 hr




                                                                                         122
Mining and Searching Complex Structures                                                                                   Chapter 3 Similarity Search on Sequences




                                                  Experimental Results (cont.)
                                     Disk Space Usage

                      [Chart: disk-based suffix tree size (GB), Trellis vs TDD, for
                      sequence lengths of 200-1000 Mbp.]

                      On average, Trellis uses about 27 bytes per character indexed
                      while TDD uses about 9.7 bytes.

                      For the human genome, TDD uses about 19.3 bytes/char because
                      it requires a 64-bit environment to index larger sequences;
                      Trellis remains at 27 bytes/char.

                      Human Genome: Trellis 72 GB vs TDD 54 GB

                      Disk-space vs query time tradeoff
            2010-7-31




                                                  Experimental Results (cont.)
                      Query time (without suffix links), Trellis vs TDD

                      [Chart: query times on the human genome suffix tree for query
                      lengths from 40 bp to 8000 bp; Trellis answers queries faster
                      than TDD.]

                      TDD
                        • smaller suffix trees
                        • edge length must be determined by examining all children nodes
                        • each internal node only has a pointer to its first child, i.e.
                          children must be linearly scanned during a query search

                      Trellis
                        • larger suffix trees
                        • edge length stored locally with its respective node
                        • all children locations stored locally, so each child can be
                          accessed in constant time, i.e. no linear scan needed

                      Hence, faster query time!
            2010-7-31




                                                                                                                123
Mining and Searching Complex Structures                                                         Chapter 3 Similarity Search on Sequences




                Experimental Results
                (cont.)
                 [Figure: searching consecutive queries over S using suffix links;
                 after matching a query at a node v with path label xα, the search
                 continues from sl(v), whose path label is α.]

                 Query length = 100
                 • Uses suffix links to move across the tree to search for the next
                   query
                 • Mimics the behavior of exact-match anchor search during a genome
                   alignment
                2010-7-31




                            Experiment Results (cont.)
              Query time (with suffix links)
                      [Chart: Trellis query times on the human genome suffix tree,
                      with suffix links vs without suffix links, for query lengths
                      from 40 bp to 8000 bp.]




            2010-7-31




                                                                                124
Mining and Searching Complex Structures                 Chapter 3 Similarity Search on Sequences




                          Summary
              Trellis builds a disk-based suffix tree based on
                    A partitioning method via variable-length prefixes
                    A suffix subtree merging algorithm
              Trellis is both time and space efficient
              Trellis quickly recovers suffix links
              Faster than existing leading methods in both
                construction and query time




              2010-7-31




              Outline
              • Searching based on a disk based suffix tree
              • Approximate Matching Using Inverted List (Vgrams)
              • Approximate Matching Based on B+ Tree (BED Tree)




              2010-7-31




                                               125
Mining and Searching Complex Structures                              Chapter 3 Similarity Search on Sequences




                          Example 1: a movie database






                                  Find movies starring Samuel Jackson
                                 Star                        Title                Year   Genre
                           Keanu Reeves        The Matrix                         1999    Sci-Fi
                           Samuel Jackson      Star Wars: Episode III - Revenge   2005    Sci-Fi
                                               of the Sith
                           Schwarzenegger      The Terminator                     1984    Sci-Fi
                           Samuel Jackson      Goodfellas                         1990   Drama
                           …                   …                                   …       …
            2010-7-31




                        How about Schwarrzenger?




                                        The user doesn’t know the exact spelling!


                                 Star                        Title                Year   Genre
                           Keanu Reeves        The Matrix                         1999   Sci-Fi
                           Samuel Jackson      Star Wars: Episode III - Revenge   2005   Sci-Fi
                                               of the Sith
                           Schwarzenegger      The Terminator                     1984   Sci-Fi
                           Samuel Jackson      Goodfellas                         1990   Drama
                           …                   …                                   …       …
            2010-7-31




                                                        126
Mining and Searching Complex Structures                                 Chapter 3 Similarity Search on Sequences




                            Relax Condition




                                   Find movies with a star “similar to” Schwarrzenger.

                                       Star                     Title                Year   Genre
                               Keanu Reeves       The Matrix                         1999   Sci-Fi
                               Samuel Jackson     Star Wars: Episode III - Revenge   2005   Sci-Fi
                                                  of the Sith
                               Schwarzenegger     The Terminator                     1984   Sci-Fi
                               Samuel Jackson     Goodfellas                         1990   Drama
                               …                  …                                   …      …
            2010-7-31




                Edit Distance
                Given two strings A and B, edit A to B with the
                  minimum number of edit operations:
                        Replace a letter with another letter
                        Insert a letter
                        Delete a letter
                E.g.
                        A = interestings                         _i__nterestings
                        B = bioinformatics                       bioinformatic_s
                                                                 101101101100110
                        Edit distance = 9
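                 A minimal sketch of this unit-cost edit distance computation; as a
                 usage example it measures how far the misspelled query from the
                 earlier movie example is from the stored star name.

                     def edit_distance(A, B):
                         # dp[i][j] = minimum number of edits turning A[:i] into B[:j]
                         n, m = len(A), len(B)
                         dp = [[0] * (m + 1) for _ in range(n + 1)]
                         for i in range(n + 1):
                             dp[i][0] = i                              # delete i letters
                         for j in range(m + 1):
                             dp[0][j] = j                              # insert j letters
                         for i in range(1, n + 1):
                             for j in range(1, m + 1):
                                 cost = 0 if A[i - 1] == B[j - 1] else 1      # replace (or match)
                                 dp[i][j] = min(dp[i - 1][j - 1] + cost,
                                                dp[i - 1][j] + 1,             # delete
                                                dp[i][j - 1] + 1)             # insert
                         return dp[n][m]

                     print(edit_distance("Schwarrzenger", "Schwarzenegger"))  # 3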



                2010-7-31




                                                           127
Mining and Searching Complex Structures                Chapter 3 Similarity Search on Sequences




              Edit Distance Computation
               Instead of minimizing the number of edit operations, we can associate
                 a cost function with the operations and minimize the total cost.
                 Such a cost is called the edit distance.
               For the previous example, the cost function is as follows:
                 A = _i__nterestings
                 B = bioinformatic_s
                     101101101100110
                 Edit distance = 9

                            _    A    C    G    T
                       _         1    1    1    1
                       A    1    0    1    1    1
                       C    1    1    0    1    1
                       G    1    1    1    0    1
                       T    1    1    1    1    0
              2010-7-31




                  Needleman-Wunsch algorithm (I)
              Consider two strings S[1..n] and T[1..m].
               Define V(i, j) to be the score of the optimal
                 alignment between S[1..i] and T[1..j]
              Basis:
                    V(0, 0) = 0
                    V(0, j) = V(0, j-1) + δ(_, T[j])
                          Insert j times
                    V(i, 0) = V(i-1, 0) + δ(S[i], _)
                          Delete i times




              2010-7-31




                                                 128
Mining and Searching Complex Structures                            Chapter 3 Similarity Search on Sequences




                Needleman-Wunsch algorithm (II)

              Recurrence: For i>0, j>0

                           V(i, j) = max of:
                               V(i−1, j−1) + δ(S[i], T[j])    (match/mismatch)
                               V(i−1, j)   + δ(S[i], _)       (delete)
                               V(i, j−1)   + δ(_, T[j])       (insert)

               In the alignment, the last pair must be either a match/mismatch, a deletion, or an insertion:
                                  xxx…xx                  xxx…xx              xxx…x_
                                       |                       |                   |
                                  xxx…yy                  yyy…y_              yyy…yy
                             match/mismatch             delete           insert
              2010-7-31




              Example (I)

                                     _       A      G        C     A      T          G    C
                              _      0      -1 -2 -3 -4 -5 -6 -7
                              A      -1
                              C      -2
                              A      -3
                              A      -4
                              T      -5
                              C      -6
                              C      -7
              2010-7-31




                                                       129
Mining and Searching Complex Structures                 Chapter 3 Similarity Search on Sequences




              Example (II)

                                    _    A    G    C    A    T    G    C
                               _    0   -1   -2   -3   -4   -5   -6   -7
                               A   -1    2    1    0   -1   -2   -3   -4
                               C   -2    1    1    ?
                               A   -3
                               A   -4
                               T   -5
                               C   -6
                               C   -7

                           The highlighted cell is V(2,3) = max(1 + 2, 0 − 1, 1 − 1) = 3; the next cell works out to 2.
              2010-7-31




              Example (III)

                                    _    A    G    C    A    T    G    C
                               _    0   -1   -2   -3   -4   -5   -6   -7
                               A   -1    2    1    0   -1   -2   -3   -4
                               C   -2    1    1    3    2    1    0   -1
                               A   -3    0    0    2    5    4    3    2
                               A   -4   -1   -1    1    4    4    3    2
                               T   -5   -2   -2    0    3    6    5    4
                               C   -6   -3   -3    0    2    5    5    7
                               C   -7   -4   -4   -1    1    4    4    7
              2010-7-31
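
               The values in this table are consistent with a score of +2 for a match and −1 for a
               mismatch or a gap (an inference from the numbers; the slides do not state the scores
               explicitly). A direct transcription of the recurrence into code, as a sketch:

```python
def needleman_wunsch(s: str, t: str, match=2, mismatch=-1, gap=-1) -> int:
    """Global alignment score V(|s|, |t|) following the recurrence above."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):              # V(0, j): insert j times
        V[0][j] = V[0][j - 1] + gap
    for i in range(1, n + 1):              # V(i, 0): delete i times
        V[i][0] = V[i - 1][0] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + delta,   # match / mismatch
                          V[i - 1][j] + gap,         # delete s[i-1]
                          V[i][j - 1] + gap)         # insert t[j-1]
    return V[n][m]

print(needleman_wunsch("ACAATCC", "AGCATGC"))  # 7, the bottom-right cell of the table
```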




                                              130
Mining and Searching Complex Structures                       Chapter 3 Similarity Search on Sequences




              “q-grams” of strings



                               universal  →  2-grams:  un, ni, iv, ve, er, rs, sa, al




              2010-7-31




              q-gram inverted lists


                       String collection:
                          0: rich    1: stick    2: stich    3: stuck    4: static

                       2-gram inverted lists:
                          at → 4              ch → 0, 2           ck → 1, 3
                          ic → 0, 1, 2, 4     ri → 0              st → 1, 2, 3, 4
                          ta → 4              ti → 1, 2, 4        tu → 3
                          uc → 3



              2010-7-31




                                                    131
Mining and Searching Complex Structures                         Chapter 3 Similarity Search on Sequences




                         Searching using inverted lists
            Query: “shtick”, ED(shtick, ?)≤1
               sh ht ti ic ck                                     # of common grams >= 3

               (string collection and 2-gram inverted lists as on the previous slide)

            2010-7-31
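
             A sketch of the whole pipeline on this example collection (function names are ours, and
             grams are treated as sets for simplicity): build the 2-gram inverted lists, merge the lists
             of the query's grams, and keep the ids sharing at least (|s| − q + 1) − k·q grams with
             "shtick", i.e. 5 − 1·2 = 3 here.

```python
from collections import defaultdict

def grams(s: str, q: int):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_inverted_lists(strings, q):
    lists = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(grams(s, q)):
            lists[g].append(sid)
    return lists

def candidates(query, strings, lists, q, k):
    counts = defaultdict(int)                 # how many query grams each id shares
    for g in set(grams(query, q)):
        for sid in lists.get(g, []):
            counts[sid] += 1
    bound = (len(query) - q + 1) - k * q      # count-filter lower bound
    return [strings[sid] for sid, c in counts.items() if c >= bound]

data = ["rich", "stick", "stich", "stuck", "static"]
lists = build_inverted_lists(data, q=2)
print(candidates("shtick", data, lists, q=2, k=1))   # ['stick']
```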




                         2-grams -> 3-grams?
            Query: “shtick”, ED(shtick, ?)≤1
               sht hti tic ick                                     # of common grams >= 1

                3-gram inverted lists for the same collection:
                   ati → 4         ich → 0, 2      ick → 1         ric → 0
                   sta → 4         sti → 1, 2      stu → 3         tat → 4
                   tic → 1, 2, 4   tuc → 3         uck → 3

             2010-7-31




                                                  132
Mining and Searching Complex Structures            Chapter 3 Similarity Search on Sequences




                 Observation 1: dilemma of choosing “q”
               Increasing q causes:
                 longer grams, hence shorter inverted lists
                 a smaller number of common grams between similar strings
              (2-gram inverted lists as on the earlier slide)
              2010-7-31




                          Observation 2: skew distributions of gram
                          frequencies
           DBLP: 276,699 article titles
           Popular 5-grams: ation (>114K times), tions, ystem, catio




              2010-7-31




                                             133
Mining and Searching Complex Structures                      Chapter 3 Similarity Search on Sequences




              VGRAM: Main idea
                    Grams with variable lengths (between qmin and qmax)
                      zebra:      ze (123)
                      corrasion:  co (5213), cor (859), corr (171)   (the numbers are gram frequencies)
                    Advantages
                      Reduce index size ☺
                      Reduce running time ☺
                      Adoptable by many algorithms ☺




              2010-7-31




              Challenges
             Generating variable-length grams?
             Constructing a high-quality gram dictionary?
             Relationship between string similarity and their
              gram-set similarity?
             Adopting VGRAM in existing algorithms?




              2010-7-31




                                                    134
Mining and Searching Complex Structures         Chapter 3 Similarity Search on Sequences




                  Challenge 1: String → variable-length grams?

              Fixed-length 2-grams:
                 universal → un, ni, iv, ve, er, rs, sa, al

              Variable-length grams, using a [2,4]-gram dictionary:
                 dictionary:  ni, ivr, sal, uni, vers
              2010-7-31




                          Representing gram dictionary as
                          a trie


                          ni
                          ivr
                          sal
                          uni
                          vers




              2010-7-31
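
               One way to read the figure: walk the string left to right and, at each position, take the
               longest dictionary gram that starts there (falling back to the plain qmin-gram when nothing
               matches). This is a simplification of VGRAM's actual decomposition, which additionally
               prunes grams subsumed by longer ones, so treat the sketch below as illustrative only; the
               function name and the greedy rule are ours.

```python
def decompose(s, dictionary, qmin=2):
    """Greedy longest-match decomposition of s against a variable-length gram
    dictionary (a set of grams). Illustrative simplification of VGRAM."""
    max_len = max(len(g) for g in dictionary)
    result = []
    for i in range(len(s) - qmin + 1):
        best = s[i:i + qmin]                    # default: the plain qmin-gram
        for l in range(max_len, qmin, -1):      # try longer dictionary grams first
            if s[i:i + l] in dictionary:
                best = s[i:i + l]
                break
        result.append(best)
    return result

dictionary = {"ni", "ivr", "sal", "uni", "vers"}    # the [2,4]-gram dictionary above
print(decompose("universal", dictionary))
# ['uni', 'ni', 'iv', 'vers', 'er', 'rs', 'sal', 'al']
```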




                                          135
Mining and Searching Complex Structures                   Chapter 3 Similarity Search on Sequences




              Challenge 2: Constructing gram
              dictionary
            Step 1: Collecting frequencies of grams with length in [qmin,
               qmax]


                                st     0, 1, 3
                                sti    0, 1
                                stu    3
                                stic    0, 1
                                stuc    3




                                                       Gram trie with frequencies
              2010-7-31
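
               Step 1 amounts to recording, for every gram of length qmin..qmax, which strings contain
               it; the trie in the figure is just a compact way to store these lists. A flat-dictionary
               sketch (function name is ours; the five strings are the earlier example collection, not
               the one behind the figure):

```python
from collections import defaultdict

def gram_frequencies(strings, qmin, qmax):
    """Map each gram with length in [qmin, qmax] to the ids of strings containing it."""
    freq = defaultdict(set)
    for sid, s in enumerate(strings):
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                freq[s[i:i + q]].add(sid)
    return freq

freq = gram_frequencies(["rich", "stick", "stich", "stuck", "static"], qmin=2, qmax=4)
print(sorted(freq["st"]))   # [1, 2, 3, 4]: every string except "rich" contains "st"
```

               Step 2 then keeps or prunes grams based on the frequency threshold T.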




              Step 2: selecting grams
              Pruning trie using a frequency threshold T (e.g., 2)




              2010-7-31




                                                 136
Mining and Searching Complex Structures                 Chapter 3 Similarity Search on Sequences




              Step 2: selecting grams (cont)

                                                        Threshold T = 2




              2010-7-31




              Final gram dictionary




                                          [2,4]-grams

              2010-7-31




                                            137
Mining and Searching Complex Structures              Chapter 3 Similarity Search on Sequences




                 Challenge 3: Edit operation’s effect on grams


                                                       Fixed length: q
                           universal



               k operations could affect k * q grams




              2010-7-31




                   Deletion affects variable-length grams


                 A deletion at position i can only affect grams that overlap the window
                 [i − qmax + 1, i + qmax − 1]; grams entirely outside this window are not affected.




              2010-7-31




                                              138
Mining and Searching Complex Structures                 Chapter 3 Similarity Search on Sequences




              Grams affected by a deletion

                                   Affected?


                               i-qmax+1           i     i+qmax- 1
                                             Deletion
                                                                [2,4]-grams
                                  Deletion                          ni
                                                                    ivr
                           universal
                                                                    sal
                                                                    uni
                            Affected?                               vers
              2010-7-31




                          Grams affected by a deletion (cont)
                                                 Affected?


                               i-qmax+1        i        i+qmax- 1
                                          Deletion




                     Trie of grams
              2010-7-31                                  Trie of reversed grams




                                                 139
Mining and Searching Complex Structures                    Chapter 3 Similarity Search on Sequences




            # of grams affected by each operation


                            Deletion/substitution                  Insertion

                             0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0
                             _u_n_i_v_e_r_s_a_l_




              2010-7-31




              Max # of grams affected by k operations

                              NAG vector of s: <2, 4, 6, 8, 9>
                              (the k-th entry is the maximum number of grams of s that k edit operations can affect)

                           With 2 edit operations, at most 4 grams can be affected

                       Called the NAG vector (# of affected grams)
                       Precomputed and stored with each string




              2010-7-31




                                                    140
Mining and Searching Complex Structures                Chapter 3 Similarity Search on Sequences




              Summary of VGRAM index




              2010-7-31




              Challenge 4: adopting VGRAM
              Easily adoptable by many algorithms

               Basic interfaces:
               String s → its set of variable-length grams
               Strings s1, s2 with ed(s1, s2) <= k → minimum # of grams they must share




              2010-7-31




                                             141
Mining and Searching Complex Structures                          Chapter 3 Similarity Search on Sequences




             Lower bound on # of common grams

                Fixed length q:
                      If ed(s1, s2) <= k, then the # of common grams is at least
                         (|s1| − q + 1) − k · q

               Variable lengths: at least (# of grams of s1) − NAG(s1, k)
              2010-7-31
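
               Both bounds are simple arithmetic; written out as tiny helpers (the function names are
               ours, and the NAG value of 2 for "shtick" with k = 1 is inferred from the next slide's
               lower bound of 1):

```python
def fixed_q_lower_bound(len_s1: int, q: int, k: int) -> int:
    """Minimum number of q-grams two strings must share if ed(s1, s2) <= k."""
    return (len_s1 - q + 1) - k * q

def vgram_lower_bound(num_grams_s1: int, nag_s1_k: int) -> int:
    """Same bound with variable-length grams: #grams(s1) - NAG(s1, k)."""
    return num_grams_s1 - nag_s1_k

print(fixed_q_lower_bound(len("shtick"), q=2, k=1))  # 3, as in the earlier example
print(vgram_lower_bound(3, 2))                       # 1 (3 grams, NAG = 2)
```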




                     Example: algorithm using inverted lists
            Query: “shtick”, ED(shtick, ?)≤1
                VGRAM grams of the query: sh, ht, tick

                2-gram inverted lists (lower bound = 3):
                   …   ck → 1, 3   ic → 0, 1, 2, 4   …   ti → 1, 2, 4   …

                [2,4]-gram inverted lists (lower bound = 1):
                   …   ck → 1, 3   ic → 1, 4   ich → 0, 2   …   tic → 2, 4   tick → 1   …

                String collection: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static

               2010-7-31




                                                      142
Mining and Searching Complex Structures                    Chapter 3 Similarity Search on Sequences




              PartEnum + VGRAM
               PartEnum, fixed q-grams:
                          ed(s1, s2) <= k  ⇒  hamming(grams(s1), grams(s2)) <= k * q

               VGRAM:
                          ed(s1, s2) <= k  ⇒  hamming(VG(s1), VG(s2)) <= NAG(s1, k) + NAG(s2, k)




              2010-7-31




              PartEnum + VGRAM (naïve)

                    (Figure: two string collections R and S to be joined.)

                    Bm(R) = max over r in R of NAG(r, k);   Bm(S) = max over s in S of NAG(s, k)

                    • Both collections use the same gram dictionary.
                    • Use Bm(R) + Bm(S) as the new hamming bound.
              2010-7-31




                                                  143
Mining and Searching Complex Structures                  Chapter 3 Similarity Search on Sequences




             PartEnum + VGRAM (optimization)
             • Group R based on the NAG(r, k) values into R1, R2, R3, with local bounds
               Bm(R1), Bm(R2), Bm(R3);  Bm(S) = max(NAG(s, k))
             • Join(R1, S) using Bm(R1) + Bm(S); similarly Join(R2, S) and Join(R3, S)
             • Tighter local bounds → better signatures generated
             • Grouping S is also possible.
               2010-7-31




              Outline
              • Searching based on a disk based suffix tree
              • Approximate Matching Using Inverted List (Vgrams)
              • Approximate Matching Based on B+ Tree (BED Tree)




              2010-7-31




                                               144
Mining and Searching Complex Structures             Chapter 3 Similarity Search on Sequences




              Approximate String Search
              Information Retrieval
                 Web search query with string “Posgre SQL” instead of
                 “Postgre SQL”
              Data Cleaning
                 “13 Computing Road” is the same as “#13 Comput’ng Rd”?
              Bioinformatics
                 Find out all protein sequences similar to
                 “ACBCEEACCDECAAB”




              2010-7-31


                                                                           71




              Edit Distance
            Edit distance on strings

                          13 Computing Drive
                                    3 deletions
                          13 Computing Dr           Edit distance: 5
                                      1 replacement
                          13 Comput’ng Dr
                                    1 insertion
                          #13 Comput’ng Dr
             Normalized edit distance:
                          NED(s1, s2) = ED(s1, s2) / MaxLength(s1, s2) = 5 / 18
              2010-7-31
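
               Normalized edit distance is just the edit distance scaled by the longer string's length.
               A tiny sketch (the ed argument can be any edit-distance routine, e.g. the one sketched
               earlier in these notes; the function name is ours):

```python
def normalized_ed(s1: str, s2: str, ed) -> float:
    """Normalized edit distance: ED(s1, s2) / MaxLength(s1, s2)."""
    return ed(s1, s2) / max(len(s1), len(s2))

# For the example above: ED = 5 and the longer string has 18 characters,
# so the normalized edit distance is 5 / 18 ≈ 0.278.
```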


                                                                           72




                                              145
Mining and Searching Complex Structures                 Chapter 3 Similarity Search on Sequences




              Existing Solution
            Q-Gram
                                                  Q=3
                                  Postgre
                      ##P #Po Pos ost stg tgr gre re# e##

                                     Posgre
                          ##P #Po Pos osg sgr gre re#           e##

                 Observation: If ED(s1,s2)=d, they agree on at least
                 min(|s1|,|s2|)+Q-1-d*(Q+1) grams




              2010-7-31
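
               The grams above come from padding each string with Q − 1 sentinel characters on both
               ends before cutting Q-grams. A sketch that reproduces them and checks the count bound
               for d = 1 (function name is ours):

```python
def padded_grams(s: str, q: int, pad: str = "#"):
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

g1 = padded_grams("Postgre", 3)   # ##P #Po Pos ost stg tgr gre re# e##
g2 = padded_grams("Posgre", 3)    # ##P #Po Pos osg sgr gre re# e##

common = len(set(g1) & set(g2))                                   # 6 grams in common
bound = min(len("Postgre"), len("Posgre")) + 3 - 1 - 1 * (3 + 1)  # bound for d = 1
print(common, bound, common >= bound)                             # 6 4 True
```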


                                                                               73




              Existing Solution
              Inverted List
                                          Postgre




             ##P           #Po   Pos    osg       sgr     gre         re$   e$$

                                       Posgre

              2010-7-31


                                                                               74




                                                146
Mining and Searching Complex Structures                    Chapter 3 Similarity Search on Sequences




              Limitations
               Inverted List Method
                  Limited queries supported:
                                      Range Query   Join Query   Top-K Query   Top-K Join
                     Edit Distance         Y            Y             N            N
                     Normalized ED         N            N             N            N
                  Uncontrollable memory consumption
                  No concurrency protocol




              2010-7-31


                                                                                     75




              Our Contributions
               Bed-Tree
                  Wide support for different queries and distances:
                                      Range Query   Join Query   Top-K Query   Top-K Join
                     Edit Distance         Y            Y             Y            Y
                     Normalized ED         Y            Y             Y            Y
                  Adjustable buffer size and low I/O cost
                  Highly concurrent
                  Easy to implement
                  Competitive performance




              2010-7-31


                                                                                     76




                                                 147
Mining and Searching Complex Structures                  Chapter 3 Similarity Search on Sequences




              Basic Index Framework
            Bed-Tree Framework (example: query "Posgre", result "Postgre")
               • Index construction follows a standard B+ tree
               • Map all strings to a 1D domain (a string order)
               • Estimate the minimal distance from each B+ tree node to the query and prune nodes
               • Refine the remaining candidates by exact edit distance


              2010-7-31


                                                                                    77




              String Order Properties

              P1: Comparability
                     Given two strings s1 and s2, we know the order of s1 and s2
                     under the specified string order
              P2: Lower Bounding
                    Given an interval [L,U] on the string order, we know a
                    lower bound on edit distance to the query string
                                  Query: Posgre
                                                                Candidates in the
                                                                sub-tree?




              2010-7-31


                                                                                    78




                                               148
Mining and Searching Complex Structures                         Chapter 3 Similarity Search on Sequences




              String Order Properties

              P3: Pairwise Lower Bounding
                    Given two intervals [L,U] and [L’,U’], we know the lower
                    bound of edit distance between s1 from [L,U] and s2 from
                    [L’,U’]
              P4: Length Bounding
                    Given an interval [L,U] on the string order, we know the
                    minimal length of the strings in the interval

                                                                      Potential join results?




              2010-7-31


                                                                                          79




              String Order Properties
              Properties v.s. supported queries and distances


                                  Range Query Join Query          Top-K Query     Top-K Join
                  Edit Distance     P1, P2          P1, P3           P1, P2          P1, P3
                 Normalized ED     P1, P2, P4      P1, P3, P4      P1, P2, P4     P1, P3, P4



                                                      Description
                            P1                       Comparability
                            P2                      Lower Bounding
                            P3                  Pair-wise Lower Bounding
                            P4                      Length Bounding

              2010-7-31


                                                                                          80




                                                   149
Mining and Searching Complex Structures                          Chapter 3 Similarity Search on Sequences




              Dictionary Order
              All strings are ordered alphabetically, satisfying P1, P2 and
                 P3


                          Search: Posgre with ED=1
                               Insertion: Postgre
                                                                   It’s between “pose”
                               pose      powder      sit           and “powder”




              2010-7-31


                                                                                         81




              Dictionary Order
              All strings are ordered alphabetically, satisfying P1, P2 and
                 P3


                          Search: Posgre with ED=1
                                                                      Not pruning
                              pose    powder       sit                anything!


                                                                      Pruning happens
                                 power       put           sad        only when long
                                                                      prefix exists



              2010-7-31


                                                                                         82




                                                    150
Mining and Searching Complex Structures                         Chapter 3 Similarity Search on Sequences




              Gram Counting Order
                  Example: "Jim Gray"
                     • Hash all grams into 4 buckets
                     • Count the grams in each bucket, written in binary:  11  10  01  11

                  2010-7-31




              Gram Counting Order
              Transform the count vector to a bit string with z-order




                                                                           Encode with z-
                                                                           order




                                        Order the strings
                                        with this signature

              2010-7-31
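
               The sketch below only illustrates the flavour of the gram counting order: hash each
               2-gram into a small number of buckets, cap each bucket count to a few bits, and
               interleave the bits (z-order) to obtain the sort key. The bucket count, bit width and
               hash function are placeholder choices of ours, not the Bed-tree's exact encoding.

```python
def gram_count_signature(s: str, q: int = 2, buckets: int = 4, bits: int = 2) -> int:
    """Toy gram-counting sort key: bucketed q-gram counts, bit-interleaved (z-order)."""
    counts = [0] * buckets
    for i in range(len(s) - q + 1):
        gram = s[i:i + q]
        counts[sum(ord(c) for c in gram) % buckets] += 1   # simple stand-in hash
    counts = [min(c, (1 << bits) - 1) for c in counts]     # cap each count to `bits` bits
    key = 0
    for bit in range(bits - 1, -1, -1):                    # most significant bits first
        for c in counts:                                   # one bit per bucket per round
            key = (key << 1) | ((c >> bit) & 1)
    return key

print(format(gram_count_signature("Jim Gray"), "08b"))
```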


                                                                                          84




                                                       151
Mining and Searching Complex Structures                   Chapter 3 Similarity Search on Sequences




              Gram Counting Order
               Lower Bounding                                        Query: "Jim Gary", signature (4, 1, 2, 2)
                     A subtree covering keys "11011011" to "11011101" shares the prefix "11011???";
                     from this prefix a minimal possible edit distance to the query can be derived (here 1).
              2010-7-31


                                                                                 85




              Gram Location Order
             Extension of Gram Counting Order
               Include positional information of the grams
                                  Jim Gray           Grace Hopper

                 Allow better estimation of mismatch grams
                 Harder to encode




              2010-7-31


                                                                                 86




                                                152
Mining and Searching Complex Structures              Chapter 3 Similarity Search on Sequences




              Experiment Settings
              Data




              Five Index Schemes
                 Bed-Tree: BD, BGC, BGL
                 Inverted List: Flamingo, Mismatch
              Default Setting
                 Q=2, Bucket=4, Page Size=4KB


              2010-7-31


                                                                            87




              Empirical Observations
               How good is Bed-Tree?
                 With small thresholds, inverted lists are faster
                 As the threshold increases, Bed-Tree is at least as good




                                            153
Mining and Searching Complex Structures             Chapter 3 Similarity Search on Sequences




              Empirical Observations
              Which string order is better?
               Gram counting order is generally better
               Gram Location order: tradeoff between gram content
               information and position information




              Conclusion
              A new B+ tree index scheme
                 All similarity queries supported
                 Both edit distance and normalized edit distance
                 General transaction and concurrency protocol
                 Competitive efficiency




              2010-7-31


                                                                           90




                                            154
Mining and Searching Complex Structures           Chapter 3 Similarity Search on Sequences




              References
               • Benjarath Phoophakdee, Mohammed J. Zaki: "Genome-scale Disk-based Suffix Tree
                 Indexing". SIGMOD Conference 2007: 833-844.
               • Chen Li, Bin Wang, Xiaochun Yang: "VGRAM: Improving Performance of Approximate
                 Queries on String Collections Using Variable-Length Grams". VLDB 2007.
               • Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Divesh Srivastava:
                 "B^{ed}-Tree: An All-Purpose Tree Index for String Similarity Search on Edit
                 Distance". SIGMOD 2010.




                                          155
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                   Mining and Searching Complex
                             Structures
                           Similarity Search on Trees
                                Anthony K. H. Tung(鄧锦浩)
                                    School of Computing
                               National University of Singapore
                                www.comp.nus.edu.sg/~atung




         Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
         Social Network Link: http://www.renren.com/profile.do?id=313870900




                  Outline

                Importance of Trees
                Distance between Trees
                Fast Edit Distance Approximation for Trees




                                                                                  2




                                           156
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                  Importance of Trees

                Between sequences and graphs
                Equivalent to acyclic graph
                 Represents hierarchical structures
                Examples
                   XML documents
                   Programs
                   RNA structure




                                                                                 3




                  Types of Trees

                Is there a root?
                Are the nodes labeled?
                Are the children of a node ordered?




                                                                                 4




                                          157
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                  Outline

                Importance of Trees
                Distance between Trees
                Fast Edit Distance Approximation for Trees




                                                                                 5




                  Distance Measure

                Many ways to define distance
                Convert to standard types and adopt the distance metric there
                How many operations to transform one tree to another? (Edit
                  distance)
                Inverse of similarity
                  dist(S, T) = maxSim – sim(S,T)
                Relationship between different definitions?




                                                                                 6




                                          158
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                  Operations on Trees

                Relabel




                Delete




                Insert




                                                                                  7




                  Remarks on Edit Distance

             Ordered trees are tractable
             Approach based on dynamic programming
             NP-hard for unordered trees
             Approach is to impose restrictions so that DP can be used




                                                                                  8




                                           159
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                  Edit Script

              Edit script(S, T): sequence of operations to transform S to T
               Example
                  1. Start with S
                  2. Delete c
                  3. Insert c, relabel f → a, relabel e → d




                                                                                   9




                  Edit Distance Mapping
                Edit distance mapping(S, T): alternative representation of edit
                  operations
                   relabel: v → w
                   delete: v → $
                   insert: $ → w
                Mapping corresponding to the script




                                                                                  10




                                            160
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                  Edit Distance for Ordered Trees

            Generalize the problem to forests.
             C(φ, φ) = 0
             C(S, φ) = C(S − v, φ) + cost(v → $)
             C(φ, T) = C(φ, T − w) + cost($ → w)
             C(S, T) = minimum of
                1. C(S − v, T) + cost(v → $)                                            [delete v]
                2. C(S, T − w) + cost($ → w)                                            [insert w]
                3. C(S − tree(v), T − tree(w)) + C(tree(v) − v, tree(w) − w) + cost(v → w)   [relabel v → w]
             where v and w are the rightmost roots of the forests S and T, and tree(v) denotes the subtree rooted at v.




                                                                                       11




                  Illustration of Case 3

             C(S − tree(v), T − tree(w)) + C(tree(v) − v, tree(w) − w) + cost(v → w)   [relabel v → w]

             (Figure: S splits into S − tree(v) and the subtree rooted at v, whose child forest is tree(v) − v;
              T splits into T − tree(w) and the subtree rooted at w, whose child forest is tree(w) − w.)




                                                                                       12
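
                 As an illustration (not from the original slides), the recurrence above can be
                 transcribed almost verbatim into a memoized recursion over forests. This is the
                 straightforward version with many subproblems, not the optimized Zhang-Shasha
                 enumeration; trees are written as (label, (children…)) tuples.

```python
from functools import lru_cache

def cost(a, b):
    """Unit cost; '$' stands for the empty label used by insert/delete."""
    return 0 if a == b else 1

@lru_cache(maxsize=None)
def forest_dist(S, T):
    """Edit distance between ordered forests (tuples of (label, children) trees)."""
    if not S and not T:
        return 0
    if not T:                                     # only deletions remain
        v = S[-1]
        return forest_dist(S[:-1] + v[1], ()) + cost(v[0], "$")
    if not S:                                     # only insertions remain
        w = T[-1]
        return forest_dist((), T[:-1] + w[1]) + cost("$", w[0])
    v, w = S[-1], T[-1]                           # rightmost roots
    return min(
        forest_dist(S[:-1] + v[1], T) + cost(v[0], "$"),         # delete v
        forest_dist(S, T[:-1] + w[1]) + cost("$", w[0]),         # insert w
        forest_dist(S[:-1], T[:-1])                              # relabel v -> w and
        + forest_dist(v[1], w[1]) + cost(v[0], w[0]),            # match their child forests
    )

# f(a(b, c)) vs f(a(c)): one deletion, so the distance is 1
T1 = (("f", (("a", (("b", ()), ("c", ()))),)),)
T2 = (("f", (("a", (("c", ()),)),)),)
print(forest_dist(T1, T2))   # 1
```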




                                                 161
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                  Algorithm Complexity

                Number of subproblems bounded by O(|S|2|T|2)
                 Zhang and Shasha (1989) showed that the number of relevant subproblems is
                   O(|S| |T| · min(depth(S), leaves(S)) · min(depth(T), leaves(T))), with O(|S| |T|) space
                 Further improvements require decomposing the rooted trees into disjoint paths




                                                                                 13




                  Decomposition into Paths

                Concept of heavy and light nodes/edges
                  (Harel and Tarjan, 1984)
                Root is light, child with max size is heavy
                Removal of light edges partitions T into disjoint heavy paths
                Important property: light depth(v) ≤ log|T| + O(1)
                Complexity can be reduced to O(|S|2|T|log|T|)




                                                                                 14




                                           162
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                  Unordered Edit Distance

                NP-hard
                Special cases (in P)
                   T is a sequence
                   Number of leaves in T is logarithmic
                Impose additional constraints
                   Disjoint subtrees map to disjoint subtrees




                                                                                   15




                  Tree Inclusion

                Is there a sequence of deletion operations on S which can
                   transform it to T?
                Special case of edit distance which only allows deletions




                                                                                   16




                                             163
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                  Complexity of Tree Inclusion

                Ordered trees
                   Concept of embeddings (restriction of mappings)
                   O(|S||T|) using the algorithm of
                     Kilpelainen and Mannila
                Unordered trees
                   NP-complete (what did you expect ?)
                   Special cases




                                                                                 17




                  Related Problems on Trees
           Tree Alignment (covered in the survey paper)
            Robinson-Foulds distance for leaf-labeled trees, where each edge corresponds to a
              bipartition of the leaves
           Tree Pattern Matching
           Maximum Agreement Subtree
           Largest Common Subtree
           Smallest Common Supertree
           Many are generalizations of problems on strings




                                                                                 18




                                           164
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                  Summary of Tree Distance

                Edit distance
                   Concept of edit mapping
                   Dynamic programming for ordered trees
                   Constrained edit distance for unordered trees
                Tree inclusion
                   Special case of edit distance
                   Specialized algorithms are more efficient
                   Useful for determining embedded trees




                                                                                  19




                  Outline

                Importance of Trees
                Distance between Trees
                Fast Edit Distance Approximation for Trees




                                                                                  20




                                            165
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                    Similarity Measurement
                   Edit Distance EDist(T1, T2)
                      Edit operations e, each with a cost γ(e): relabel (a → b), delete (b → λ), insert (λ → b)
                      An edit script s_i = (e_i1, e_i2, …, e_ik) transforms T1 into T2 with cost(s_i) = Σ_j γ(e_ij)
                      EDist(T1, T2) = min_i cost(s_i);  with unit costs, EDist(T1, T2) = min k

                 Computational complexity:
                   O(|T1| × |T2| × min(depth(T1), leaves(T1)) × min(depth(T2), leaves(T2)))

              7/31/2010                                                                              21




             Edit Operation Mapping

                      Edit operations mapping
                           One-to-one
                           Preserve sibling order
                           Preserve ancestor order
                              (Figure: an example mapping M(T1, T2), drawn as lines between corresponding nodes of two trees T1 and T2.)


              7/31/2010                                                                              22




                                                        166
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                          Observation

              Edit operations do not change many sibling relationships
                 Example: deleting c from a(b, c(f, g, h, i), d, e) gives a(b, f, g, h, i, d, e);
                 only the sibling pairs around c change: (b, c) → (b, f) and (c, d) → (i, d)

                 A node may have any number of children, but it has at most 2 adjacent siblings

              7/31/2010                                                                                                                            23




                    Binary Tree Representation
                                                                                                                             a
               Binary Tree Representation
                 Left-child, right-sibling transformation
               Normalized Binary Tree
                 (missing children and siblings are padded with ε nodes)


                    (Figure: the normalized binary trees B(T1), B(T2) of two example trees, with numbered nodes, and their binary branch vectors.)

                    Binary branch distance:
                       BBDist(T1, T2) = Σ_{i=1}^{|Γ|} | b1,i − b2,i |   (= 8 for the example)
                    BBDist satisfies the triangular inequality.

              7/31/2010                                                                                                                            24
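
               A sketch of one natural reading of the construction (our reconstruction, not code from
               the paper): each node contributes the triple (its label, its first child's label, its next
               sibling's label), i.e. its binary branch in the left-child/right-sibling view, and BBDist
               is the L1 distance between the resulting count vectors.

```python
from collections import Counter

EPS = "_"   # stands for the empty label ε (missing first child / next sibling)

def binary_branches(tree):
    """Multiset of (label, first-child label, next-sibling label) triples.
    A tree is written as (label, (child subtrees...))."""
    counts = Counter()

    def walk(node, next_sibling_label):
        label, children = node
        first_child = children[0][0] if children else EPS
        counts[(label, first_child, next_sibling_label)] += 1
        for i, child in enumerate(children):
            sib = children[i + 1][0] if i + 1 < len(children) else EPS
            walk(child, sib)

    walk(tree, EPS)
    return counts

def bbdist(t1, t2):
    """L1 distance between the two binary branch vectors."""
    b1, b2 = binary_branches(t1), binary_branches(t2)
    return sum(abs(b1[k] - b2[k]) for k in set(b1) | set(b2))

T1 = ("a", (("b", ()), ("c", ())))    # a(b, c)
T2 = ("a", (("b", ()), ("d", ())))    # a(b, d): one relabel away from T1
print(bbdist(T1, T2))                 # 4, consistent with the <= 4 bound for one relabel
```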




                                                                                         167
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                              One Edit Operation Effect
               (Figure: the binary branches around a node before and after inserting a node v among its children.)
               Key point: each node appears in at most two binary branches, so a single edit operation
               changes only a bounded number of binary branches.
              7/31/2010                                                                                                                                                                      25




              Theorem

              One insertion or deletion changes BBDist by at most 5
              One relabeling changes BBDist by at most 4
              For trees T, T′ with EDist(T, T′) = k = ki + kd + kr:
                 BBDist(T, T′) <= 4·kr + 5·ki + 5·kd <= 5k
              Hence (1/5)·BBDist is a lower bound on the edit distance.




              7/31/2010                                                                                                                                                                      26




                                                                                                                  168
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                               Positional Binary Branch
                    (Figure: two different trees whose binary branch vectors coincide, so the plain BBDist between them is 0.)

                    To distinguish such trees, positional information is attached to each binary branch:
                       Positional binary branch of a node u: PosBiB(T(u)) = (BiB(u, ·, ·), position numbers of u)
                       Example: PosBiB(T1(e)) = (BiB(e, ε, ε), 8, 7)  ≠  PosBiB(T2(e)) = (BiB(e, ε, ε), 6, 3)
                    This gives the Positional Binary Branch Distance.
                   7/31/2010                                                                                                                                                  27




               Computational Complexity

                    D: dataset; |D|: dataset size
                    Vector construction:
                         traverse each data tree once; time and space O( Σ_{i=1}^{|D|} |Ti| )
                    Optimistic bound computation:
                         each binary search takes O(|Ti| + |Tq|), so in total
                            time:  O( Σ_{i=1}^{|D|} (|Ti| + |Tq|) · log(min(|Ti|, |Tq|)) )
                            space: O( Σ_{i=1}^{|D|} (|Ti| + |Tq|) )




                   7/31/2010                                                                                                                                                  28




                                                                                                         169
Mining and Searching Complex Structures                                 Chapter 4 Similarity Search on Trees




                          Generalized Study

              Extend the sliding window to q levels
              The resulting vectors give multi-level binary branch profiles
                 BDist_q(T, T′) <= [4·(q − 1) + 1] · EDist(T, T′)
              (Figure: the q-level window of binary branches around an edited node.)

              7/31/2010                                                                                                                             29




           Query Processing Strategy

              Filter-and-refine framework (a sketch follows below)
                 The lower bound distance filters out most objects
                             The lower bound computation is much cheaper than
                             computing the real distance
                             The lower bound distance is a close approximation of
                             the real distance
                 The remaining objects are validated using the real distance
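              A minimal sketch of this filter-and-refine loop for a range query;
              lower_bound() and edit_distance() are placeholders (assumptions) for the
              binary branch lower bound and the true tree edit distance, and tau is the
              query radius.

                  def range_query(query, database, tau, lower_bound, edit_distance):
                      results = []
                      for tree in database:
                          if lower_bound(query, tree) > tau:
                              continue                       # filtered out, cannot qualify
                          if edit_distance(query, tree) <= tau:
                              results.append(tree)           # refined with the real distance
                      return results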




              7/31/2010                                                                                                                             30




                                                                                      170
Mining and Searching Complex Structures                                                   Chapter 4 Similarity Search on Trees




                          Experimental Settings
                   Compare with histogram methods[KKSS04]
                          Lower bound: feature vector distance (Leaf Distance Height
                          histogram vector, Degree histogram vector, Label histogram
                          vector)
                   Synthetic dataset:
                          Tree size, Fanout, Label, Decay factor
                   Real dataset: DBLP XML documents
                   Performance measure:
                          Percentage of data accessed:
                               (|false positives| + |true positives|) / |dataset| × 100%
                          CPU time consumed
                          Space requirement
              7/31/2010                                                                                                                                                                                                                                                                         31




                          Sensitivity to the Data Properties
              Sensitivity test

              [Figures: Range and KNN results on synthetic data, plotting % of accessed
              data and CPU cost (seconds) for BiBranch, Histogram (Histo), and
              sequential scan (Sequ).
                 N{}N{50,2.0}L8D0.05:  varying mean(fanout) 2 -> 8, with mean(|T|) = 50, size(label) = 8
                 N{4,0.5}N{}L8D0.05:   varying mean(|T|) 25 -> 125, with mean(fanout) = 4, size(label) = 8]
              7/31/2010                                                                                                                                                                                                                                                                         32




                                                                                                                                                                171
Mining and Searching Complex Structures                                                   Chapter 4 Similarity Search on Trees




                                          Sensitivity test (cont.)

              [Figures: Range and KNN results for N{4,0.5}N{50,2.0}L{}D0.05, plotting
              % of accessed data and CPU cost (seconds) versus label number for
              BiBranch, Histogram (Histo), and sequential scan (Sequ);
              size(label): 8 -> 64, mean(|T|) = 50, mean(fanout) = 4]
              7/31/2010                                                                                                                                                                                                                                                                              33




                                      Queries with Different Parameters
                                     DBLP data (avg. distance: 5.031)
                                    Range queries
                                    KNN (k:5-20)
              [Figures: Range queries (range 1 to 10) and KNN queries (k = 5 to 20) on
              DBLP, plotting % of accessed data and CPU cost (seconds) for BiBranch,
              Histogram (Histo), and sequential scan (Sequ)]
              7/31/2010                                                                                                                                                                                                                                                                              34




                                                                                                                                         172
Mining and Searching Complex Structures                                                   Chapter 4 Similarity Search on Trees




              Pruning Power of Different Levels
                  Data distribution according to distances
                           Edit distance
                           Histogram distance
                           Binary branch distance: 2-, 3-, and 4-level
              [Figure: distribution of the DBLP data over distances 1 to 12 under the
              edit distance, the histogram distance, and the 2-, 3-, and 4-level
              binary branch distances]




              7/31/2010                                                                                                           35




              Citations on the Paper

            Surprisingly, the paper attracts citations and questions from software
              engineering! We expect more impact along the software mining
              direction soon.
            DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones.
               L. Jiang, G. Misherghi, Z. Su, S. Glondu. Proceedings of the 29th International
               Conference on Software …, 2007.
               "Detecting code clones has many software engineering applications. Existing
               approaches either do not scale to large code bases or are not robust against minor
               code modifications. In this paper, we present an efficient ..."
            Fast Approximate Matching of Programs for Protecting Libre/Open Source Software
               by Using Spatial ….
               A. J. M. Molina, T. Shinohara. Source Code Analysis and Manipulation (SCAM 2007), 2007.
               "To encourage open source/libre software development, it is desirable to have
               tools that can help to identify open source license violations. This paper
               describes the implementation of a tool that matches open source programs ..."



              7/31/2010                                                                                                           36




                                                                                      173
Mining and Searching Complex Structures                                                   Chapter 4 Similarity Search on Trees




                      References
             • Philip Bille. A survey on tree edit distance and related problems.
               Theoretical Computer Science, Volume 337, Issues 1-3, June 2005.
             • Rui Yang, Panos Kalnis, Anthony K. H. Tung: Similarity Evaluation on
               Tree-structured Data. SIGMOD 2005.

             • Optional References:
             • J.-P. Vert. A tree kernel to analyze phylogenetic profiles.
               Bioinformatics, 2002. Oxford University Press.




              7/31/2010




                                                174
Mining and Searching Complex Structures                     Chapter 5 Graph Similarity Search




                  Searching and Mining Complex
                            Structures
                              Graph Similarity Search
                               Anthony K. H. Tung(鄧锦浩)
                                   School of Computing
                              National University of Singapore
                               www.comp.nus.edu.sg/~atung




          Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
          Social Network Link: http://www.renren.com/profile.do?id=313870900




              Outline

              • Introduction
              • Foundation
              • State of the Art on Graph Matching
                •Exact Graph Matching
                •Error-Tolerant Graph Matching
              • Search Graph Databases
                •Graph Indexing Methods
              • Our Works
                •Star Decomposition
                •Sorted Index For Graph Similarity Search




                                            175
Mining and Searching Complex Structures                                Chapter 5 Graph Similarity Search




              Smart Graphs


                [Figures: example graphs from different domains: chemical compound,
                protein structure, program flow, coil, image, fingerprint, letter,
                and shape]




              Motivation

               • Why graphs?
                 •Graphs are ubiquitous
                 •Graphs are a general model
                 •Graphs are diverse
                 •Graph problems are complex and challenging
              • Why graph search?
                •Manifold application areas
                      •   2D and 3D image analysis
                      •   Video analysis
                      •   Document processing
                      •   Biological and biomedical applications




                                                   176
Mining and Searching Complex Structures                      Chapter 5 Graph Similarity Search




              Graph Search

              • Definition
                 •Given a graph database D and a query graph Q, find all
                 graphs in D that satisfy the user's requirements:
                     •   The same as Q
                     •   Containing Q or contained in Q
                     •   Similar to Q
                     •   Similar to a subgraph of Q
              • Challenge
                •How to efficiently compare two graphs?
                •How to reduce the number of pairwise graph comparisons?




             How to efficiently compare two graphs?
              • The graph matching problem
                •Graph matching is the process of finding a
                correspondence between the vertices and the edges of two
                graphs that satisfies some (more or less stringent)
                constraints ensuring that similar substructures in one graph
                are mapped to similar substructures in the other.




                                                 177
Mining and Searching Complex Structures                     Chapter 5 Graph Similarity Search




              How to reduce the number of pairwise
               graph comparisons?
              • Scalability issue
                •A full database scan
                •Complex graph matching between a pair of graphs

              • Index mechanisms are needed




              Outline

              • Introduction
              • Foundation
              • State of the Art on Graph Matching
                •Exact Graph Matching
                •Error-Tolerant Graph Matching
              • Search Graph Databases
                •Graph Indexing Methods
              • Our Works
                •Star Decomposition
                •Sorted Index For Graph Similarity Search




                                            178
Mining and Searching Complex Structures                         Chapter 5 Graph Similarity Search




              Categories of Matching




              Exact Graph Matching

              • Graph Isomorphism
                •Two graphs G1=(V1,E1) and G2=(V2,E2) are isomorphic if
                there is a bijective function f: V1 → V2 such that for all
                u, v ∈ V1: {u,v} ∈ E1 ↔ {f(u),f(v)} ∈ E2




                              G1                           G2
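               A naive, exponential check of this definition (for intuition only; real
               systems use the algorithms discussed later in this chapter): try every
               bijection f: V1 -> V2 and test that it preserves edges in both directions.

                   from itertools import permutations

                   def are_isomorphic(adj1, adj2):
                       """adj1, adj2: dict vertex -> set of neighbours (undirected graphs)."""
                       v1, v2 = list(adj1), list(adj2)
                       if len(v1) != len(v2):
                           return False
                       for image in permutations(v2):
                           f = dict(zip(v1, image))
                           # {u,v} in E1  <=>  {f(u),f(v)} in E2, for all u, v
                           if all((v in adj1[u]) == (f[v] in adj2[f[u]])
                                  for u in v1 for v in v1):
                               return True
                       return False

                   # toy usage: a path a-b-c versus a path 1-2-3
                   print(are_isomorphic({'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}},
                                        {1: {2}, 2: {1, 3}, 3: {2}}))   # True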




                                             179
Mining and Searching Complex Structures                        Chapter 5 Graph Similarity Search




              Exact Graph Matching

              • Induced Subgraph
                •A subset of the vertices of a graph together with all edges
                whose endpoints are both in this subset



              • Subgraph Isomorphism
                •An isomorphism holds between one of the two graphs and
                an induced subgraph of the other




              Graph Similarity Measure

              • Graph Edit Distance
                •The minimum amount of distortion that is needed to
                transform one graph into another
                •The edit operations ei can be deletions, insertions, and
                substitutions of vertices and edges



               G1




               G2




                                              180
Mining and Searching Complex Structures                        Chapter 5 Graph Similarity Search




               (The following slides in the original deck repeat this definition while
               the accompanying figures, omitted here, step through the edit operations
               transforming G1 into G2.)
              Graph Similarity Measure

              • Graph Edit Distance (GED)
                •Given two attributed graphs G1 = (V1,E1, Σ, l1) and G2 =
                (V2,E2, Σ, l2) , the GED between them is defined as
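                 A standard way to write the definition (the formula itself appears only as
                 an image in the original slide; the notation follows the text below):

                     GED(G_1, G_2) = \min_{(e_1, \ldots, e_k) \in T(G_1, G_2)} \sum_{i=1}^{k} c(e_i)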



                 •where T(G1,G2) denotes the set of edit paths transforming G1
                 into G2, and c denotes the edit cost function assigning a cost
                 c(ei) to each edit operation ei
              • GED provides a general dissimilarity measure for graphs
               • Most work on inexact graph matching focuses on the GED
                 computation problem
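               To make the definition concrete, here is a hedged brute-force sketch under
               uniform unit costs (every vertex/edge insertion, deletion, and relabeling
               costs 1), using the usual mapping-based formulation: enumerate injective
               mappings of G1's vertices to G2's vertices or to "epsilon" (deletion). It is
               exponential and only for tiny graphs; the (labels, edges) representation is
               an assumption, not the slides'.

                   from itertools import permutations

                   def ged_unit(g1, g2):
                       labels1, edges1 = g1
                       labels2, edges2 = g2
                       n1, n2 = len(labels1), len(labels2)
                       targets = list(range(n2)) + [None] * n1       # None = delete the vertex
                       best = float("inf")
                       for assign in set(permutations(targets, n1)):
                           mapped = {i: j for i, j in enumerate(assign) if j is not None}
                           cost = n1 - len(mapped)                   # vertex deletions
                           cost += n2 - len(mapped)                  # vertex insertions
                           cost += sum(1 for i, j in mapped.items()
                                       if labels1[i] != labels2[j])  # vertex relabelings
                           preserved = sum(
                               1 for e in edges1
                               if set(e) <= mapped.keys()
                               and frozenset(mapped[v] for v in e) in edges2)
                           cost += (len(edges1) - preserved) + (len(edges2) - preserved)
                           best = min(best, cost)
                       return best

                   # toy usage: delete vertex "C" and its incident edge -> distance 2
                   g1 = (["A", "B", "C"], {frozenset((0, 1)), frozenset((1, 2))})
                   g2 = (["A", "B"],      {frozenset((0, 1))})
                   print(ged_unit(g1, g2))   # 2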




                                              183
Mining and Searching Complex Structures                     Chapter 5 Graph Similarity Search




              Outline

              • Introduction
              • Foundation
              • State of the Art on Graph Matching
                •Exact Graph Matching
                •Error-Tolerant Graph Matching
              • Search Graph Databases
                •Graph Indexing Methods
              • Our Works
                •Star Decomposition
                •Sorted Index For Graph Similarity Search




              Exact Matching Algorithms

              • Tree search based algorithms
                •Ullmann’s algorithm
                •VF and VF2 algorithm

              • Other algorithms
                •Nauty algorithm




                                            184
Mining and Searching Complex Structures                         Chapter 5 Graph Similarity Search




              Tree Search based Algorithms

              • Basic Idea
                •A partial match (initially empty) is iteratively expanded by
                adding new pairs of matched vertices
                 •The pair is selected using some necessary conditions,
                 usually also some heuristic condition to prune unfruitful
                 search paths
                 •The algorithm ends when it finds a complete matching, or no
                 further vertex pairs may be added (backtracking)
                 •For attributed graphs, the attributes of vertices and edges can
                 be used to constrain the desired matching
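               A minimal sketch of the partial-match expansion described above, here for an
               induced-subgraph match (a hypothetical helper, not any specific published
               algorithm). Graphs are adjacency dictionaries: vertex -> set of neighbours.

                   def extend(mapping, g1, g2):
                       if len(mapping) == len(g1):
                           return dict(mapping)                      # complete matching found
                       u = next(v for v in g1 if v not in mapping)   # next unmatched G1 vertex
                       for w in g2:
                           if w in mapping.values():
                               continue                              # keep the mapping injective
                           # necessary condition: edges/non-edges to matched vertices must agree
                           if all((x in g1[u]) == (mapping[x] in g2[w]) for x in mapping):
                               mapping[u] = w
                               found = extend(mapping, g1, g2)
                               if found is not None:
                                   return found
                               del mapping[u]                        # backtrack
                       return None                                   # no further pair can be added

                   # toy usage: embed a triangle into a 4-vertex graph
                   g1 = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
                   g2 = {'a': {'b', 'c'}, 'b': {'a', 'c', 'd'}, 'c': {'a', 'b'}, 'd': {'b'}}
                   print(extend({}, g1, g2))    # e.g. {0: 'a', 1: 'b', 2: 'c'}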




              The Backtracking Algorithm

              • Depth-First Search (DFS):
                •progresses by expanding the first child node of the search tree
                •goes deeper and deeper until a goal node is found, or until it
                hits a node that has no children
              • Branch and Bound (B&B):
                •BFS (breadth-first search)-like search for an optimal solution
                •Branching splits a set of candidate solutions into two or more
                smaller sets
                •Bounding computes upper and lower bounds used to prune candidate
                sets
              [Figures: small search trees labelled with the DFS and BFS visiting
              orders of their nodes]




                                               185
Mining and Searching Complex Structures                                    Chapter 5 Graph Similarity Search




              Tree Search based Algorithms

              • Ullmann's Algorithm (DFS)
                •A refinement procedure, based on a matrix of possible future
                matched vertex pairs, prunes unfruitful matches
                 •The simple enumeration algorithm finds the isomorphisms between
                 a graph G1 and a subgraph of another graph G2, given their
                 adjacency matrices A1 and A2
                 •A matrix M' with |V1| rows and |V2| columns can be used to
                 permute the rows and columns of A2 to produce a further matrix
                 P = M'(M'A2)^T. If (a1_{i,j} = 1) ⇒ (p_{i,j} = 1) for all i, j,
                 then M' specifies an isomorphism between G1 and a subgraph of G2.




              Tree Search based Algorithms

              • Ullmann's Algorithm
                •Example of a permutation-like matrix M'
                •The elements of M' are 1's and 0's, such that each row contains
                exactly one 1 and each column contains at most one 1

                   M' = [1 0 0 0; 0 0 1 0; 0 1 0 0]
                   A2 = [0 1 0 0; 1 0 1 1; 0 1 0 0; 0 1 0 0]   (adjacency matrix of G2)

                   (M'A2)^T = [0 0 1; 1 1 0; 0 0 1; 0 0 1]

                   P = M'(M'A2)^T = [0 0 1; 0 0 1; 1 1 0]




                                                        186
Mining and Searching Complex Structures                                                                Chapter 5 Graph Similarity Search




              Tree Search based Algorithms

               • Ullmann’s Algorithm
                 •Construct another matrix M0 of the same size as M':

                      m0_{i,j} = 1 if deg(v2_j) >= deg(v1_i), and 0 otherwise

                 •Generate the candidate matrices M' by keeping exactly one 1 in
                 each row of M0 (and at most one 1 in each column)
                 •A subgraph isomorphism has been found if (a1_{i,j} = 1) ⇒ (p_{i,j} = 1)

                   A2 = [0 1 0 0; 1 0 1 1; 0 1 0 0; 0 1 0 0]   (graph G2)
                   A1 = [0 0 1; 0 0 1; 1 1 0]                  (graph G1)
                   M0 = [1 1 1 1; 1 1 1 1; 0 1 0 0]
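               A hedged sketch tying the two slides above together: build M0 from the degree
               condition, enumerate candidate matrices M' by brute force (Ullmann's
               refinement procedure is omitted), and test the condition
               (a1[i,j] = 1) ⇒ (p[i,j] = 1) with P = M'(M'A2)^T. The matrices are the
               example from the slides; the code is an illustration, not the original
               implementation.

                   import numpy as np
                   from itertools import permutations

                   A1 = np.array([[0, 0, 1],
                                  [0, 0, 1],
                                  [1, 1, 0]])              # query graph G1
                   A2 = np.array([[0, 1, 0, 0],
                                  [1, 0, 1, 1],
                                  [0, 1, 0, 0],
                                  [0, 1, 0, 0]])           # data graph G2

                   deg1, deg2 = A1.sum(axis=1), A2.sum(axis=1)
                   M0 = (deg2[None, :] >= deg1[:, None]).astype(int)   # degree-based candidates

                   def candidates(M0):
                       """Yield each M' keeping one 1 per row of M0, with distinct columns."""
                       allowed = [set(np.flatnonzero(row)) for row in M0]
                       for cols in permutations(range(M0.shape[1]), M0.shape[0]):
                           if all(c in ok for c, ok in zip(cols, allowed)):
                               M = np.zeros_like(M0)
                               M[np.arange(M0.shape[0]), list(cols)] = 1
                               yield M

                   for M in candidates(M0):
                       P = M @ (M @ A2).T
                       if np.all(P[A1 == 1] == 1):          # a1[i,j] = 1 implies p[i,j] = 1
                           print("subgraph isomorphism found, M' =\n", M)
                           break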




              Tree Search based Algorithms

              • Ullmann’s Algorithm
                 •An example: starting from M0 = [1 1 1 1; 1 1 1 1; 0 1 0 0], the
                 search tree enumerates the candidate matrices M' row by row.
                 [Figure: search tree of candidate M' matrices and the
                 corresponding partial mappings between G1 and G2]
                 •For the leaf M' = [1 0 0 0; 0 0 1 0; 0 1 0 0],

                      P = M'(M'A2)^T = [0 0 1; 0 0 1; 1 1 0],

                 which, compared with A1 = [0 0 1; 0 0 1; 1 1 0], satisfies
                 (a1_{i,j} = 1) ⇒ (p_{i,j} = 1), so a subgraph isomorphism is found.




                                                                         187
Mining and Searching Complex Structures                       Chapter 5 Graph Similarity Search




              Tree Search based Algorithms

              • Ullmann's Algorithm
                •One of the most widely used algorithms
              • VF and VF2
                •VF defines a heuristic based on the analysis of vertices
                adjacent to the ones already considered in the partial mapping
                •VF2 reduces the memory requirement from O(n^2) to O(n)
              • Other methods: Nauty Algorithm
                •Constructs the automorphism group of each of the input
                graphs and derives a canonical labeling. The isomorphism
                can be checked by verifying the equality of the adjacency
                matrices




              Exact Graph Matching

              • Summary
                 •The matching problems are all NP-complete except for graph
                 isomorphism, which is in NP but has neither been proven
                 NP-complete nor shown to be solvable in polynomial time.
                 •Exact isomorphism is very seldom used. Subgraph
                 isomorphism can be effectively used in many contexts.
                 •Exact graph matching has exponential time complexity in
                 the worst case.
                  •Ullmann's algorithm, the VF2 algorithm, and the Nauty algorithm
                  are the most widely used. Most variants adopt additional
                  conditions to prune unfruitful partial matchings.




                                             188
Mining and Searching Complex Structures                         Chapter 5 Graph Similarity Search




              Error-Tolerant Graph Matching

              • GED Computation
                •Optimal algorithms
                    • Exact GED computation requires isomorphism testing
                    • Tree search based algorithms (A* based algorithms)
                 •Suboptimal algorithms
                    • Heuristic algorithms
                    • Formulated as a BLP problem




              A* Algorithm
            • A tree search based algorithm
               •Similar to isomorphism testing
               •Unlike isomorphism testing, a vertex of the source graph can
               potentially be mapped to any vertex of the target graph

               •The search tree is constructed dynamically by creating successor
               vertices linked by edges to the current vertex
               •A heuristic function is usually used to determine which vertex
               to expand next




                                               189
Mining and Searching Complex Structures                         Chapter 5 Graph Similarity Search




              Exact GED Computation
              • Summary
                •The complexity is exponential in the number of vertices of
                the involved graphs.
                 •For graphs with unique vertex labels the complexity is linear.
                 •Exact graph edit distance is feasible for small graphs only.
                 •Several suboptimal methods have been proposed to speed up
                 the computation and make GED applicable to large graphs.




              Bipartite Matching for GED
              • A Heuristic Algorithm
                 •A new suboptimal procedure for GED computation based on the
                 Hungarian algorithm (i.e., Munkres' algorithm)
                  •The Hungarian algorithm is used as a tree search heuristic
                  •Much faster than the exact computation and other suboptimal
                  methods
                  •Applicable to larger graphs




                                              190
Mining and Searching Complex Structures                        Chapter 5 Graph Similarity Search




              Bipartite Matching for GED
              • Assignment Problem
                •Find an optimal assignment of n elements in a set S1 =
                {u1, …, un} to n elements in a set S2 = {v1, …, vn}
                •Let cij be the costs of the assignment (ui → vj)
                •The optimal assignment is a permutation P = (p1, …, pn) of
                the integers 1, …, n that minimizes
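                 (The objective appears only as an image on the slide; in standard
                 assignment-problem notation it is:)

                     \min_{P = (p_1, \ldots, p_n)} \sum_{i=1}^{n} c_{i, p_i}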

                                [Figure: bipartite assignment between S1 and S2
                                with edge costs c11, c12, c13, ...]




              Bipartite Matching for GED

              • Assignment Problem
                 •Given the n × n matrix Mc = (cij) of assignment costs
                 •This problem can be formulated as finding a set of n independent
                 elements of Mc (no two in the same row or column) with minimum sum
                 [Figure: example cost matrix with a minimum-cost set of
                 independent entries highlighted]
                 •The Hungarian algorithm finds the minimum cost assignment in
                 O(n^3) time.




                                            191
Mining and Searching Complex Structures                      Chapter 5 Graph Similarity Search




              Bipartite Matching for GED

              • Main Idea
                •Construct a vertex cost matrix Mcv and an edge cost matrix
                Mce
                •For each open vertex v in the search tree, run Hungarian
                algorithm on Mcv and Mce
                •The accumulated minimum cost of both assignments serves
                as a lower bound for the future costs to reach a leaf node
                 •h(P) = Hungarian(Mcv) + Hungarian(Mce) is the tree search
                 heuristic
                •Returns a suboptimal solution as an upper bound of GED
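               A hedged sketch of the vertex-assignment step only (not the full tree search
               described above), using SciPy's Hungarian-style solver. The block layout of
               the cost matrix and the unit costs are an illustrative choice, not the
               authors' cost model.

                   import numpy as np
                   from scipy.optimize import linear_sum_assignment   # Hungarian-style solver

                   def vertex_assignment_cost(labels1, labels2, c_sub=1.0, c_del=1.0, c_ins=1.0):
                       # (n1+n2) x (n1+n2) cost matrix: substitutions in the top-left block,
                       # deletion/insertion costs on the diagonals of the "epsilon" blocks.
                       n1, n2 = len(labels1), len(labels2)
                       INF = 1e9
                       C = np.full((n1 + n2, n1 + n2), 0.0)
                       for i in range(n1):
                           for j in range(n2):
                               C[i, j] = 0.0 if labels1[i] == labels2[j] else c_sub
                       C[:n1, n2:] = INF
                       np.fill_diagonal(C[:n1, n2:], c_del)      # u_i -> epsilon (deletion)
                       C[n1:, :n2] = INF
                       np.fill_diagonal(C[n1:, :n2], c_ins)      # epsilon -> v_j (insertion)
                       rows, cols = linear_sum_assignment(C)     # minimum-cost assignment
                       return C[rows, cols].sum()

                   print(vertex_assignment_cost(list("ABC"), list("ABD")))   # 1.0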




              Suboptimal Algorithms

              • Binary Linear Programming (BLP)
                •Use the adjacency matrix representation to formulate a BLP
                •Compute GED between G0 and G1




                 •Edit grid




                                            192
Mining and Searching Complex Structures               Chapter 5 Graph Similarity Search




              Binary Linear Programming

              • Isomorphisms of G0 on the edit grid




              • State vectors




              Binary Linear Programming

              • Definition:




              • Objective Function:




                                          193
Mining and Searching Complex Structures                   Chapter 5 Graph Similarity Search




              Binary Linear Programming

              • Lower Bound: linear program (O(n^7))




              • Upper Bound: assignment problem (O(n^3))




              Summary

               • The complexity of exact GED computation is exponential and
                 unacceptable in practice.

               •  Suboptimal methods solve the graph matching problem by quickly
                  returning a suboptimal solution and can be applied to larger
                  graphs.

              • An important application of the graph matching problem
                is searching a graph database.




                                           194
Mining and Searching Complex Structures                           Chapter 5 Graph Similarity Search




              Outline

              • Introduction
              • Foundation
              • State of the Art on Graph Matching
                •Exact Graph Matching
                •Error-Tolerant Graph Matching
              • Search Graph Databases
                •Graph Indexing Methods
              • Our Works
                •Star Decomposition
                •Sorted Index For Graph Similarity Search




              Graph Search Problem

              • Query a graph database
                •Given a graph database D and a query graph Q, find all
                graphs in D that satisfy the user's requirements.
                    •   Full graph search (exact match of the whole graph)
                    •   Subgraph search (partial match or containment search)
                    •   Similarity full graph search (based on GED)
                    •   Similarity subgraph search (based on GED)








              Scalability Issue

              • On-line searching algorithm
                [Figure: 100,000 database graphs → checking (subgraph
                isomorphism testing) → answer]
              • A full sequential scan incurs
                •I/O costs
                •Subgraph isomorphism testing (GED computation) costs
              • An indexing mechanism is needed




              Indexing Graphs

              • Indexing is crucial
                [Figure: without an index, all 100,000 graphs go through
                checking; with an index, filtering first reduces the 100,000
                graphs to about 100 candidates, which are then checked to
                produce the answer]








              Indexing Strategy

              • Filter-and-refine framework based on features

                 Step 1. Index Construction
                    Enumerate smaller units (features) in the database
                    graphs and build an index between units and graphs
                 Step 2. Query Processing
                    Enumerate smaller units in the query graph
                    Use the index to first filter out non-candidates
                    Verify the remaining candidates by exact checking
                    (a sketch of the framework follows below)
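
              The framework can be sketched in a few lines of Python;
              extract_features and verify are placeholders for whatever
              feature enumeration (paths, trees, subgraphs) and exact test a
              concrete system uses, and the containment-style filtering shown
              here is only one possible instantiation:

                from collections import defaultdict

                def build_index(database, extract_features):
                    """Map each feature to the ids of database graphs containing it."""
                    index = defaultdict(set)
                    for gid, graph in enumerate(database):
                        for f in extract_features(graph):
                            index[f].add(gid)
                    return index

                def query(index, database, q, extract_features, verify):
                    """Filter: candidates must contain every query feature. Refine: exact check."""
                    candidates = set(range(len(database)))
                    for f in extract_features(q):
                        candidates &= index.get(f, set())
                    return [gid for gid in candidates if verify(q, database[gid])]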




              Indexing Strategy

              • Feature-based Indexing methods
                •Break the database graphs into smaller units like paths, trees,
                and subgraphs, and use them as filtering features
                 •Build inverted index between the smaller units and the
                 database graphs
                 •Filter graphs based on the number of smaller units or their
                 locality information








              Feature-based Indexing Systems

                System      Smaller units   Query type
                GraphGrep   path            Containment search
                SING        path            Containment search
                gIndex      graph           Containment search + edge relaxation
                FGIndex     graph           Containment search
                TREE∆       tree + graph    Containment search
                TreePi      tree            Containment search
                κ-AT        tree            Full similarity search
                CTree       -               Containment search + edge relaxation




              Path-based Algorithms




               [http://ews.uiuc.edu/~xyan/tutorial/kdd06_graph.htm]








              Path-based Algorithms




               [http://ews.uiuc.edu/~xyan/tutorial/kdd06_graph.htm]




              Path-based Algorithms: problem




               [http://ews.uiuc.edu/~xyan/tutorial/kdd06_graph.htm]








              Feature-based Methods: limitation

              • Problem:
                •For similarity search, filtering is done by inferring the edit
                distance bound through the smaller units that exactly match
                the query structure
                    • A rough bound
                    • Not effective for large graphs (because features that may be
                    rare in small graphs are likely to be found in enormous graphs
                    just by chance)




              Outline

              • Introduction
              • Foundation
              • State of the Art on Graph Matching
                •Exact Graph Matching
                •Error-Tolerant Graph Matching
              • Search Graph Databases
                •Graph Indexing Methods
              • Our Works
                •Star Decomposition
                •Sorted Index For Graph Similarity Search








              Graph Similarity Search Problem

              • Definition
                 •Given a graph database D and a query structure Q, similarity search is
                 to find all the graphs in D that are similar to Q based on GED.


              • Two challenges in the filter-and-refine framework:
                •How to efficiently compute more effective edit distance
                bounds between two graphs for filtering?
                •How to reduce the number of pairwise graph dissimilarity
                computations to speed up the graph search?




              Our Solutions

              • Work 1: Star decomposition
                •Break each graph into a multiset of stars
                •Propose new effective and efficient lower and upper GED
                bounds through finding a mapping between the star sets of
                two graphs using Hungarian algorithm

              • Work 2: Sorted index for graph similarity search
                •Propose a novel indexing and query processing framework
                •Deploy a filtering strategy based on TA and CA methods to
                reduce the number of pairwise graph dissimilarity
                computations








              Outline

              • Introduction
              • Foundation
              • State of the Art on Graph Matching
                •Exact Graph Matching
                •Error-Tolerant Graph Matching
              • Search Graph Databases
                •Graph Indexing Methods
              • Our Works
                •Star Decomposition
                •Sorted Index For Graph Similarity Search




                  Comparing Stars: On
                Approximating Graph Edit
                       Distance

                                               Zhiping Zeng
                                          Anthony K.H. Tung
                                             Jianyong Wang
                                               Jianhua Feng
                                                 Lizhu Zhou








              Star Decomposition

              • Star structure
                •A star structure s is an attributed, single-level, rooted tree
                which can be represented by a 3-tuple s=(r, L, l), where r is
                the root vertex, L is the set of leaves and l is a labeling
                function.
              • Star representation for graph
                •A graph can be broken into a multiset of star structures



                [Figure: graph G1 and the multiset of star structures obtained
                by taking each vertex as a root with its neighbours as leaves]
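
              A minimal sketch of star decomposition in Python, assuming a
              graph is given as a label map plus an adjacency list; the helper
              name and the example graph (meant to loosely reconstruct the G1
              above) are illustrative, not taken from the paper:

                # A graph is given by `labels` (vertex id -> label) and
                # `adj` (vertex id -> set of neighbour ids).
                # Each star is represented as (root label, sorted tuple of leaf labels).

                def decompose_into_stars(labels, adj):
                    stars = []
                    for v, nbrs in adj.items():
                        leaf_labels = tuple(sorted(labels[u] for u in nbrs))
                        stars.append((labels[v], leaf_labels))
                    return stars   # a multiset: one star per vertex

                # Illustrative graph with labels a, b, c, c, d
                labels = {1: 'a', 2: 'c', 3: 'b', 4: 'c', 5: 'd'}
                adj = {1: {2}, 2: {1, 3, 4}, 3: {2, 4}, 4: {2, 3, 5}, 5: {4}}
                print(decompose_into_stars(labels, adj))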




              Star Decomposition

              • Star edit distance
                •Given two star structures s1 and s2,
                •      λ(s1, s2) = T(r1, r2) + d(L1, L2)
                •where T(r1, r2) = 0 if l(r1) = l(r2); otherwise T(r1, r2) = 1
                •      d(L1, L2) = ||L1| − |L2|| + M(L1, L2)
                •      M(L1, L2) = max{|ΨL1|, |ΨL2|} − |ΨL1 ∩ ΨL2|
                •ΨL denotes the multiset of labels of the leaves in L

                Example: given s1 = (a; b, c, c) and s2 = (d; c, c),
                T(r1, r2) = 1, as l(a) ≠ l(d);
                d(L1, L2) = |3 − 2| + (3 − 2) = 2;
                λ(s1, s2) = 1 + 2 = 3.
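
              The definitions above transcribe directly into Python; this
              sketch assumes a star is a pair (root label, tuple of leaf
              labels) and uses Counter for the label multisets Ψ:

                from collections import Counter

                def star_edit_distance(s1, s2):
                    """lambda(s1, s2) = T(r1, r2) + d(L1, L2), per the definitions above."""
                    (r1, leaves1), (r2, leaves2) = s1, s2
                    T = 0 if r1 == r2 else 1
                    psi1, psi2 = Counter(leaves1), Counter(leaves2)
                    inter = sum((psi1 & psi2).values())          # |Psi_L1 intersect Psi_L2| (multiset)
                    M = max(len(leaves1), len(leaves2)) - inter  # M(L1, L2)
                    d = abs(len(leaves1) - len(leaves2)) + M     # d(L1, L2)
                    return T + d

                # The slide's example: s1 = (a; b, c, c), s2 = (d; c, c)  ->  1 + 2 = 3
                print(star_edit_distance(('a', ('b', 'c', 'c')), ('d', ('c', 'c'))))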








              Star Decomposition

              • Mapping distance
                •Given two multisets of star structures S(G1) and S(G2) from
                two graphs G1 and G2 with the same cardinality, and assume
                P: S(G1) → S(G2) is a bijection. The mapping distance
                between G1 and G2 is

                      μ(G1, G2) = min over bijections P of Σ s∈S(G1) λ(s, P(s))

                •This problem can be formulated as the assignment problem
                (sketched below). Given a distance cost matrix between the two
                star multisets, the mapping distance can be computed using the
                Hungarian algorithm.
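
              A sketch of the assignment formulation using scipy's Hungarian
              solver, reusing the star_edit_distance helper from the previous
              sketch and assuming the two star multisets already have the same
              cardinality:

                import numpy as np
                from scipy.optimize import linear_sum_assignment

                def mapping_distance(stars1, stars2):
                    """mu(G1, G2): minimum-cost bijection between two equal-size star multisets."""
                    assert len(stars1) == len(stars2), "the definition assumes equal cardinality"
                    cost = np.array([[star_edit_distance(s1, s2) for s2 in stars2]
                                     for s1 in stars1])
                    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm, O(n^3)
                    return cost[rows, cols].sum()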




              A Simple Example








              Bounds of GED

              • Lower Bound
                •Let G1 and G2 be two graphs, and let δ(G) denote the maximum
                vertex degree of G. Then the mapping distance μ(G1, G2)
                satisfies
                      μ(G1, G2) ≤ max{4, min{δ(G1), δ(G2)} + 1} · λ(G1, G2)

              • Based on the above Lemma, μ provides a lower bound Lm of λ,
                i.e.,
                      Lm(G1, G2) = μ(G1, G2) / max{4, min{δ(G1), δ(G2)} + 1} ≤ λ(G1, G2)

               Constructing the cost matrix takes Θ(n³), and running
               the Hungarian algorithm takes O(n³).
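
              Assuming δ(G) is the maximum vertex degree (my reading of the
              definition above), the resulting filtering lower bound is a
              one-liner:

                def ged_lower_bound(mu, max_deg_g1, max_deg_g2):
                    """L_m(G1, G2) = mu(G1, G2) / max{4, min{delta(G1), delta(G2)} + 1}."""
                    return mu / max(4, min(max_deg_g1, max_deg_g2) + 1)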




              Bounds of GED

              • Upper bound
                •The first upper bound τ comes naturally from the
                computation of μ
                •The assignment output by the Hungarian algorithm induces a
                mapping P' from V(G1) to V(G2)
                •Recall that, as in the BLP method, the exact GED is the
                minimum edit cost over all such mappings,
                      λ(G1, G2) = min over mappings P of C(G1, G2, P)

                •Therefore, τ(G1, G2) = C(G1, G2, P') is a natural upper bound

            The mapping P' might not be optimal, so τ(G1, G2) ≥ λ(G1, G2).
            C(G1, G2, P') is evaluated in Θ(n²) time, so τ can be computed in
            Θ(n³) time overall.
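
              A sketch of evaluating C(G1, G2, P') for a given vertex mapping,
              under simplifying assumptions (unit costs, unlabeled edges, and
              both graphs padded to equal size with dummy ε vertices so that
              P' is a bijection); none of the names come from the paper:

                def induced_edit_cost(labels1, adj1, labels2, adj2, P):
                    """Upper bound tau: edit cost of the specific vertex mapping P
                    (unit costs, unlabeled edges; vertex ids are assumed comparable)."""
                    # vertex relabelling / insertion / deletion costs
                    cost = sum(1 for v, w in P.items() if labels1[v] != labels2[w])
                    # edge edits: an edge of G1 whose image is not an edge of G2 must be
                    # deleted, and an edge of G2 that is not covered must be inserted
                    mapped = {frozenset((P[u], P[v])) for u in adj1 for v in adj1[u] if u < v}
                    edges2 = {frozenset((u, v)) for u in adj2 for v in adj2[u] if u < v}
                    cost += len(mapped - edges2) + len(edges2 - mapped)
                    return cost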








              Bounds of GED

              • Refined upper bound ρ: main idea
                •Given any two vertices v1 and v2 in G1 and their
                corresponding images f(v1) and f(v2) in G2 (where f is the
                mapping function corresponding to P'), we swap f(v1) and
                f(v2) if this reduces the edit distance.

                [Figure: example of swapping the images of two vertices
                between G1 and G2; ε denotes an inserted dummy vertex]

                The new mappings obtained might lead to better or worse
                bounds. Refining to get a better bound takes O(n⁶).




              Filtering Strategy

              • Integrating all the GED bounds into a filter-and-refine
                framework
                •Filtering features: Lm ≤ λ ≤ ρ ≤ τ.
                •Filtering order: bounds with lower computational complexity
                are applied first.








              Full Graph Similarity Search

              • Problem
                •Given a graph database D and a query structure Q, find all
                the graphs Gi in D with λ(Q, Gi) ≤ d (d is a threshold).

              • AppFULL algorithm:
                •if Lm(Q, Gi) > d, Gi can be safely filtered;
                •if τ(Q, Gi) ≤ d, Gi can be reported as a result directly;
                •if ρ(Q, Gi) ≤ d, Gi can be reported as a result directly;
                •otherwise, λ(Q, Gi) must be computed.
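
              The cascade above can be sketched as follows; lower_bound,
              upper, refined_upper and exact_ged stand for Lm, τ, ρ and λ
              respectively and are passed in as functions (illustrative only):

                def app_full(query, database, d, lower_bound, upper, refined_upper, exact_ged):
                    """Filter-and-refine for full graph similarity search with threshold d."""
                    answers = []
                    for g in database:
                        if lower_bound(query, g) > d:      # Lm > d: safely filtered
                            continue
                        if upper(query, g) <= d:           # tau <= d: answer without exact GED
                            answers.append(g)
                            continue
                        if refined_upper(query, g) <= d:   # rho <= d: answer without exact GED
                            answers.append(g)
                            continue
                        if exact_ged(query, g) <= d:       # otherwise compute lambda exactly
                            answers.append(g)
                    return answers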




              Subgraph Exact Search

              • Lemma
                •Given two graphs G1 and G2, if no vertex relabelling is
                allowed in the edit operations, then μ'(G1, G2) ≤ 4 · λ'(G1, G2),
                where μ' and λ' are computed without vertex relabelling.
                •(This Lemma can be used in subgraph search, because if a
                graph is subgraph-isomorphic to another graph, no vertex
                relabelling happens.)
              • AppSUB algorithm:
                •Filtering based on the lower bound μ'(Q, Gi)/4 ≤ λ'(Q, Gi).








              Experimental Results


              • Compare with the exact algorithm




              1,000 graphs were generated (D = 1k, T = 10, V = 4).
              Randomly select 10 seed graphs to form D; a seed has 10 vertices.
              6 query groups. Each group has 10 graphs. Graphs in the same
                 group have the same number of vertices.




              Experimental Results

              • Compare with the BLP method








              Experimental Results

              • Scalability over real datasets




              Experimental Results

              • Scalability over synthetic datasets








              Experimental Results

              • Performance of AppFULL




              Experimental Results

              • Performance of AppSUB








              Outline

              • Introduction
              • Foundation
              • State of the Art on Graph Matching
                •Exact Graph Matching
                •Error-Tolerant Graph Matching
              • Search Graph Databases
                •Graph Indexing Methods
              • Our Works
                •Star Decomposition
                •Sorted Index For Graph Similarity Search




                    SEGOS: SEarch similar
                     Graphs On Star index

                                                Xiaoli Wang
                                               Xiaofeng Ding
                                          Anthony K.H. Tung
                                              Shanshan Ying
                                                     Hai Jin








              Our Solutions

              • Work 1: Scalability issue
                •A full database scan is required
                •An indexing mechanism is needed
              • Existing indexing methods: filtering power
                •Rough bounds with poor filtering power

              • Work 2: Sorted index for graph similarity search
                •Propose a novel indexing and query processing framework
                •Deploy a filtering strategy based on the TA and CA methods
                •All existing lower and upper GED bounds can be directly
                integrated into our filtering framework




              TA Method on the Top-k Query

              • The database model used in TA
                •N objects, M attributes (A1, …, AM), plus one score-sorted
                list per attribute:

                 Object ID   A1     A2     |   Sorted L1    Sorted L2
                 a           0.9    0.85   |   (a, 0.9)     (d, 0.9)
                 b           0.8    0.7    |   (b, 0.8)     (a, 0.85)
                 c           0.72   0.2    |   (c, 0.72)    (b, 0.7)
                 d           0.6    0.9    |   ...          ...
                 ...         ...    ...    |   (d, 0.6)     (c, 0.2)








              TA method on the top-k query

              • A simple query
                •Find the top-2 objects on the ‘query’ of ‘A1&A2 ’
                •This query requires the TA method to combine the scores of
                A1 and A2 by an aggregation function such as

                                                               sum(A1, A2)



                  Aggregation function:
                  function that gives objects an overall score based on attribute
                  scores
                  examples: sum, min functions
                  Monotonicity!




              Monotonicity in TA (Halting Condition)

              • Main idea
                •How do we know that the scores of seen objects are higher
                than those of unseen objects?
                •Predict the maximum possible score of unseen objects:


                                          L1          L2

                                    a: 0.9       d: 0.9
                 Seen
                                    b: 0.8      a: 0.85
                                    c: 0.72      b: 0.7        ω = sum(0.72, 0.7) =
                                        .           .          1.42
                                        .        f: 0.6
                                                    .
                                        .
                                    f: 0.65         .
            Possibly unseen             .           .                  Threshold value
                                     d: 0.6      c: 0.2








              A Top-2 Query Example
              • Given 2 sorted lists for attributes A1 and A2,




                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                    ID     A1      A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)

                     (c, 0.72)    (b, 0.7)

                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)




              A Top-2 Query Example
              • Step 1
                •Parallel sorted access attributes from every sorted list




                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                    ID     A1      A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                     a    0.9
                     (c, 0.72)    (b, 0.7)
                                                     d             0.9      1.5
                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)








              A Top-2 Query Example
              • Step 1
                •Sorted access attributes from every sorted list
                •For each object seen:
                    • get all scores by random access
                    • determine sum(A1,A2)
                    • amongst 2 highest seen? keep in buffer
                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                     ID        A1     A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                      a    0.9
                     (c, 0.72)    (b, 0.7)
                                                      d               0.9
                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)




              A Top-2 Query Example
              • Step 1
                •Sorted access attributes from every sorted list
                •For each object seen:
                    • get all scores by random access
                    • determine sum(A1,A2)
                    • amongst 2 highest seen? keep in buffer
                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                     ID        A1     A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                      a    0.9       0.85
                     (c, 0.72)    (b, 0.7)
                                                      d               0.9
                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)








              A Top-2 Query Example
              • Step 1
                •Sorted access attributes from every sorted list
                •For each object seen:
                    • get all scores by random access
                    • determine sum(A1,A2)
                    • amongst 2 highest seen? keep in buffer
                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                     ID        A1     A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                      a    0.9       0.85      1.75
                     (c, 0.72)    (b, 0.7)
                                                      d               0.9
                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)




              A Top-2 Query Example
              • Step 1
                •Sorted access attributes from every sorted list
                •For each object seen:
                    • get all scores by random access
                    • determine sum(A1,A2)
                    • amongst 2 highest seen? keep in buffer
                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                     ID        A1     A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                      a    0.9       0.85      1.75
                     (c, 0.72)    (b, 0.7)
                                                      d    0.6        0.9      1.5
                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)








              A Top-2 Query Example
              • Step 2
                •Determine threshold value based on objects currently seen
                under sorted access. ω = sum(L1, L2)



                        L1         L2

                     (a, 0.9)    (d, 0.9)
                                                   ID   A1      A2    sum(A1,A2)
                     (b, 0.8)    (a, 0.85)
                                                   a    0.9    0.85      1.75
                     (c, 0.72)   (b, 0.7)
                                                   d    0.6     0.9      1.5
                         .           .
                         .           .
                         .           .
                         .           .

                     (d, 0.6)    (c, 0.2)




              A Top-2 Query Example
              • Step 2
                •Determine threshold value based on objects currently seen
                under sorted access. ω = sum(L1, L2)



                        L1         L2

                     (a, 0.9)    (d, 0.9)
                                                   ID   A1      A2    sum(A1,A2)
                     (b, 0.8)    (a, 0.85)
                                                   a    0.9    0.85      1.75
                     (c, 0.72)   (b, 0.7)
                                                   d    0.6     0.9      1.5
                         .           .
                         .           .
                         .           .
                         .           .

                     (d, 0.6)    (c, 0.2)          ω = sum(0.9, 0.9) = 1.8








              A Top-2 Query Example
              • Step 2
                •Determine threshold value based on objects currently seen
                under sorted access. ω = sum(L1, L2)
                •2 objects with overall score ≥ threshold value ω? Stop
                •else go to next entry position in sorted list and go to step 1

                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                     ID     A1      A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                     a     0.9     0.85      1.75
                     (c, 0.72)    (b, 0.7)
                                                     d     0.6     0.9       1.5
                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)           ω = sum(0.9, 0.9) = 1.8




              A Top-2 Query Example
              • Step 2
                •Determine threshold value based on objects currently seen
                under sorted access. ω = sum(L1, L2)
                •2 objects with overall score ≥ threshold value ω? Stop
                •else go to next entry position in sorted list and go to step 1

                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                     ID     A1      A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                     a     0.9     0.85      1.75
                     (c, 0.72)    (b, 0.7)
                                                     d     0.6     0.9       1.5
                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)








              A Top-2 Query Example
              • Step 2
                •Determine threshold value based on objects currently seen
                under sorted access. ω = sum(L1, L2)
                •2 objects with overall score ≥ threshold value ω? Stop
                •else go to next entry position in sorted list and go to step 1

                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                     ID     A1      A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                     a     0.9     0.85      1.75
                     (c, 0.72)    (b, 0.7)
                                                     d     0.6     0.9       1.5
                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)




              A Top-2 Query Example
              • Step 1 (Again)
                •Sorted access attributes from every sorted list




                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                     ID     A1      A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                     a     0.9     0.85      1.75
                     (c, 0.72)    (b, 0.7)
                                                     d     0.6     0.9       1.5
                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)








              A Top-2 Query Example
              • Step 1 (Again)
                •Sorted access attributes from every sorted list
                •For each object seen:
                    • get all scores by random access
                    • determine sum(A1,A2)
                    • amongst 2 highest seen? keep in buffer
                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                     ID        A1      A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                      a    0.9        0.85      1.75
                     (c, 0.72)    (b, 0.7)
                                                      d    0.6         0.9      1.5
                         .            .
                         .            .               b        0.8     0.7      1.5
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)




              A Top-2 Query Example
              • Step 1 (Again)
                •Sorted access attributes from every sorted list
                •For each object seen:
                    • get all scores by random access
                    • determine sum(A1,A2)
                    • amongst 2 highest seen? keep in buffer
                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                     ID        A1      A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                      a    0.9        0.85      1.75
                     (c, 0.72)    (b, 0.7)
                                                      d    0.6         0.9      1.5
                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)








              A Top-2 Query Example
              • Step 2 (Again)
                •Determine threshold value based on objects currently seen
                under sorted access. ω = sum(L1, L2)
                •2 objects with overall score ≥ threshold value ω? Stop
                •else go to next entry position in sorted list and go to step 1

                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                     ID     A1      A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                     a     0.9     0.85      1.75
                     (c, 0.72)    (b, 0.7)
                                                     d     0.6     0.9       1.5
                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)           ω = sum(0.8, 0.85) = 1.65




              A Top-2 Query Example
              • Step 2 (Again)
                •Determine threshold value based on objects currently seen
                under sorted access. ω = sum(L1, L2)
                •2 objects with overall score ≥ threshold value ω? Stop
                •else go to next entry position in sorted list and go to step 1

                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                     ID     A1      A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                     a     0.9     0.85      1.75
                     (c, 0.72)    (b, 0.7)
                                                     d     0.6     0.9       1.5
                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)








              A Top-2 Query Example
              Situation at stopping:
                                              ω = sum(0.72, 0.7) = 1.42 < 1.5
                        L1          L2

                     (a, 0.9)     (d, 0.9)
                                                        ID   A1      A2    sum(A1,A2)
                     (b, 0.8)     (a, 0.85)
                                                        a    0.9    0.85      1.75
                     (c, 0.72)    (b, 0.7)
                                                        d    0.6     0.9      1.5
                         .            .
                         .            .
                         .            .
                         .            .

                     (d, 0.6)     (c, 0.2)
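
              The procedure just walked through can be written compactly; a
              sketch of TA for top-k under sum aggregation, with random access
              into the full score table (all names are illustrative). Run on
              the slide's data, it stops at exactly the point shown above:

                import heapq

                def threshold_algorithm(sorted_lists, scores, k):
                    """sorted_lists: lists of (object_id, score), each sorted by score descending.
                    scores: object_id -> tuple of attribute scores (random access).
                    Returns the top-k (aggregate, object_id) pairs under sum aggregation."""
                    seen, top = set(), []              # top is a min-heap of (aggregate, object_id)
                    for depth in range(len(sorted_lists[0])):
                        last = []                      # score last seen in each list at this depth
                        for lst in sorted_lists:
                            obj, s = lst[depth]
                            last.append(s)
                            if obj not in seen:
                                seen.add(obj)
                                agg = sum(scores[obj])          # random access to all attributes
                                heapq.heappush(top, (agg, obj))
                                if len(top) > k:
                                    heapq.heappop(top)          # keep only the k best seen so far
                        threshold = sum(last)                   # max possible aggregate of unseen objects
                        if len(top) == k and top[0][0] >= threshold:
                            break                               # halting condition
                    return sorted(top, reverse=True)

                # The slide's example: top-2 over two attributes
                scores = {'a': (0.9, 0.85), 'b': (0.8, 0.7), 'c': (0.72, 0.2), 'd': (0.6, 0.9)}
                L1 = sorted(((o, s[0]) for o, s in scores.items()), key=lambda x: -x[1])
                L2 = sorted(((o, s[1]) for o, s in scores.items()), key=lambda x: -x[1])
                print(threshold_algorithm([L1, L2], scores, k=2))   # -> a (1.75) and d (1.5)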




              TA-based Filtering Strategy for Graph
              Search Problem
              • Main idea
                •Each graph is broken into a multiset of stars
                •Each distinct star generated from the database graphs can be
                seen as an index attribute in the TA database model
                •Each entry in the sorted lists contains the graph identity
                (denoted by gi) and its score (denoted by λ) in that star
                attribute, the score is defined as the star edit distance between
                a star of gi and the index star
                •Halting condition: given m sorted lists, if the aggregated
                threshold ω = sum(λ1, …, λm) ≥ d (d is the threshold bound on
                the graph mapping distance), TA stops.








              TA-based Filtering Strategy for Graph
              Search Problem
              • Challenges:
                •How do we know that the mapping distances of unseen graphs
                are larger than the threshold (so that these graphs can be
                safely filtered out)? Predict the minimum possible mapping
                distance of unseen graphs:

                                       L1           L2

                                      g1: 0         g4: 0
                 Seen
                                      g2: 1         g1: 1
                                      g3: 2         g2: 3         ω = sum(2, 3) = 5 > d (= 4)
                                        .             .
                                        .             :
                                                    g6. 5

            Possibly unseen           g6. 5
                                        :             .
                                        .             .                  Threshold value
                                      g4: 6         g3: 9




              TA-based Filtering Strategy for Graph
              Search Problem
              • A graph database with a query example



               Sorted list L1   Sorted list L2   Sorted list L3








              Requirement

              • An index structure
                •Convenient for score-sorted lists construction
              • Efficient star search algorithm
                •Quickly return similar stars to a query star
              • Sorted properties for the halting condition of TA
                •The mapping distance of any unseen graph gi satisfies
                      μ(q, gi) ≥ ω = sum(λ1, …, λm) > d (= τ·δ')
                •q is the query graph, τ is the distance threshold, and
                •δ' = max{4, min{δ(q), δ(D')} + 1}
                •where D' is the set of all unseen graphs.




               Requirement
              • An index structure
                •Convenient for score-sorted lists construction
              • Efficient star search algorithm
                •Quickly return similar stars to a query star
              • Sorted properties for the halting condition of TA
                •The mapping distance of any unseen graph gi satisfies
                •      μ(q, gi) ≥ ω = sum(λ1, …, λm) > d (= τ·δ')
                •q is the query graph, τ is the distance threshold, and
                •δ' = max{4, min{δ(q), δ(D')} + 1}
                •where D' is the set of all unseen graphs.

               Recall that the mapping distance in our previous work
               satisfies
                      μ(q, gi) ≤ max{4, min{δ(q), δ(gi)} + 1} · λ(q, gi).
               We denote δ(q, gi) = max{4, min{δ(q), δ(gi)} + 1},
               then δ(q, gi) ≤ δ'.
               If μ(q, gi) > τ·δ', then λ(q, gi) > τ·δ'/δ(q, gi) ≥ τ,
               and this graph can be safely filtered out.








              Requirement
              • An index structure
                •Convenient for score-sorted lists construction
              • Efficient star search algorithm
                •Quickly return similar stars to a query star
              • Sorted properties for the halting condition of TA
                •The mapping distance of any unseen graph gi satisfies
                •      μ(q, gi) ≥ ω = sum(λ1, …, λm) > d (= τ·δ')
                •q is the query graph, τ is the distance threshold, and
                •δ' = max{4, min{δ(q), δ(D')} + 1}
                •where D' is the set of all unseen graphs.




              Build Inverted Index Structures based
              on the Star Decomposition
            • The upper-level index
              •Build an inverted index between stars and graphs
              •Used to quickly return the lists of graphs containing a star
            • The lower-level index
              •Build an inverted index between labels and stars
              •Used to construct the sorted lists for top-k star search
              based on the TA filtering strategy (a sketch follows below)
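
            A sketch of the two-level structure as plain Python dictionaries
            (illustrative only; the actual SEGOS index also keeps the
            information needed to materialize the score-sorted lists):

                from collections import defaultdict

                def build_indexes(database, decompose_into_stars):
                    """Upper level: star -> graph ids containing it.
                    Lower level: vertex label -> stars whose root or leaves carry that label."""
                    star_to_graphs = defaultdict(set)
                    label_to_stars = defaultdict(set)
                    for gid, (labels, adj) in enumerate(database):
                        for star in decompose_into_stars(labels, adj):
                            star_to_graphs[star].add(gid)
                            root, leaves = star
                            for lab in (root, *leaves):
                                label_to_stars[lab].add(star)
                    return star_to_graphs, label_to_stars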








              Build Inverted Index Structures based
              on the Star Decomposition




              Top-k Star Search Algorithm

              • Construct sorted lists








              Graph Score-sorted Lists

              • Construct lists based on the top-k results




              TA-based Graph Range Query

              • Definition
                •Given a graph database D and a query q, find all gi ∈ D that
                are similar to q with λ(q, gi) ≤ τ. τ is the distance
                threshold.
              • Steps: given m sorted lists for a query graph q
                •Perform sorted retrieval in a round-robin schedule to each
                sorted list. For a retrieved graph gi, if Lm(q, gi) > τ, filter out
                the graph; if Um(q, gi) ≤ τ, report the graph to the answer
                set.
                •For each sorted list SLj, let χj be the corresponding distance
                last seen under sorted access. If ω = sum(χ1,…, χm) >
                τ∗δ’, then halt. Otherwise, go to step 1.
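
              A sketch of the round-robin loop described above; Lm_bound,
              Um_bound and exact_check stand for the lower bound, upper bound
              and exact GED test, and the maintenance of δ' is simplified to a
              fixed parameter (all names are illustrative):

                def ta_range_query(sorted_lists, tau, delta_prime, Lm_bound, Um_bound, exact_check):
                    """Round-robin sorted access; filter with Lm, accept with Um, halt on the threshold."""
                    answers, candidates, processed = [], [], set()
                    depth, n = 0, len(sorted_lists[0])
                    while depth < n:
                        last = []
                        for lst in sorted_lists:
                            gid, dist = lst[depth]
                            last.append(dist)
                            if gid in processed:
                                continue
                            processed.add(gid)
                            if Lm_bound(gid) > tau:
                                continue                    # safely filtered
                            if Um_bound(gid) <= tau:
                                answers.append(gid)         # reported without exact computation
                            else:
                                candidates.append(gid)      # needs an exact GED check later
                        if sum(last) > tau * delta_prime:   # halting: unseen graphs cannot qualify
                            break
                        depth += 1
                    answers.extend(gid for gid in candidates if exact_check(gid) <= tau)
                    return answers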








              CA-based Filtering Strategy

              • The difference between TA and CA
                •TA computes the mapping distance between two graphs
                when retrieving a new graph through sorted accesses

                  •Only at every depth h of the sorted scan, for seen but
                  unprocessed graphs, CA first uses estimated mapping-distance
                  bounds to filter graphs; then it uses an incremental
                  Hungarian algorithm to compute partial mapping distances
                  for further filtering




              CA-based Filtering Strategy

              • Suppose l(g) = {l1,…,ly} ⊆ {1,2,…,m} is a set of known
                lists of g seen below q. Let χ(g) be the multiset of
                distances of the distinct stars of g last seen in known lists.
                •Lower bound denoted by Lμ(q, g) is obtained by substituting
                the missing lists j ∈ {1,2,…,m}\l(g) with χj (the distance
                last seen under the jth list) in ζ(q, g)
                •Upper bound denoted by Uμ(q, g) is computed as
                               Uμ(q, g) = t′(χ(g)) + χ ∗ (|g| − |χ(g)|)
              • Theorem: Let g1 and g2 be two graphs; the bounds
                obtained as above satisfy
                      ζ(g1, g2) ≤ Lμ(g1, g2) ≤ μ(g1, g2) ≤ Uμ(g1, g2)








              CA-based Filtering Strategy

              • Dynamic Hungarian for partial mapping distances
                •Given m sorted lists for q, suppose S′(g) ⊆ S(g) is the
                multiset of stars of g seen so far under sorted access. Then
                μ(S(q), S′(g)) ≤ μ(q, g)




              CA-based Graph Range Query

              • Steps: given m sorted lists for a query graph q
                •Perform sorted retrieval in a round-robin schedule to each
                sorted list. At each depth h of the lists:
                    • Maintain the lowest values χ1, …, χm encountered in the
                    lists. Maintain a distance accumulator ζ(q, gi) and a multiset
                    of retrieved stars S′(gi) ⊆ S(gi) for each gi seen under the lists.
                    • For each gi that is retrieved but unprocessed: if ζ(q, gi) > τ∗δgi,
                    filter it out; if Lμ(q, gi) > τ∗δgi, filter it out; if Uμ(q, gi) ≤ τ∗δgi,
                    add the graph to the candidate set. Otherwise, if μ(S(q), S′(gi))
                    > τ∗δgi, filter out the graph. Finally, run the Dynamic Hungarian
                    algorithm to obtain Lm(q, gi) and Um(q, gi) for filtering.
                •When a new distance is updated, compute a new ω. If ω =
                t′(χ) > τ∗δ′, then halt. Otherwise, go to step 1.








              Experimental Results: Sensitivity test




              Experimental Results: Index construction








            Experimental Results: compare with other
            works varying distance thresholds




              Experimental Results: compare with other
              works varying dataset sizes








              References
              • D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of
                graph matching in pattern recognition. IJPRAI, 2004.
              • P. Foggia, C. Sansone, and M. Vento. A performance
                comparison of five algorithms for graph isomorphism. In 3rd
                IAPR-TC15 Workshop on Graph-based Representations in
                Pattern Recognition, 2001.
              • K. Riesen, M. Neuhaus, and H. Bunke. Bipartite graph
                matching for computing the edit distance of graphs. In GbRPR,
                2007.
              • P. Hart, N. Nilsson, and B. Raphael. A formal basis for the
                heuristic determination of minimum cost paths. IEEE Trans.
                SSC, 1968.




              References
              • D. Justice and A. Hero. A binary linear programming formulation
                of the graph edit distance. IEEE TPAMI, 2006.
              • R. Giugno and D. Shasha. Graphgrep: A fast and universal
                method for querying graphs. In ICPR, 2002.
              • R. D. Natale, A. Ferro, R. Giugno, M. Mongiovì, A. Pulvirenti,
                and D. Shasha. SING: subgraph search in non-homogeneous
                graphs. BMC Bioinformatics, 2010.
              • X. Yan, P. S. Yu, and J. Han. Graph indexing: a frequent
                structure-based approach. In SIGMOD, 2004.
              • J. Cheng, Y. Ke, W. Ng, and A. Lu. Fg-index: towards
                verification-free query processing on graph databases. In
                SIGMOD, 2007.








              References
              • D.W. Williams, J. Huan, and W. Wang. Graph database
                indexing using structured graph decomposition. In ICDE, 2007.
              • S. Zhang, M. Hu, and J. Yang. Treepi: a novel graph indexing
                method. In ICDE, 2007.
              • P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: tree + delta
                >= graph. In VLDB, 2007.
              • G. Wang, B. Wang, X. Yang, and G. Yu. Efficiently indexing
                large sparse graphs for similarity search. IEEE TKDE, 2010.








                  Searching and Mining Complex
                            Structures
                               Massive Graph Mining
                                      Anthony K. H. Tung(鄧锦浩)
                                          School of Computing
                                     National University of Singapore
                                      www.comp.nus.edu.sg/~atung




         Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
         Social Network Link: http://www.renren.com/profile.do?id=313870900




              Graph applications: everywhere

               And often, they are huge and messy.



                    social network

                                                         Bio Pathway




                         Co-authorship
                         network








              Knowledge: NOWHERE


              Unless we manage to find where they hide.
              Too many clues is like no clue.




              Roadmap


              Part 1 (1.5 hrs)
                •Graph Mining Primer
                •Recent advances in Massive Graph Mining
              Part 2 (1.5 hrs)
                •CSV: cohesive subgraph mining
                •DN-graph mining: a triangle-based approach








              Roadmap
              • Graph Mining Primer
                   •   Data mining vs. Graph mining
                   •   Massive graph mining domain
                   •   Types of graph patterns
                   •   Properties of large graph structure
              • Recent advances in Massive Graph Mining
              • CSV: cohesive subgraph mining
              • DN-graph mining: a triangle-based approach




              From Data Mining to Graph Mining

              Data Mining
                 • Classification
                 • Clustering
                 • Association rule learning

              Graph Mining
                 • Captures more complicated entity relationships.
                 • Output: patterns, which are smaller subgraphs with
                   interpretable meanings.








              Massive graph mining domains
              •   Financial data analysis
              •   Bioinformatics network
              •   User profiling for customized search
              •   Identify financial crime




              Financial data analysis
             In the stock market, correlations among stocks help in
                making profits.
             Mining stock correlation graphs helps predict stock price
                changes, for estimating future returns, allocating
                portfolios and controlling risk.

                [Figures: stock correlations in tabular form; stock
                correlation patterns]








              Financial data analysis
             In the stock market, correlations among stocks help in
                making profits.
             Mining stock correlation graphs helps predict stock price
                changes, for estimating future returns, allocating
                portfolios and controlling risk.

                [Figure: highly correlated stock sets highlighted in the
                stock correlation patterns]




              Bioinformatics network

                •Protein-protein interaction
                   • Underlies fundamental activities of living cells.
                   • A dense graph pattern indicates that these proteins
                   have similar functionalities.

                [Figure: one representation of an assembled NEDD9 network]








              User profiling for customized search
             The Internet Movie Database (IMDB)
               Registered users can comment on movies of their interest.
               Mining the comment-sharing network provides insight into
               users' interests and can further facilitate customized search.

                [Figure: movie-centric view of the IMDB review network]




              Identify financial crime

             Large classes of financial crimes, such as money laundering,
               follow certain transactional patterns.

              [Figures: geospatial information of suspects; a money
              laundering pattern]








              Dense Graph Patterns
              Clique/Quasi-Clique
                 A clique represents the highest level of internal interaction.
                 A quasi-clique is an "almost" clique with a few missing edges.
              High Degree Patterns
                 Concern the average vertex degree, i.e., the number of edges
                 incident to a vertex.




              Dense graph patterns (cont.)
             Dense Bipartite Patterns
                [Figure: bipartite graph of pathways and genes for the
                AML/ALL dataset]
             Heavy Patterns
                [Figure: weighted, directed graph of an online citation
                network, by Rosvall & Bergstrom]








              Properties of large graph structure
              Static
                 •Power law degree distributions.
                 •Small world phenomenon.
                 •Communities and clusters.
              Dynamic
                 •Shrinking diameters of enlarging graphs
                 •Densification along time




              Power law








           Large graph: properties and laws (cont.)
              Dynamic
                •Shrinking diameters of enlarging graphs.
                •Densification along time




                  Roadmap
            • Graph Mining Primer
            • Large graph: properties and laws
            • Approaches in Graph Mining
                  • Pattern-based mining algorithms
                  • Practical techniques in Massive Graph Mining
                     • Graph summarization with randomized sampling
                     • Connectivity-based traversal
                     • MapReduce-based approaches
            • CSV: cohesive subgraph mining
            • DN-graph mining: a triangle-based approach








              Pattern-based Mining Algorithms
              Greedy methods
                SUBDUE (PWKDD04), GBI (JAI94)
              Apriori-based approaches (details in the next few slides)
                AGM, FSG, gSpan
              Inductive logic programming (ILP) oriented solutions
                WARMR, FARMAR
              Kernel-based solutions
                Kernels for graph classification




              Apriori Paradigm Recall


             Search in a breadth-first manner.
             Use a lattice structure to generate and count candidate
               sets efficiently (a sketch of the classic itemset
               join-and-prune step follows the figure caption below).


                                           A search lattice for item set mining
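
              To make the recalled join-and-prune step concrete, here is a minimal,
              illustrative Python sketch of Apriori candidate generation for itemsets
              (the setting of the search lattice above); it is not taken from AGM, FSG
              or gSpan, and the function name apriori_gen is an assumption.

                  from itertools import combinations

                  def apriori_gen(frequent_k):
                      """Join frequent k-itemsets sharing a (k-1)-prefix, then prune
                      candidates that contain an infrequent k-subset (Apriori property)."""
                      freq = set(frequent_k)
                      items = sorted(freq)
                      candidates = []
                      for a, b in combinations(items, 2):
                          if a[:-1] == b[:-1]:                          # same (k-1)-prefix
                              cand = a + (b[-1],)
                              if all(sub in freq
                                     for sub in combinations(cand, len(cand) - 1)):
                                  candidates.append(cand)
                      return candidates

                  # Example: frequent 2-itemsets over items A, B, C
                  print(apriori_gen([('A', 'B'), ('A', 'C'), ('B', 'C')]))  # [('A', 'B', 'C')]

              Counting the support of each surviving candidate against the database then
              yields the frequent (k+1)-itemsets for the next level of the lattice.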








              Apriori-based Graph Mining
              Performance bottleneck: candidate subgraph generation.
              Solution:
                1. Build a lexicographic order among graphs.
                2. Search using depth-first strategy.
              Very effective in mining large collections of small to medium
                size graphs.




          Graph summarization with randomized
          sampling
             • Efficient Aggregation for Graph Summarization –
               SIGMOD 2008
             • Graph Summarization with Bounded Error-SIGMOD
               2008
             • Mining graph patterns efficiently via randomized
               summaries - VLDB 2009








              Efficient Aggregation for Graph
              Summarization
               As graph size increases, graph summarization becomes
                 crucial for visualizing the whole graph.
               Criteria for an effective summarization solution
                 Able to produce meaningful summaries for real
                 applications.
                 Scalable to large graphs.
                  The choice: graph aggregation




              Graph Aggregation
              1. Summarization based on user-selected node attributes and
                 relationships.
              2. Produce summaries with controllable resolutions.
                  “drill-down” and “roll-up” abilities to navigate
              Propose two aggregation operations
                 SNAP – address 1
                 k-SNAP      - address 2








              Operation SNAP
             Group nodes by user-selected node attributes & relationships
             Nodes in each group are homogeneous (in terms of attributes
               and relationships).
             Goal: minimum # of groups




              How does SNAP work?

             Top-down approach
               Initial step: use the user-selected attributes to group nodes.
               Iterative step:
                   If a group is not homogeneous w.r.t. relationships, split the
                   group based on its relationships with the other groups
                   (a minimal sketch follows).
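
             A minimal sketch of this top-down grouping, under the simplifying
             assumptions of a single categorical attribute per node and one relationship
             type; it is illustrative only, not the SIGMOD 2008 implementation, and the
             names snap_groups, attr and adj are assumptions.

                 from collections import defaultdict

                 def snap_groups(nodes, attr, adj):
                     """attr: node -> attribute value; adj: node -> set of neighbors."""
                     groups = defaultdict(set)          # initial step: group by attribute
                     for v in nodes:
                         groups[attr[v]].add(v)
                     groups = list(groups.values())
                     changed = True
                     while changed:                     # iterative step
                         changed = False
                         group_of = {v: i for i, g in enumerate(groups) for v in g}
                         for i, g in enumerate(groups):
                             # a node's signature = the set of groups it has edges to
                             sig = {v: frozenset(group_of[u] for u in adj[v]) for v in g}
                             parts = set(sig.values())
                             if len(parts) > 1:         # not homogeneous w.r.t. relationships
                                 groups[i:i + 1] = [{v for v in g if sig[v] == s} for s in parts]
                                 changed = True
                                 break                  # group ids changed; rescan
                     return groups

                 # Two students each linked to one professor: both groups stay homogeneous
                 attr = {'a': 'student', 'b': 'student', 'c': 'prof', 'd': 'prof'}
                 adj = {'a': {'c'}, 'b': {'d'}, 'c': {'a'}, 'd': {'b'}}
                 print(snap_groups(list(attr), attr, adj))  # [{'a', 'b'}, {'c', 'd'}]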








              SNAP limitation
              Homogeneity requirement for relationships
                Noise and uncertainty




                Users have no control over the resolutions of summaries
                SNAP operation can result in a large number of small groups




               Operation k-SNAP
             The entities inside a group are not necessarily
               homogeneous in terms of relationships with other
               groups.
             Users can control the resolution by specifying k (the #
               of groups).
             Varying k provides “drill-down” and “roll-up”
               abilities.








               Assess the quality of a summarization

              Determined by the sum of "noisy" relations (Δ):
                When the relationship between two groups is strong
                (participation ratio > 50%), count the missing participants.
                When the relationship between two groups is weak
                (participation ratio <= 50%), count the extra participants.




               k-SNAP goal
               Find the summary of size k with the best quality,
                  i.e. minimal Δ.
               The exact solution minimizing Δ is NP-complete.
               Heuristics
                  Top-down: split a group into 2 at each iteration;
                      choose the group with the worst quality and split it.
                  Bottom-up: merge 2 groups into 1;
                      choose groups with the same attribute values and similar
                      neighbor groups or participation ratios.








              Major results




                          Double-blind
                          review’s
                          effect on LP
                          authors.




              k-SNAP: Top down vs. Bottom up








             Graph Summarization with Bounded
             Error
             Large graph data needs compression
                Compression can reduce size to 1/10 (web graph)
             Graph compression vs. Clustering

                   Compression                          Clustering

                   Uses URLs / node labels              Works for generic
                   for compression                      graphs
                   Result lacks meaning                 No compression for
                                                        space saving




              Solution: MDL Based
              Compression for Graphs

               Intuition
                  Many nodes with similar
                  neighborhoods
                     • Communities in social networks;
                     link-copying in webpages
                  Collapse such nodes into
                  supernodes (clusters)
                  and the edges into superedges
                     • Bipartite subgraph to two
                     supernodes and a superedge
                     • Clique to supernode with a
                     “self-edge”
                  [Figure: example graph with nodes a-g collapsed into supernodes
                   X = {d,e,f,g} and Y = {a,b,c} in the summary]
                    “self-edge”








         How to choose vertex sets to compress
          [Figure: input graph on nodes a-j, cost = 14 edges; its summary uses
           supernodes X = {d,e,f,g} and Y = {a,b,c} plus corrections
           +(a,h), +(c,i), +(c,j), -(a,d), for a cost of 5 (1 superedge + 4 corrections)]

              MDL based compression
               S is a high-level summary graph.
               C is a set of edge corrections.
               Minimize the cost of S + C.

               Novel approximate representation:
               reconstructs the graph with bounded error
               (ε); results in better compression.




              Compress (cont.)

              Summary S(V_S, E_S)
                 Each supernode v represents a set of nodes A_v.
                 Each superedge (u,v) represents
                 all pairs of edges π_uv = A_u × A_v.
              Corrections C: {(a,b) : a and b are nodes of G}
                 [Figure: summary with X = {d,e,f,g}, Y = {a,b,c} and
                  C = {+(a,h), +(c,i), +(c,j), -(a,d)}]
              Supernodes are the key;
                superedges/corrections follow easily:
                 A_uv = actual edges of G between A_u and A_v
                 Cost with superedge (u,v) = 1 + |π_uv − A_uv|
                 Cost without superedge (u,v) = |A_uv|
                 Choosing the minimum decides whether edge
                 (u,v) is in S (see the sketch below).
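
              A minimal sketch of the per-superedge cost rule above; illustrative only
              (not the paper's code), with edges stored as unordered pairs.

                  def superedge_decision(A_u, A_v, edges):
                      """Decide whether to keep superedge (u,v) and return its cost."""
                      pi_uv = {frozenset((a, b)) for a in A_u for b in A_v if a != b}
                      A_uv = pi_uv & edges                      # actual edges of G
                      cost_with = 1 + len(pi_uv - A_uv)         # superedge + "-" corrections
                      cost_without = len(A_uv)                  # "+" corrections only
                      return cost_with < cost_without, min(cost_with, cost_without)

                  # Example from the slides: X = {d,e,f,g}, Y = {a,b,c}, edge (a,d) missing
                  X, Y = {'d', 'e', 'f', 'g'}, {'a', 'b', 'c'}
                  edges = {frozenset((x, y)) for x in X for y in Y} - {frozenset(('a', 'd'))}
                  print(superedge_decision(X, Y, edges))        # (True, 2): superedge plus -(a,d)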








              Reconstruct

             Reconstructing the graph from R (sketched below):
               For all superedges (u,v) in S, insert all pairs of edges π_uv.
               For all +ve corrections +(a,b), insert edge (a,b).
               For all -ve corrections -(a,b), delete edge (a,b).
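
              A minimal sketch of this reconstruction rule; the names and the
              representation of corrections as ('+'/'-', (a, b)) pairs are assumptions.

                  def reconstruct(superedges, corrections, members):
                      """members: supernode -> set of original nodes it represents."""
                      edges = set()
                      for u, v in superedges:                  # expand each superedge
                          for a in members[u]:
                              for b in members[v]:
                                  if a != b:
                                      edges.add(frozenset((a, b)))
                      for sign, (a, b) in corrections:         # then apply corrections
                          if sign == '+':
                              edges.add(frozenset((a, b)))
                          else:                                # '-' correction
                              edges.discard(frozenset((a, b)))
                      return edges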




              Approximate Representation Rє
              Approximate representation
                 Recreating the input graph exactly is not always necessary.
                 A reasonable approximation is enough to compute communities,
                 anomalous traffic patterns, etc.
                 Use the approximation leeway to get further cost reduction.
              Generic neighbor query
                 Given node v, find its neighbors N_v in G.
                 The approximate neighbor set N'_v estimates N_v with ε-accuracy.
                 Bounded error: error(v) = |N'_v − N_v| + |N_v − N'_v| < ε |N_v|,
                 i.e. the number of neighbors added or deleted is at most an
                 ε-fraction of the true neighbors.
              Intuition for computing R_ε
                 If correction (a,d) is deleted, it adds error to both a and d.
                 From the exact representation R for G, remove the maximum number of
                 corrections such that the ε-error guarantees still hold.
                 [Figure: summary with X = {d,e,f,g}, Y = {a,b}, C = {-(a,d), -(a,f)};
                  for ε = 0.5 one correction of a can be removed]








              Main Results: cost reduction

                 Reduces the cost down to 40%.
                 Cost of GREEDY is 20% lower than that of RANDOMIZED.
                 RANDOMIZED is 60% faster than GREEDY.




              Comparison with other schemes

               Techniques give much
                better compression








              Approximate-Representation




             Cost decreases linearly
               as ε is increased;
             with ε = 0.1, there is a 10% cost
               reduction over R.




              Mining graph patterns efficiently via
              randomized summaries
              Motivation
                In a graph with a large number of identically labeled vertices,
                subgraph isomorphism testing becomes a bottleneck.
                How can we avoid enumerating identical patterns?




                                                   3 (triangular) × 4 (square) = 12 (total)








              Solution framework


              Summarization->Mining->Verification
                                                                                  Raw DB




                                                                               Summarized DB




                                                                                  Raw DB




              Reduce false positive
              • Technique 1: merge vertices that are far away from each
                other, measured by
                 • the length of the shortest path, or
                 • the probability of a random walk.
              • Technique 2: merge vertices whose neighborhoods overlap
                 • Cosine, Chi^2, Lift, Coherence
              • Technique 3: go back to the raw database to do verification.
              It is guaranteed that there are no false positives.
                      [Figure: summarization may cause false embeddings,
                       i.e. false positives]








              Summarization: Reduce false negative
                 [Figure: summarization may miss embeddings, i.e. false negatives]

              Technique 1: For a raw database with frequency threshold min_sup,
                  adopt a lower frequency threshold pseudo min_sup for the
                  summarized database.
              Technique 2: Iterate the mining steps T times and combine the
                  results generated in each iteration.
              It is NOT guaranteed that there are no false negatives, but the
                  probability can be bounded.




              Connectivity based traversal
              CSV: Cohesive Subgraph Mining –SIGMOD 2008
                (Discussed in detail in Part II)
              Progressive Clustering of Networks using
                Structure-Connected Order of Traversal –ICDE 2010








            Progressive clustering of networks using structure-
            connected order of traversal
              SCAN Algorithm
                 • Similar to DBSCAN: connectivity-based.
                 • Average O(n) time.
                 • Uses a structural similarity measure, a minimum cluster size mu,
                   and a minimum similarity epsilon (a sketch of the similarity
                   measure follows this list).
                 • Finds outliers and hubs.
              Problems
                 • No automated way to find a good epsilon.
                 • Must rerun the algorithm for each possible epsilon.
                 • Epsilon is a global threshold:
                    • no hierarchical clusters,
                    • no variation in cluster subtlety.
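
          A minimal sketch of a structural similarity measure of the kind SCAN uses,
          assumed here to be the cosine similarity of closed neighborhoods;
          illustrative only.

              from math import sqrt

              def structural_similarity(adj, u, v):
                  """adj: node -> set of neighbors. Uses closed neighborhoods N(x) ∪ {x}."""
                  gu = adj[u] | {u}
                  gv = adj[v] | {v}
                  return len(gu & gv) / sqrt(len(gu) * len(gv))

              # A pair is epsilon-connected if the similarity is >= epsilon; clusters
              # grow from vertices with at least mu epsilon-neighbors (DBSCAN-style).
              adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
              print(round(structural_similarity(adj, 1, 2), 3))   # 1.0 (identical closed neighborhoods)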




         Solution

         • Structure-Connected Order of Traversal (SCOT)
            •Contains all possible epsilon-clusterings
         • Efficient method to find global epsilon
         • New Contiguous Subinterval Heap structure
           (ContigHeap)
         • New Progressive Mean Heap Clustering (ProClust)
            •Epsilon-free
            •Hierarchical
         • Refinement by Gap Constraint (GapMerge)








         Original Network:



         SCOT plot:




         Optimal Global Epsilon

         The SCAN paper only contains a supervised
           sampling method:
            sample points, find k-NN similarities, sort,
            plot, and find the knee visually;
            O(nd log n) time,
            in addition to the clustering time.

         Our solution:
            The knee hypothesis implies an approximately
            concave plot.
            The optimal epsilon minimizes the obtuse angle
            between segments.
            Modified histogram and binary search: O(n)
            time.
            Uses the SCOT result that is already computed.








         ContigHeap

         BuildContigHeap produces heap containing
          all contiguous subintervals from SCOT
          output in O(n) time, and integrates with
          SCOT
         Example:




         GapMerge: Gap Constraint Refinement
         Merges chained clusters, heap branches with single children
            Does not merge across pruned heap nodes (local maxima boundary)
         Gap constraint prevents clusters whose left or right boundaries differ by more
           than mu from being merged
            Such clusters are not redundant relative to the minimum interesting cluster size
         Steps
            1. Identify chains that meet the gap constraint.
            2. When a node has more than one child or violates the gap constraint, begin a new chain.
            3. Within each chain, calculate the significance of each cluster in both the up and down
               directions.
            4. Begin with the most redundant node; merge nodes in the direction of least significance.
            5. After each merge, recalculate the significances.
            6. Continue until the chain contains one node, or no merging is possible under the gap constraint.








              MapReduce based approach
              PEGASUS: A Peta-Scale Graph Mining System –ICDM
                2009
              Pregel: a system for large-scale graph processing SIGMOD
                2010




           PEGASUS: A Peta-Scale
           Graph Mining System
             Deals with real graphs such as the Yahoo! web graph with up to 6.7
               billion edges.
             A Hadoop-based graph mining package.
             Targets primitive matrix operations such as matrix-vector
               multiplication (GIM-V).








              Motivation
             Many graph mining tasks require matrix-vector
              multiplication
               PageRank,
               Random Walk with Restart(RWR),
               Diameter estimation, and
               Connected components …
             MapReduce provides a simplified programming
              model for large-scale data processing
                 Details of data distribution, replication, and load balancing are
                taken care of.
                Provides a uniform programming structure, i.e. functional
                programming.




               GIM-V: Generalized Iterative Matrix-Vector Multiplication

              Intuition: matrix-vector multiplication
                 M × v = v',  where  v'_i = Σ_{j=1..n} m_{i,j} × v_j
              expressed in three steps:
                 combine2(m_{i,j}, v_j): multiply m_{i,j} and v_j
                 combineAll(x_1, ..., x_n): combine the n partial results into v'_i
                 assign(v_i, v_new): overwrite v_i with the new value
              The operator ×_G is matrix-vector multiplication expressed by the above 3 steps.
              ×_G is carried out iteratively until convergence.








              ×_G and SQL
               The ×_G operation can be expressed as an SQL query.
               If we view the graph as two tables,
                  an edge table E(sid, did, val) and
                  a vector table V(id, val),
               then ×_G becomes:

                      SELECT   E.sid, combineAll_{E.sid}(combine2(E.val, V.val))
                      FROM     E, V
                      WHERE    E.did = V.id
                      GROUP BY E.sid




               Generalize ×_G

               Vary the definitions of the three steps to generalize ×_G.
               PageRank:
                      p = (c E^T + (1 - c) U) p
                  where E is the row-normalized adjacency matrix,
                  U is the matrix with all elements equal to 1/n, and
                  c = 0.85 is the damping factor.








                   Generalize ×_G (cont.)
               For PageRank, the three steps become:

                      p = (c E^T + (1 - c) U) p
                      combine2(m_{i,j}, v_j) = c × m_{i,j} × v_j
                      combineAll(x_1, ..., x_n) = (1 - c)/n + Σ_{j=1..n} x_j
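
               A minimal single-machine sketch of the GIM-V pattern instantiated for
               PageRank with the combine2/combineAll/assign steps above. The real
               PEGASUS system runs these steps as Hadoop MapReduce stages, so this is
               only an illustration; the function name is an assumption and it assumes
               every node has at least one out-edge.

                   def gimv_pagerank(edges, n, c=0.85, iters=50):
                       """edges: list of (src, dst); E is the row-normalized adjacency matrix."""
                       out_deg = [0] * n
                       for s, _ in edges:
                           out_deg[s] += 1
                       p = [1.0 / n] * n
                       for _ in range(iters):
                           partial = [[] for _ in range(n)]
                           for s, d in edges:                    # combine2: c * m_{i,j} * v_j
                               partial[d].append(c * (1.0 / out_deg[s]) * p[s])
                           # combineAll: (1 - c)/n + sum of partial results; assign: overwrite p
                           p = [(1.0 - c) / n + sum(xs) for xs in partial]
                       return p

                   # Tiny example: a 3-node cycle converges to the uniform distribution
                   print(gimv_pagerank([(0, 1), (1, 2), (2, 0)], 3))   # ~[0.333, 0.333, 0.333]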




              Generalize (cont.)
              By altering three functions, GIM-V adapts to
              • Random Walk with Restart
              • Diameter Estimation
              • Connected Components








              GIM-V: How to
               Stage 1
                  combine2
                  V: key = id, value = v_val;   E: key = id_src
               Stage 2
                  combineAll & assign


                            Bottleneck: shuffling and disk I/O




               GIM-V Block Multiplication (BL)




               Advantages
                 Saves on sorting
                 Data compression
                 Clustered edges








              Block advantage (cont.)
              Clustered edge:




               GIM-V DI: Diagonal Block Iteration

              Intuition
                 Increase the amount of multiplication
                 inside an iteration to
                 reduce the # of iterations.
              How
                 Reach local convergence
                 within a block first before
                 iterating                      [Figure: comparison of GIM-V BL and DI]








              Main Results
              Scalability
                GIM-V BL DI is ~5 times faster than GIM-V Base




              Main Results (cont.)

             Evolution of LinkedIn
                The distribution of connected
                components is stable after a
                ‘gelling’ point in 2003.








              Main Results (cont.)
              Bimodal structure of Radius




              Pregel: A System for Large-Scale
              Graph Processing
              A scalable and fault-tolerant platform with an API that is
                 sufficiently flexible to express arbitrary graph algorithms.
               Model of computation:
                  Vertex-centric, bulk-synchronous iterative model (supersteps)








              Graph Algorithms Implementation in
              Pregel
               Graph data stay on their respective machines; vertices communicate by
                 passing messages only, with NO graph state passing.




              Pregel C++ API
              • Compute() - executed at each active vertex in every
                superstep.
                •Query information about the current vertex and its edges.
                •Send messages to other vertices.
                •Inspect or modify the value associated with its vertex/out-
                edges.
                 •State updates are visible immediately; since only the owning
                 vertex accesses its value, there are no data races on concurrent
                 value access from different vertices.
               • Limiting the graph state managed by the framework to a
                 single value per vertex or edge simplifies the main
                 computation cycle, graph distribution, and failure
                 recovery. (A conceptual sketch of the superstep model follows.)
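
               A conceptual single-machine sketch of the vertex-centric superstep model,
               using single-source shortest paths as the example. This is NOT the Pregel
               C++ API; all names are illustrative.

                   import math

                   def pregel_sssp(adj, source):
                       """adj: vertex -> list of (neighbor, edge_weight); all vertices are keys."""
                       value = {v: math.inf for v in adj}
                       inbox = {v: ([0] if v == source else []) for v in adj}   # superstep-0 messages
                       active = set(adj)
                       while active:                                   # one superstep per pass
                           outbox = {v: [] for v in adj}
                           next_active = set()
                           for v in active:                            # "Compute()" per active vertex
                               best = min(inbox[v], default=math.inf)
                               if best < value[v]:
                                   value[v] = best                     # update the vertex value
                                   for u, w in adj[v]:                 # send messages along out-edges
                                       outbox[u].append(value[v] + w)
                                       next_active.add(u)
                           inbox, active = outbox, next_active         # synchronization barrier
                       return value

                   adj = {'a': [('b', 1), ('c', 4)], 'b': [('c', 1)], 'c': []}
                   print(pregel_sssp(adj, 'a'))   # {'a': 0, 'b': 1, 'c': 2}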








              Pregel C++ API (cont.)
              • Message passing
                • No guaranteed order, but every message is delivered and not
                  duplicated.
              • Combiners
                • Combine several messages to reduce overhead.
              • Aggregators
                • Mechanism for global communication, monitoring, and data.
                • A number of predefined aggregators, such as min, max, or
                  sum operations.
              • Topology mutation
                • Change the graph topology; conflicts are resolved when individual
                  vertices send conflicting mutation requests.




              Pregel C++ API (cont.)
              • Input and output
                • Readers and writers








              Pregel implementation
               • Designed for the Google cluster architecture
                 • Each cluster consists of thousands of commodity PCs.
               • Persistent data
                 • Stored in files on a distributed file system such as GFS, or in
                   BigTable.
               • Temporary data
                 • Stored as buffered messages on local disks.




               Pregel: assigning load
               • Divide the graph vertices into partitions and assign them to
                 different machines
                 • controllable by users; default method: hashing
               • In the absence of faults:
                 • one master, many workers on a cluster of machines
                    • the master assigns partitions, coordinates I/O, and instructs
                      workers on supersteps
               • Fault tolerance:
                 • use checkpoints; the master pings workers
                 • confined recovery (in development): the master logs outgoing
                   messages








               Graph Applications
               PageRank
               Shortest Paths
               Bipartite Matching
               Semi-Clustering




              Pregel: Main Result








              Reference (partial)
              Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations by J.
                   Leskovec, J. Kleinberg, C. Faloutsos. (KDD), 2005.
              Substructure Discovery in the SUBDUE System. L. B. Holder, D. J. Cook and S. Djoko. In
                   (PWKDD), 1994.
              Efficient Aggregation for Graph Summarization – Yuanyuan Tian, Richard A. Hankins, Jignesh M.
                   Patel SIGMOD 2008
              Graph Summarization with Bounded Error-Saket Navlakha, Rajeev Rastogi, Nisheeth Shrivastava
                   SIGMOD 2008
              Mining graph patterns efficiently via randomized summaries Chen Chen, Cindy X. Lin, Matt
                   Fredrikson, Mihai Christodorescu, Xifeng Yan, Jiawei Han - VLDB 2009
               Progressive Clustering of Networks using Structure-Connected Order of Traversal. Dustin
                    Bortner, Jiawei Han. ICDE 2010.
               PEGASUS: A Peta-Scale Graph Mining System. U. Kang, Charalampos E. Tsourakakis,
                    Christos Faloutsos. ICDM 2009.
              Graph based induction as a unified learning framework, K. Yoshida, H. Motoda, and N. Indurkhya.
                   Applied Intelligence volume 4, 1994.
              Complete mining of frequent patterns from graphs: Mining graph data. Akihiro,W. Takashi, and
                   M. Hiroshi. Mach. Learn., 50(3):321–354, 2003.




              Reference (cont.)


              Frequent subgraph discovery, K. Michihiro and G. Karypis. In ICDM, pages 313–320, 2001.
              gSpan: Graph-based substructure pattern mining, X. Yan and J. Han. ICDM 2002.
              WARMR Discovery of frequent datalog patterns. L. Dehaspe and H. Toivonen. Data Mining and
                  Knowledge Discovery, 3(7-36), 1999.
              FARMAR Fast association rules for multiple relations. S. Nijssen and J. Kok. Data Mining and
                  Knowledge Discovery, 3(7–36), 1999.








              Roadmap
              Part I (1.5 hrs)
                Graph Mining Primer
                Recent advances in Massive Graph Mining
              Part 2(1.5 hrs)
                CSV: cohesive subgraph Mining
                Dngraph mining: a triangle based approach




         CSV

         1. Cohesive sub-graph mining, with visualization
         2. Existing approaches
         3. CSV provides effective visual solution
            – Algorithm principle
            – Connectivity Estimation
         4. Experimental Study








              Existing solutions

              1. Current state-of-the-art methods to abstract information from huge
                 graphs (information: yes, structure: no):
                 1. Graph partitioning algorithms
                    Spectral clustering [Ng01]: high computational cost
                    METIS [Karypis96]: favors balanced partitions
                 2. Graph pattern mining algorithms
                    CODENSE [Hu05], CLAN [Zeng06]: exponential running time
              2. Graph layout tools (information: no, structure: yes):
                    Osprey [Breitkreutz03], VisANT [Mellor04]: do not have mining
                        capability
                 We want structured information.




              CSV: General Approach

             • Separate vertices in the graph into VISITED, UNVISITED
             • Start: Pick a vertex and add into VISITED
             • Repeat until UNVISITED=empty
               –Among all vertices that are in UNVISITED, pick one vertex V
               most highly connected to VISITED
               –Plot V’s connectivity
               –Add V into VISITED




                But how do we measure connectivity?








         Connectivity measurement

              Connectivity measurement is closely related to clique (fully connected sub-
              graph) size.

             The connectivity between two vertices in a graph (ηmax) is defined as the
               size of the biggest clique in the graph such that both vertices are
               members of the clique.

             The "connectivity" of a vertex (ζmax) is similarly defined as the size of
               the biggest clique it can participate in.

               [Figure: two example graphs on vertices {a, b, c, d, e};
                left: ηmax(a, d) = 0, ηmax(a, c) = 4;  right: ζmax(a) = 5]




              CSV: Step by Step

             From graph to plot
             [Figure: example graph on vertices A-J with its connectivity plot and heap;
              legend: unvisited, neighbors, visiting, visited]
             Start from A, explore A's neighbor B.
             Calculate ζmax(A) = 2 and output it.








              CSV algorithm on a synthetic graph

             From graph to plot
             [Figure: graph, connectivity plot and heap after visiting A
              (heap contains C, F, B, H)]
             Mark A visited; from B, explore B's immediate neighbors C, F, H.
             Calculate ηmax(A,B) = 2 and output it.




              CSV algorithm on a synthetic graph

              From graph to plot
              [Figure: graph, connectivity plot and heap after visiting B]
              Mark B visited; choose the most closely connected vertex C as the next
              visiting vertex. From C, explore C's immediate neighbors D, F, G, H and
              update ηmax where necessary.
              Calculate ηmax(B,C) = 4 and output it.








              CSV algorithm on a synthetic graph


              From graph to plot
              [Figure: final connectivity plot over the visit order A B C H F G D E I J;
               the peak marks a cohesive sub-graph]
              Visit every vertex accordingly to produce a plot.
              Peaks represent cohesive sub-graphs.




             Important Theorem








                Connectivity computation is a hard
                             problem

               However, when graphs are very huge and massive, exact computation of
                 connectivity is prohibitive.

                                     Direct computation is costly.




              Connectivity computation is
              prohibitive

               • The exact algorithm relies on clique detection (NP-hard).
               • Even approximation is hard.
               • Solution Part 1: spatial mapping
                    • Pick k pivots.
                    • Map the graph into k-dimensional space based on each
                      vertex's shortest distance to the pivots.
                    • A clique will map into the same grid cell.
               [Figure: example graph A-J mapped onto a 2-pivot grid with
                pivots P0 = A and P1 = I]








              Connectivity computation


               • Solution Part 2: approximate upper bound for ζmax(v) and
                 ηmax(v, v')
               • Each vertex in a clique of size k must have
                   • degree >= k - 1, and
                   • k - 1 neighbors with degree >= k - 1.
               • For each vertex v, find its immediate neighbors in the same grid
                 cell and construct a sub-graph.
               • Iteratively readjust the estimate of the clique size.

               Example: estimate ηmax(a, f).
               Locate the immediate neighborhood of a and f, {a, b, c, d, e, f, g}.
               After sorting the degree array in descending order, we have
               6(a), 6(f), 5(d), 4(b), 4(c), 4(e), 3(g); then test candidate
               clique sizes (= 5? = 6? = 7?). A sketch of this bound follows.
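
               A minimal sketch of the degree-based upper bound suggested by the example:
               the largest k such that at least k vertices of the local sub-graph have
               degree >= k - 1. Illustrative only, not the CSV implementation.

                   def clique_size_upper_bound(degrees):
                       """degrees: degrees of the vertices in the local (grid-cell) sub-graph."""
                       degs = sorted(degrees, reverse=True)
                       k = 0
                       for i, d in enumerate(degs, start=1):   # i vertices have degree >= d
                           if d >= i - 1:
                               k = i
                       return k

                   # Example from the slide: degrees 6(a), 6(f), 5(d), 4(b), 4(c), 4(e), 3(g)
                   print(clique_size_upper_bound([6, 6, 5, 4, 4, 4, 3]))   # 5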




              Experimental study on real datasets
                                                            DBLP: co-authorship graphs.
                                                                  DBLP: v 2819, e 54990




                                          Two groups of
                                          German researchers




                 Peaks in the DBLP CSV plot represent different research groups








                             SMD: Stock Market Data
                                                      Bridging vertex




                                                                        Partial clique
                        Partial clique
                                                              Peaks in the SMD CSV plot
                                                              represent highly cohesive
                                                              stocks




                  DIP: Database of Interacting Proteins

            [Figure: two cohesive protein sub-graphs found by CSV in the DIP data,
             annotated with their connectivity values; the right-hand group contains
             the polyadenylation-factor subunits CFT1, CFT2, YSH1, PTA1, MPE1, PFS2,
             FIP1 and YTH1 referred to in the quote below, together with RNA14, REF2,
             GLC7, PAP1 and PTI1; the left-hand group contains LSM and PRP family
             proteins (SMD3, LSM2-LSM8, PRP4, PRP6, PRP8, PRP31, PRP40, and others)]

            Structure of a nucleotide-bound Clp1-Pcf11 polyadenylation factor.
            Christian G. Noble, Barbara Beuth, and Ian A. Taylor. Nucleic Acids Res.
            2007 January; 35(1): 87–99.

            “CPF is also required in both the cleavage and polyadenylation reactions.
            It contains a core of eight subunits Cft1, Cft2, Ysh1, Pta1, Mpe1, Pfs2,
            Fip1 and Yth1.”








         Experimental Study

         CSV as a pre-selection step
            How?
             • Apply CSV first to identify potential
               cohesive sub-graphs.
             • Run the exact algorithm CLAN on
               these candidates.
             Result
             • Gets the same exact cohesive sub-graphs as
               running CLAN alone.
             • Saves 28-84% of the time compared          [Figure: CSV as a
               to running CLAN alone.                       pre-selection method]




         DNgraph mining: A triangle based approach

         • Mining dense patterns out of an extremely large graph
           • When the graph is extremely large, even mining dense patterns
             becomes difficult.
               • An iterative-improvement mining approach is more desirable:
                  • users are able to obtain the most up-to-date results on demand.
         • Dense patterns have a strong connection with the triangles inside a
           graph.
               • This has already been observed and explained by the preferential-
                 attachment property of large-scale graphs.








         DNgraph mining: A triangle based approach

         • What makes a pattern dense? Intuitively:
            • a collection of vertices with high relevance;
            • they share a large number of common neighbors.
         • With that we propose the definition of the DN-graph:
            • a DN-graph is the largest sub-graph sharing the most neighbors;
            • it requires each connected vertex pair to share at least λ neighbors
              (a small helper computing λ is sketched below).
          [Figure: example graph with vertices A, A', B, C, D, E, F; λ(G) = 3, λ(G_A') = 0]
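
          A small helper computing λ as just defined, i.e. the minimum number of shared
          neighbors over all connected vertex pairs of the (sub-)graph; this reading of
          the definition and the names are assumptions.

              def graph_lambda(adj):
                  """adj: node -> set of neighbors, restricted to the sub-graph of interest."""
                  lam = None
                  for u in adj:
                      for v in adj[u]:
                          if u < v:                              # look at each edge once
                              shared = len(adj[u] & adj[v])      # common neighbors of the pair
                              lam = shared if lam is None else min(lam, shared)
                  return 0 if lam is None else lam

              # A 4-clique: every connected pair shares the other 2 vertices
              adj = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3}}
              print(graph_lambda(adj))   # 2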




         Compare Dngraph with other dense pattern
         definition
              • Two interesting patterns
                 • A 4-clique and a Turán graph T(14, 4) [14 vertices, 4 groups, fully
                   connected between groups]
                 • If mining quasi-cliques, we may end up discovering only 1 pattern,
                   as in (d)
                 • If searching for closed cliques, we may only find (e)








         DNgraph mining: challenge

          • Finding the common neighbors of every connected vertex pair is
            expensive:
            • requires O(|E|) join operations;
            • needs random disk access;
            • in fact, finding a DN-graph is NP-hard.
          • Solution
            • Use the triangles that two vertices participate in to approximate
              the common-neighbor count.
            • Iteratively refine the approximation following the graph edges' locality.




         DNgraph mining: How

          1. Initially: count the # of triangles each edge participates in.
             • Sort the vertices and their neighbor lists in descending order of degree.
             • Scan the graph to get the # of triangles for every vertex.
             • The triangle counts set the initial value of λ.
          2. Next: iteratively refine λ for every vertex
             • using streams of triangles,
             • iteratively refining λ_cur.








         Triangle Counting: how?
         1. Sort vertices and its neighbors in descending order of
            their degrees

                                        a     bde                           e       dbacgf
                                        b     acde               Sort       d       eacgh
              a        e
                               f        c     bde                           b       edac
                                        d     acegh                         a       edb
              b                    g    e     abcdgf                        c       edb
              c
                                        f     eg                            g       edf
                       d           h
                                        g     def                           f       eg
                                        h     d                             h       d




         Triangle counting (cont.)
          1. Sort the vertices and their neighbor lists in descending order of
             their degrees.
          2. Join the neighborhoods to get the triangle count for every edge
             (sketched below).
                • The two endpoint vertices exhibit locality, due to the
                  reordering and the preferential-attachment property of large
                  graphs.
             [Figure: example graph and its sorted adjacency lists;
              edge (e, d) participates in 3 triangles]
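
          A minimal sketch of per-edge triangle counting by joining neighbor lists,
          mirroring steps 1-2 above; illustrative only (the lecture's approach
          additionally streams the triangles instead of materializing the joins).

              def triangles_per_edge(adj):
                  """adj: node -> set of neighbors (undirected). Returns {edge: #triangles}."""
                  counts = {}
                  for u in adj:
                      # visit neighbors in descending degree order, as in step 1
                      for v in sorted(adj[u], key=lambda x: len(adj[x]), reverse=True):
                          if (v, u) not in counts:                    # handle each edge once
                              counts[(u, v)] = len(adj[u] & adj[v])   # neighborhood join
                  return counts

              # The example graph from the previous slide; edge (e, d) is in 3 triangles
              adj = {'a': {'b', 'd', 'e'}, 'b': {'a', 'c', 'd', 'e'}, 'c': {'b', 'd', 'e'},
                     'd': {'a', 'c', 'e', 'g', 'h'}, 'e': {'a', 'b', 'c', 'd', 'f', 'g'},
                     'f': {'e', 'g'}, 'g': {'d', 'e', 'f'}, 'h': {'d'}}
              counts = triangles_per_edge(adj)
              print(counts.get(('d', 'e'), counts.get(('e', 'd'))))   # 3 (common neighbors a, c, g)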








         Triangle counting (cont.)
          1. Sort the vertices and their neighbor lists in
             descending order of their degrees.
          2. Join the neighborhoods to get the triangle count for
             every edge.
          3. Use that as the initial λ value for every edge/vertex.
                • A vertex's λ value is the maximal λ value of the edges it
                  participates in, e.g. λ_cur(e) = 3.
             [Figure: example graph and table of current values:
              λ_cur(e) = 3, λ_cur(d) = 3, ...]




         DNgraph mining: How (cont.)

         • Initially: count # triangles each edge participates.
         • Next, Iteratively refine λ for every vertex
           •Using streams of triangles.
           •Iterative refine λcur.








            Triangle stream

               • Follow the same order of visiting the graph as during triangle
                 counting.
                  • Triangles are not materialized, saving storage.
               [Figure: a stream of triangles (a, b, n1), (a, b, n2), ..., (a, b, nx)
                arriving for an edge (a, b) whose current bound is lambda = k]




            Iteratively refine λ

               • Follow the same order of visiting the graph as during triangle
                 counting.
                  • Triangles are not materialized, saving storage.
                  • For every vertex v, when its triangles arrive, bound λ_cur(v)
                    using the λ_cur values of the two other vertices of each triangle.








         Iteratively refine λ (cont.)
          • Initially: count the # of triangles each edge
            participates in.
          • Next: iteratively refine λ for every vertex
            • using streams of triangles,
            • iteratively refining λ_cur,
            • until all vertices' λ_cur values have converged.
            [Figure: example graph and table of current values:
             λ_cur(e) = 3, λ_cur(b) = 3, ...]




            DNgraph: Experiment
              • Large-scale graph
                 • Flickr dataset with 1,715,255 vertices and 22,613,982
                   edges.
                 • One iteration requires 1 hour on a workstation with a Quad-Core
                   AMD Opteron(tm) 8356 processor, 128GB RAM and a 700GB
                   hard disk.
                 • Converges in 66 iterations; almost stable after 35 iterations.








              Advantage

              • Abstraction
                Within the triangulation algorithm; the abstraction ensures the
                approach's extensibility to different input settings.
              • Iteratively refined results
                The estimate of the common neighborhood improves with every
                iteration, so users are able to obtain the most up-to-date
                results on demand.
              • Pre-collection of statistics to support effective buffer
                management.
              • The process can easily be mapped to key-value pairs for
                further distributed processing.




              Reference (partial)
              [Hu05] H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou. Mining coherent dense subgraphs across
                   massive biological networks for functional discovery. Bioinformatics, 21(1):213-221, 2005.
              [Ng01] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm.
                   Advances in Neural Information Processing Systems, volume 14, 2001.
              [Karypis96] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular
                   graphs. Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on
                   Supercomputing, page 35, Washington, DC, USA, 1996. IEEE Computer Society.
              [Breitkreutz03] B. J. Breitkreutz, C. Stark, and M. Tyers. Osprey: a network visualization system.
                   Genome Biology, 4, 2003.
              [Mellor04] Z. Hu, J. Mellor, J. Wu, and C. DeLisi. An online visualization and analysis tool for
                   biological interaction data. BMC Bioinformatics, 5:17-24, 2004.
              [Zeng06] J. Wang, Z. Zeng, and L. Zhou. CLAN: An algorithm for mining closed cliques from large
                   dense graph databases. Proceedings of the International Conference on Data Engineering,
                   page 73, 2006.
              [Turan41] P. Turan. On an extremal problem in graph theory. Mat. Fiz. Lapok, 48:436-452, 1941.
              [Ankerst99] M. Ankerst, M. Breunig, H. P. Kriegel, and J. Sander. OPTICS: Ordering points to
                   identify the clustering structure. Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data
                   (SIGMOD'99), pages 49-60, Philadelphia, PA, June 1999.
              [DNgraph10] On Triangle based DNgraph Mining. NUS technical report TRB4/10.



