VLDB Database School (China) 2010
August 3-7, 2010, Shenyang

Lecture Notes Part 1
Mining and Searching Complex Structures

Anthony K.H. Tung(邓锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung

Contents
Chapter 1: Introduction
Chapter 2: High Dimensional Data
Chapter 3: Similarity Search on Sequences
Chapter 4: Similarity Search on Trees
Chapter 5: Graph Similarity Search
Chapter 6: Massive Graph Mining

Chapter 1: Introduction
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
Social Network Link: http://www.renren.com/profile.do?id=313870900

What is data mining?
• Really nothing different from what scientists had been doing
[Diagram; recoverable labels: Real World; Collect data and verify or construct model of real world; Correct, useful model; Nobel Prize; Feed in data; Generate data model; Output most likely model based on some statistical measure]
• What's new? Systematically and efficiently test many statistical models

Components of data mining
• Structure of model
 –geneA=high and geneB=low ===> cancer
 –geneA, geneB and geneC exhibit strong correlation
• Statistical score for the model
 –Accuracy of rule 1 is 90%
• Similarity function: is there a sufficiently similar group of records that supports a certain model or hypothesis?
• Search method for the correct model parameters
 –Given 200 genes, there could be 2^200 rules. Which rule gives the best prediction power?
• Database access method
 –Given 1 million records, how to quickly find the relevant records to compute the accuracy of a rule?
The Apriori Algorithm
• Bottom-up, breadth-first search
• Only reads are performed on the database
• Store candidates in memory to simulate the lattice search
• Iteratively follow the two steps:
 –generate candidates
 –count and get actual frequent items
[Figure: itemset lattice over the items a, b, c, e. From the bottom: {} (start); a, b, c, e; a,b a,c a,e b,c b,e c,e; a,b,c a,b,e a,c,e b,c,e; a,b,c,e]

The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in 4 steps:
 –Step 1: Partition objects into k nonempty subsets
 –Step 2: Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
 –Step 3: Assign each object to the cluster with the nearest seed point.
 –Step 4: Go back to Step 2; stop when there are no more new assignments.
• Example
[Figure: four scatter plots on 0-10 axes illustrating successive k-means iterations]

Training Dataset (Decision Tree)
Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No

Selecting the Next Attribute
S=[9+,5-], E=0.940
• Humidity: High [3+,4-] E=0.985; Normal [6+,1-] E=0.592
• Wind: Weak [6+,2-] E=0.811; Strong [3+,3-] E=1.0
Gain(S,Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
Gain(S,Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048

Selecting the Next Attribute
S=[9+,5-], E=0.940
• Outlook: Sunny [2+,3-] E=0.971; Overcast [4+,0-] E=0.0; Rain [3+,2-] E=0.971
Gain(S,Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247

ID3 Algorithm
[D1,D2,…,D14], [9+,5-], split on Outlook:
• Sunny: Ssunny=[D1,D2,D8,D9,D11], [2+,3-], ?
• Overcast: [D3,D7,D12,D13], [4+,0-], Yes
• Rain: [D4,D5,D6,D10,D14], [3+,2-], ?
Gain(Ssunny, Humidity) = 0.970 - (3/5)0.0 - (2/5)0.0 = 0.970
Gain(Ssunny, Temp.) = 0.970 - (2/5)0.0 - (2/5)1.0 - (1/5)0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)1.0 - (3/5)0.918 = 0.019

Decision Tree for PlayTennis
Outlook
• Sunny: Humidity
 –High: No
 –Normal: Yes
• Overcast: Yes
• Rain: Wind
 –Strong: No
 –Weak: Yes

Can we fit what we learn into the framework?
                                   Apriori                         K-means                     ID3
task                               rule pattern discovery          clustering                  classification
structure of the model or pattern  association rules               clusters                    decision tree
search space                       lattice of all possible         choice of any k points      all possible combinations
                                   combinations of items,          as centers,                 of decision trees,
                                   size = 2^m                      size = infinity             size = potentially infinite
score function                     support, confidence             square error                accuracy, information gain
search/optimization method         breadth first with pruning      gradient descent            greedy
data management technique          TBD                             TBD                         TBD

Components of data mining (II)
[Diagram; labels: Models; Enumeration Algorithm; Statistical Score Function; Similarity/Search Function; Database Access Method; Database]

Background knowledge
• We assume you have some basic knowledge of data mining; the following slides are useful for this purpose
• Association Rule Mining http://www.comp.nus.edu.sg/~atung/Renmin56.pdf
• Classification and Regression http://www.comp.nus.edu.sg/~atung/Renmin67.pdf
• Clustering http://www.comp.nus.edu.sg/~atung/Renmin78.pdf

IT Trend
• Processors are cheap and will become cheaper (multi-core processors, graphics cards)
• Storage will be cheap but might not be fast
• Bandwidth will keep growing
• What can we do with this?
 –Play more realistic games! Not exactly a joke, since any technology that speeds up games can also speed up scientific simulation
 –Smarter (more intensive) computation
 –Can store more personal semantics/ontologies
 –People can collaborate more over the Internet (Flickr, Wikipedia) to make things more intelligent
 –The AI dream now has the support of much better hardware
• Essentially, data mining can be made much simpler for the man on the street
• Data mining should be human-centered, not machine-centered

What is complex data?
• What is "simple" data? A regular tabular table with a small number of attributes (of the same type) and no missing values.
• What are complex data?
 –High dimensional data: lots of attributes with different data types and with missing values
 –Sequences/time series
 –Trees
 –Graphs
[Example table with columns Test1, Gene1, Progress, comments; sample values include Pos / 2.0 / Fever / ……, Neg / -0.3 / Unconscious, and N.A / 5.7]

Why complex data?
• They come naturally in many applications
• Bring research nearer to the real world
• Lots of challenges, which means more fun!
• Some fundamental challenges:
 –How do you compare complex objects effectively and efficiently?
 –How do you find special subsets in the data that are interesting?
 –What new types of models and score functions must you use?
 –How do you handle noise and errors?
[Figure: the example table above and two trees, T1 and T2, with nodes labeled a to e]

Personalized Semantic for Personal Data Management
• Everyone will own terabytes of data soon
• Improve the query/search interface by mining and extracting personalized semantics (entities, their relationships, etc.)
 by comparing them against high-quality tagged databases
[Diagram; labels: Query by documents, Query by audio/music, Query by video, Query by photographs/images; High Quality Data Sources (Wikipedia); semantic layer (singers, authors, actor/actress, songs, papers, places, movies); Personal Data (documents, audio/music, video, photographs/images, Webpage/Blogs/Bookmarks)]

Integrated Approach to Mining Software Engineering Data
• Software engineering data (code base, change history, bug reports, runtime traces) is integrated into a data warehouse to support decision making and mining
• Example: Which code module should I modify to create a new function? Which modules need maintenance?
[Diagram; labels: software engineering tasks helped by data mining (programming, defect detection, testing, debugging, maintenance, …); association/patterns, classification, clustering, …; Data Warehouse; software engineering data (code bases, change history, program states, structural entities, bug reports/nl, …)]

WikiScience
• Web 2.0: Facebook for scientists
• Collaborative platform for scientists to build scientific models/hypotheses and share data and applications
[Diagram; labels: "Based on some articles, I make some changes to Model A to create Model B"; supporting articles tagged to Model B; Model A; Model B; Centralized, Hybrid Model C Constructed by System; supporting dataset tagged to Model A; "This is my model of the solar system based on my supporting dataset"]

Hey, why not Cloud Computing, Map/Reduce?
• These are platforms for scaling up services to a large number of users on large amounts of data
• But what exactly do you want to scale up?
• Services that provide useful and semantically correct information to the users
• We have too many scalable data mining algorithms that find nothing or too many things
• Let's focus on finding useful things first (assuming we have lots of processing power) and then try to scale it up

Schedule of the Course
Date/Time   Content
Lesson 1    Introduction
Lesson 2    Mining and Search High Dimensional Data I
Lesson 3    Mining and Search High Dimensional Data II
Lesson 4    Mining and Search High Dimensional Data III
Lesson 5    Similarity Search for Sequences and Trees I
Lesson 6    Similarity Search for Sequences and Trees II
Lesson 7    Similarity Search for Graph I
Lesson 8    Similarity Search for Graph II
Lesson 9    Similarity Search for Graph III
Lesson 10   Mining Massive Graph I
Lesson 11   Mining Massive Graph II
Lesson 12   Mining Massive Graph III

Focus of the course
• Techniques that can handle high dimensional, complex structures
 –Providing semantics to similarity search
 –Shotgun and Assembly: Column/Feature Wise Processing using Inverted Index
 –Row-wise Enumeration
 –Using local properties to infer global properties
• Throughout the course, please try to think of how these techniques are applicable across different types of complex structures

Database Queries
• To start off, we will consider something very basic called ranking queries, since we need ranking in any similarity search (usually from most similar to most dissimilar)
• In a relational database, SQL returns all results in one go
 –How many tuples can be fitted on one screen?
 –How many tuples can you remember?
• Options:
 –Summarize the results
 –Display representative tuples
• How to select representative tuples?

Retrieve Relevant Information
• Search videos related to Shanghai Expo
• Too many results: as long as you click "next", there are 20 more new results
• Are we interested in all results?
 –No, only the most relevant ones
• Search engines have to rank the results, and ranking is what they make money from

Question: How to Select a Small Result Set
• Selecting the most representative or most interesting results is not trivial
• Find an apartment with rental cheaper than 1000, the cheaper the better
 –The result tuples can be sorted in ascending order of rental price; those in front are more favorable
• Find an apartment with rental cheaper than 1000 near NEU, the cheaper the better, the nearer the better
 –An apartment with lower rent may not be near; a nearer one may not be cheap
 –Order by prices? Order by distances?

Top-k Queries
• Define a scoring function, which maps a tuple to a real number, as a score
 –The higher the score is, the more favorable the tuple is
• Define an integer k
• Answer: the k objects with the highest scores
• Different scoring functions may give different top-k results
             Price   Distance to NEU
Apartment A  $800    500 meters
Apartment B  $1200   200 meters
• Given k = 1 and preferring smaller values: if the score function is defined as the sum of price and distance, the first tuple is better (1300 vs. 1400); if it is defined as the product, the second tuple is better (240,000 vs. 400,000)

Brute Force Top-k
• Compute the score of each result tuple
• Sort the tuples in descending order of their scores
• Select the first k tuples
• What if the number of tuples is unlimited?
 –Search engines can give an unlimited number of results
• Even if the number of tuples is limited, it is too slow to compute the score of each tuple
• We have to do it efficiently

Outline
• Two well-known top-k algorithms
 –Fagin's Algorithm (FA)
 –The Threshold Algorithm (TA)
• Take random access into consideration
 –No Random Access Algorithm (NRA)
 –The Combined Algorithm (CA)

Monotonicity
• A score function f is monotone if f(x1,x2,...,xm) ≤ f(y1,y2,...,ym) whenever xi ≤ yi for every i
• Select the top-3 students with the highest total score in mathematics, physics and computer science:
 –select name, math+phys+comp as score from student order by score desc limit 3
• sum(x.math,x.phys,x.comp) ≤ sum(y.math,y.phys,y.comp) if x.math ≤ y.math and x.phys ≤ y.phys and x.comp ≤ y.comp

Sorted Lists
• We shall think of a database as consisting of m sorted lists L1, L2, …, Lm
Lmath      Lphys      Lcomp
Ann  98    Hugh 97    Kurt 96
Ben  96    Ryan 94    Ann  95
Kurt 93    Ann  92    Jane 95
Hugh 91    Kurt 91    Ben  93
Carl 90    Jane 89    Hugh 92
...        ...        ...

Fagin's Algorithm (I)
• Do sequential access on all lists in parallel until at least k objects have been seen on every list
• On the sorted lists above with k = 3, sequential accesses stop at row 5, once 3 students have been seen on all three lists, i.e.
Ann, Hugh and Kurt

Fagin's Algorithm (II)
• For each object that has been seen, do random accesses on the other lists to compute its score
• On the sorted lists above, random accesses need to be done for Ben, Carl, Jane and Ryan

Fagin's Algorithm (III)
• Select the k objects with the highest scores as the top-k result

Why is FA correct? (I)
• At least k objects have been seen on all attributes when sequential access stops
• By monotonicity, objects that have not been seen cannot have a higher score than those k objects

Why is FA correct? (II)
• For every object that has been seen, either all of its attributes have been seen, or random accesses are performed to obtain the missing attributes
• The k objects with the highest scores are therefore the top-k result

The Threshold Algorithm (I)
• Do sequential access on all lists.
• If an object is seen, do random accesses to the other lists to compute its score
• On the sorted lists above: random accesses on Ann, Hugh and Kurt first, then on Ben and Ryan

The Threshold Algorithm (II)
• Remember the k objects with the highest scores, together with their scores
 –Score (Ann) = 285
 –Score (Hugh) = 280
 –Score (Kurt) = 280

The Threshold Algorithm (III)
• Let the threshold value τ be the score function applied to the last-seen values on all sorted lists
• Halt as soon as at least k objects have been seen whose score is at least τ
Lmath      Lphys      Lcomp      threshold
Ann  98    Hugh 97    Kurt 96    τ(1) = 291
Ben  96    Ryan 94    Ann  95    τ(2) = 285
Kurt 93    Ann  92    Jane 95    τ(3) = 280
Hugh 91    Kurt 91    Ben  93
Carl 90    Jane 89    Hugh 92
...        ...        ...

Why is TA correct?
• By monotonicity, unseen objects cannot have a score higher than τ
• For objects that have been seen, random accesses are performed, so their exact scores are known; the k objects with the highest scores are therefore the top-k result
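The two algorithms above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the first five rows of each list come from the slides, while the entries below row 5 (elided as "..." on the slides) are made up so that every student has a value in every list. The names `LISTS`, `INDEX`, `fagin` and `threshold` are hypothetical, and random access is simulated with a dictionary lookup.

```python
# Rows 1-5 of each list are from the slides; the remaining rows are
# hypothetical values added only to make the lists complete.
LISTS = {
    "math": [("Ann", 98), ("Ben", 96), ("Kurt", 93), ("Hugh", 91),
             ("Carl", 90), ("Jane", 88), ("Ryan", 85)],
    "phys": [("Hugh", 97), ("Ryan", 94), ("Ann", 92), ("Kurt", 91),
             ("Jane", 89), ("Ben", 86), ("Carl", 84)],
    "comp": [("Kurt", 96), ("Ann", 95), ("Jane", 95), ("Ben", 93),
             ("Hugh", 92), ("Carl", 90), ("Ryan", 87)],
}
# Simulated random access: attribute -> {object: value}
INDEX = {attr: dict(rows) for attr, rows in LISTS.items()}
N_ROWS = len(LISTS["math"])


def score(obj):
    """Monotone score function (sum), computed through random accesses."""
    return sum(INDEX[attr][obj] for attr in LISTS)


def rank(scores, k):
    """k best (object, score) pairs; ties are broken by name."""
    return sorted(scores.items(), key=lambda p: (-p[1], p[0]))[:k]


def fagin(k):
    """FA: scan all lists in parallel until k objects were seen on every list."""
    seen = {}  # object -> set of lists on which it has been seen
    depth = 0
    while depth < N_ROWS:
        for attr, rows in LISTS.items():
            seen.setdefault(rows[depth][0], set()).add(attr)
        depth += 1
        if sum(len(s) == len(LISTS) for s in seen.values()) >= k:
            break
    # Random accesses fill in the missing attributes of every seen object.
    return rank({obj: score(obj) for obj in seen}, k), depth


def threshold(k):
    """TA: score each object when it is seen; stop when k scores reach tau."""
    best = {}  # buffer of at most k objects with their scores
    for depth in range(N_ROWS):
        for rows in LISTS.values():
            obj = rows[depth][0]
            best[obj] = score(obj)
        best = dict(rank(best, k))
        tau = sum(rows[depth][1] for rows in LISTS.values())
        if len(best) == k and min(best.values()) >= tau:
            return list(best.items()), depth + 1
    return list(best.items()), N_ROWS


print(fagin(3))      # top-3 and the number of rows FA scanned
print(threshold(3))  # top-3 and the number of rows TA scanned
```

On this data both functions return Ann, Hugh and Kurt, but TA stops after 3 rows (when τ drops to 280) while FA has to scan 5 rows before three students have been seen on all lists.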
Comparing TA with FA
• Number of sequential accesses
 –At the time FA stops sequential accesses, τ is guaranteed to be no higher than the scores of the k objects seen on all sorted lists, so TA stops no later than FA
• Number of random accesses
 –TA requires m-1 random accesses for each object
 –But FA is expected to perform random accesses on more objects
• Size of buffers used
 –The buffer used by FA can be unbounded
 –TA only needs to remember k objects with their k scores, and the threshold value τ

Random Access
• Random accesses may be impossible
 –Text retrieval: sorted lists are results of search engines
• Random accesses are expensive
 –Sequential accesses on disk are orders of magnitude faster than random accesses
• We need to consider not using random accesses at all, or using as few of them as possible

No Random Access
• Without random access, all we know are the upper bounds
Lmath      Lphys      Lcomp
Ann  98    Hugh 97    Kurt 96
Ben  96    Ryan 94    Ann  95
Kurt 93    Ann  92    Jane 95
Hugh 91    Kurt 91    Ben  93
Carl 90    Jane 89    Hugh 92
...        ...        ...
• Carl's scores on physics and computer science are not higher than 89 and 92 respectively

Lower and Upper Bounds
• If an object has not been seen on one attribute
 –The lower bound is 0
 –The upper bound is the last-seen value
• The lower bound of Carl's score on physics is 0
• The upper bound of Carl's score on physics is 89

Worst and Best Scores (I)
• W (R): the worst possible score of tuple R
• B (R): the best possible score of tuple R
• After five rows of the sorted lists: W (Carl) = 90, B (Carl) = 90 + 89 + 92 = 271

Worst and Best Scores (II)
• W (R) ≤ Score of R ≤ B (R)
• W (R) and B (R) are updated as the values of R are read by sequential access
After row 1:
    Ann   Hugh  Kurt
W   98    97    96
B   291   291   291
After row 2:
    Ann       Hugh      Kurt      Ben   Ryan
W   98→193    97        96        96    94
B   291→287   291→288   291→286   285   285
After row 3:
    Ann       Hugh      Kurt      Ben       Ryan      Jane
W   193→285   97        96→189    96        94        95
B   287→285   288→285   286→281   285→283   285→282   280

No Random Access Algorithm (I)
• Maintain the last-seen values x1,x2,…,xm
• For every seen object, maintain its worst possible score, its known attributes and their values
• After row 2 of the sorted lists: xmath = 96; xphys = 94; xcomp = 95
 –Ann:193:{<Math:98>;<Comp:95>}

No Random Access Algorithm (II)
• Why not maintain the best possible score of each object? Because it is updated too frequently: B changes for almost every object after every row
    Ann       Hugh      Kurt      Ben       Ryan      Jane
W   193→285   97        96→189    96        94        95
B   287→285   288→285   286→281   285→283   285→282   280

No Random Access Algorithm (III)
• Let M be the kth largest W value
• An object R is viable if B (R) ≥ M
After row 4 (k = 3):
    Ann   Hugh      Kurt      Ben       Ryan      Jane
W   285   97→188    189→280   96→189    94        95        M = 189
B   285   285→281   281→280   283→280   282→278   280→277
After row 5:
    Ann   Hugh      Kurt   Ben       Ryan      Jane
W   285   188→280   280    189       94        95→184    M = 280
B   285   281→280   280    280→278   278→276   277→274

No Random Access Algorithm (IV)
• Let set T contain the objects with W (R) ≥ M
• Halt when
 –There are at least k objects seen on all sorted lists
 –No viable objects are left outside set T
• In the table above, T = {Ann, Hugh, Kurt}

Why is NRA correct?
• W (R) ≤ Score of R ≤ B (R) always holds
• If an object R is not viable, then Score of R ≤ B (R) < M, so there are at least k objects with scores not lower than that of R
• Therefore, if there is no viable object outside T and T contains at least k objects, T is the top-k result

Comparing NRA with TA
• Number of sequential accesses
 –The number of sequential accesses of NRA is at least the last position of the top-k result on all attributes
• Number of random accesses
 –For NRA it is obviously 0
• Size of buffers used
 –TA remembers k objects with their k scores, and the threshold value τ
 –NRA remembers all viable objects with their scores on all seen attributes, and the last-seen value on all attributes

How deep can NRA go?
Ann  98    Hugh 97    Kurt 96
Hugh 97    Kurt 96    Ann  95
Ben  60    Ryan 60    Jane 60
Ryan 60    Ben  60    Ben  60
Carl 60    Jane 60    Carl 60
...        ...        ...
Jane 60    Carl 60    Ryan 60
Kurt 0     Ann  0     Hugh 0
• The set T can be identified quickly, but the scores of its members only become certain at the end of the lists
• If we allow a relatively small number of random accesses, scanning the entire lists can be avoided

The Combined Algorithm (I)
• CA combines TA and NRA
• cR: the cost of a random access
• cS: the cost of a sequential access
• h = ⌊cR/cS⌋
• Run NRA, but every h steps do random accesses, as in TA
• h = ∞ → never do random access; CA is then NRA

The Combined Algorithm (II)
• On the lists of "How deep can NRA go?", random accesses for Ann, Hugh and Kurt quickly find out the scores of Ann, Hugh and Kurt

The Combined Algorithm (III)
• In CA, by doing random accesses, we wish to either
 –Confirm that an object is a top-k result, or
 –Prune a viable object
• As the number of random accesses in CA is limited, various heuristics can be used to optimize CA in terms of total cost

Reference
• Ronald Fagin, Amnon Lotem, Moni Naor: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614-656 (2003)

Chapter 2: High Dimensional Data
Anthony K. H.
Tung(鄧锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
Social Network Link: http://www.renren.com/profile.do?id=313870900

Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
 –Similarity Function on k-n-match
 –ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data

Sources of High Dimensional Data
• Microarray gene expression
• Text documents
• Images
• Features of Sequences, Trees and Graphs
• Audio, Video, Human Motion Database (spatio-temporal as well!)

Challenges of High Dimensional Data
• Indistinguishable
 –The distance between the two nearest points and the distance between the two furthest points could be almost the same
• Sparsity
 –As a result of the above, the data distribution is very sparse, giving no obvious indication of where the interesting knowledge is
• Large number of combinations
 –Efficiency: How to test this number of combinations?
 –Effectiveness: How do we understand and interpret so many combinations?
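The "indistinguishable" challenge can be observed with a small experiment. The sketch below is not from the lecture: it assumes points drawn uniformly at random from the unit hypercube, Euclidean distance, and a hypothetical helper named `contrast` that returns the ratio between the farthest and the nearest distance from a random query point.

```python
import math
import random


def contrast(dim, n_points=200, seed=0):
    """Ratio farthest/nearest distance from one random query point.

    Assumption: points are uniform in the unit hypercube [0, 1]^dim.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return max(dists) / min(dists)


for dim in (2, 10, 100, 1000):
    print(dim, round(contrast(dim), 2))
```

As the number of dimensions grows, the printed ratio moves towards 1, i.e. the nearest and the farthest point become almost equally far from the query. Real data sets are rarely uniform, so this only illustrates the trend.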
Similarity Search : Traditional Approach
• Objects represented by multidimensional vectors
Elevation  Aspect  Slope  Hillshade (9am)  Hillshade (noon)  Hillshade (3pm)  …
2596       51      3      221              232               148              …
• The traditional approach to similarity search: kNN query
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
ID   d1   d2   d3   d4   d5   d6   d7   d8   d9   d10   Dist
P1   1.1  1    1.2  1.6  1.1  1.6  1.2  1.2  1    1     0.93
P2   1.4  1.4  1.4  1.5  1.4  1    1.2  1.2  1    1     0.98
P3   1    1    1    1    1    1    2    1    2    2     1.73
P4   20   20   21   20   22   20   20   19   20   20    57.7
P5   19   21   20   20   20   21   18   20   22   20    60.5
P6   21   21   18   19   20   19   21   20   20   20    59.8

Deficiencies of the Traditional Approach
• Deficiencies
 –Distance is affected by a few dimensions with high dissimilarity
 –Partial similarities cannot be discovered
• The traditional approach to similarity search: kNN query, after changing one value of each of P1, P2 and P3 to 100 (shown as old→new)
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
ID   d1   d2     d3   d4   d5   d6     d7   d8   d9     d10   Dist
P1   1.1  1→100  1.2  1.6  1.1  1.6    1.2  1.2  1      1     0.93→99.0
P2   1.4  1.4    1.4  1.5  1.4  1→100  1.2  1.2  1      1     0.98→99.0
P3   1    1      1    1    1    1      2    1    2→100  2     1.73→99.0
P4   20   20     21   20   22   20     20   19   20     20    57.7
P5   19   21     20   20   20   21     18   20   22     20    60.5
P6   21   21     18   19   20   19     21   20   20     20    59.8

Thoughts
• Aggregating too many dimensional differences into a single value results in too much information loss. Can we try to reduce that loss?
• While high dimensional data typically gives us problems when it comes to similarity search, can we turn what is against us into an advantage?
• Our approach: since we have so many dimensions, we can compute more complex statistics over these dimensions to overcome some of the "noise" introduced by scaling of dimensions, outliers, etc.

The N-Match Query : Warm-Up
• Description
 –Matches between two objects in n dimensions (n ≤ d)
 –The n dimensions are chosen dynamically to make the two objects match best
• How to define a "match"
 –Exact match
 –Match with tolerance δ
• The similarity search example
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), n = 6
ID   d1   d2   d3   d4   d5   d6   d7   d8   d9   d10   Dist
P1   1.1  100  1.2  1.6  1.1  1.6  1.2  1.2  1    1     0.2
P2   1.4  1.4  1.4  1.5  1.4  100  1.2  1.2  1    1     0.4
P3   1    1    1    1    1    1    2    1    100  2     0
P4   20   20   21   20   22   20   20   19   20   20    19
P5   19   21   20   20   20   21   18   20   22   20    19
P6   21   21   18   19   20   19   21   20   20   20    19

The N-Match Query : The Definition
• The n-match difference
 Given two d-dimensional points P(p1, p2, …, pd) and Q(q1, q2, …, qd), let δi = |pi - qi|, i=1,…,d. Sort the array {δ1, …, δd} in increasing order and let the sorted array be {δ1', …, δd'}. Then δn' is the n-match difference between P and Q.
[Figure: 2-D example with points A, B, C, D, E and a query point Q; 1-match = A, 2-match = B]
• The n-match query
 Given a d-dimensional database DB, a query point Q and an integer n (n ≤ d), find the point P ∈ DB that has the smallest n-match difference to Q. P is called the n-match of Q.
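The n-match difference and the n-match query defined above can be sketched directly from the definition. This is a minimal sketch using the six example points of the warm-up table; the function names `n_match_difference` and `n_match` and the dictionary layout of `DB` are illustrative choices, not part of the lecture.

```python
Q = [1] * 10
DB = {
    "P1": [1.1, 100, 1.2, 1.6, 1.1, 1.6, 1.2, 1.2, 1, 1],
    "P2": [1.4, 1.4, 1.4, 1.5, 1.4, 100, 1.2, 1.2, 1, 1],
    "P3": [1, 1, 1, 1, 1, 1, 2, 1, 100, 2],
    "P4": [20, 20, 21, 20, 22, 20, 20, 19, 20, 20],
    "P5": [19, 21, 20, 20, 20, 21, 18, 20, 22, 20],
    "P6": [21, 21, 18, 19, 20, 19, 21, 20, 20, 20],
}


def n_match_difference(p, q, n):
    """n-th smallest per-dimension difference |p_i - q_i| (n is 1-based)."""
    diffs = sorted(abs(a - b) for a, b in zip(p, q))
    return diffs[n - 1]


def n_match(db, q, n):
    """Identifier of the point with the smallest n-match difference to q."""
    return min(db, key=lambda pid: n_match_difference(db[pid], q, n))


for pid, point in DB.items():
    print(pid, round(n_match_difference(point, Q, 6), 2))
print("6-match of Q:", n_match(DB, Q, 6))
```

Because only the n-th smallest difference matters, the outlier value 100 in P1, P2 and P3 has no effect for n = 6, unlike in the Euclidean distance. This brute-force version reads every attribute of every point.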
• The similarity search example
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), n = 6
ID   d1   d2   d3   d4   d5   d6   d7   d8   d9   d10   Dist
P1   1.1  100  1.2  1.6  1.1  1.6  1.2  1.2  1    1     0.2
P2   1.4  1.4  1.4  1.5  1.4  100  1.2  1.2  1    1     0.4
P3   1    1    1    1    1    1    2    1    100  2     0
P4   20   20   21   20   22   20   20   19   20   20    19
P5   19   21   20   20   20   21   18   20   22   20    19
P6   21   21   18   19   20   19   21   20   20   20    19
Raising n from 6 to 8 changes the n-match difference of P1 from 0.2 to 0.6 and that of P3 from 0 to 1.

The N-Match Query : Extensions
• The k-n-match query
 Given a d-dimensional database DB, a query point Q, an integer k, and an integer n, find a set S which consists of k points from DB so that for any point P1 ∈ S and any point P2 ∈ DB-S, P1's n-match difference is smaller than P2's n-match difference. S is called the k-n-match of Q.
[Figure: 2-D example with points A, B, C, D, E and a query point Q; 2-1-match = {A,D}, 2-2-match = {A,B}]
• The frequent k-n-match query
 Given a d-dimensional database DB, a query point Q, an integer k, and an integer range [n0, n1] within [1,d], let S0, …, Si be the answer sets of k-n0-match, …, k-n1-match, respectively. Find a set T of k points so that for any point P1 ∈ T and any point P2 ∈ DB-T, P1's number of appearances in S0, …, Si is larger than or equal to P2's number of appearances in S0, …, Si.
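The two extensions can be sketched by brute force on the six example points from the slides. This is an illustrative sketch only: the names `k_n_match` and `frequent_k_n_match` are hypothetical, and ties are broken by point identifier, a choice the definitions above leave open.

```python
from collections import Counter

Q = [1] * 10
DB = {
    "P1": [1.1, 100, 1.2, 1.6, 1.1, 1.6, 1.2, 1.2, 1, 1],
    "P2": [1.4, 1.4, 1.4, 1.5, 1.4, 100, 1.2, 1.2, 1, 1],
    "P3": [1, 1, 1, 1, 1, 1, 2, 1, 100, 2],
    "P4": [20, 20, 21, 20, 22, 20, 20, 19, 20, 20],
    "P5": [19, 21, 20, 20, 20, 21, 18, 20, 22, 20],
    "P6": [21, 21, 18, 19, 20, 19, 21, 20, 20, 20],
}


def n_match_difference(p, q, n):
    """n-th smallest per-dimension difference |p_i - q_i| (n is 1-based)."""
    return sorted(abs(a - b) for a, b in zip(p, q))[n - 1]


def k_n_match(db, q, k, n):
    """The k points with the smallest n-match difference to q."""
    ranked = sorted(db, key=lambda pid: (n_match_difference(db[pid], q, n), pid))
    return ranked[:k]


def frequent_k_n_match(db, q, k, n0, n1):
    """The k points appearing most often in the k-n-match sets, n in [n0, n1]."""
    counts = Counter()
    for n in range(n0, n1 + 1):
        counts.update(k_n_match(db, q, k, n))
    ranked = sorted(db, key=lambda pid: (-counts[pid], pid))
    return ranked[:k]


print(k_n_match(DB, Q, 2, 6))
print(frequent_k_n_match(DB, Q, 3, 1, 10))
```

With n ranging over [1, 10], P1, P2 and P3 are in the answer set for every n except n = 10, where their outlier dimension finally counts, so they still win the frequency vote. Using a range of n values avoids having to pick a single "right" n.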
Cost Model
• The multiple system information retrieval model
  –Objects are stored in different systems and scored by each system
  –Each system can sort the objects according to their scores
  –A query retrieves the scores of objects from different systems and then combines them using some aggregation function
  Q: color="red" & shape="round" & texture="cloud"
  Object ID | System 1: Color | System 2: Shape | System 3: Texture
  1 | 0.4 | 1.0 | 1.0
  2 | 2.8 | 5.5 | 2.0
  3 | 6.5 | 7.8 | 5.0
  4 | 9.0 | 9.0 | 9.0
  5 | 3.5 | 1.5 | 8.0
  Sorted by score within each system (object ID: score):
  System 1 (Color): 1: 0.4, 2: 2.8, 5: 3.5, 3: 6.5, 4: 9.0
  System 2 (Shape): 1: 1.0, 5: 1.5, 2: 5.5, 3: 7.8, 4: 9.0
  System 3 (Texture): 1: 1.0, 2: 2.0, 3: 5.0, 5: 8.0, 4: 9.0
• The cost
  –Retrieval of scores: proportional to the number of scores retrieved
• The goal
  –To minimize the number of scores retrieved

The AD Algorithm
• The AD algorithm for the k-n-match query
  –Locate the query's attribute value in every dimension
  –Retrieve the objects' attribute values starting from the query's attribute value, in both directions
  –The objects' attributes are retrieved in Ascending order of their Differences to the query's attributes. An n-match is found when an object appears n times.
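The steps above can be sketched as a small in-memory routine. This is a minimal sketch under stated assumptions, not the authors' code: the name `ad_k_n_match`, the heap that merges the 2d cursors, and the returned counter are our own choices. The five objects and the query are those of the slides' three-system example.

```python
import bisect
import heapq


def ad_k_n_match(db, q, k, n):
    """Return (ids of the k-n-match in discovery order, attributes consumed)."""
    d = len(q)
    if not 1 <= n <= d:
        raise ValueError("n must satisfy 1 <= n <= d")
    # One column per dimension, sorted by attribute value.
    columns = [sorted((vec[i], pid) for pid, vec in db.items()) for i in range(d)]
    heap = []  # entries: (difference, dimension, direction, position, object id)

    def push(dim, pos, step):
        # Queue the attribute at `pos` of column `dim`, if it exists.
        if 0 <= pos < len(columns[dim]):
            value, pid = columns[dim][pos]
            heapq.heappush(heap, (abs(value - q[dim]), dim, step, pos, pid))

    for dim in range(d):
        values = [value for value, _ in columns[dim]]
        start = bisect.bisect_left(values, q[dim])  # locate the query value
        push(dim, start - 1, -1)  # cursor moving towards smaller values
        push(dim, start, +1)      # cursor moving towards larger values

    appear = {pid: 0 for pid in db}
    answer, consumed = [], 0
    while heap and len(answer) < k:
        _, dim, step, pos, pid = heapq.heappop(heap)  # smallest difference next
        consumed += 1
        appear[pid] += 1
        if appear[pid] == n:  # n-th appearance: its n-match difference is known
            answer.append(pid)
        push(dim, pos + step, step)  # advance the cursor that was just used
    return answer, consumed


OBJECTS = {
    1: (0.4, 1.0, 1.0),
    2: (2.8, 5.5, 2.0),
    3: (6.5, 7.8, 5.0),
    4: (9.0, 9.0, 9.0),
    5: (3.5, 1.5, 8.0),
}
QUERY = (3.0, 7.0, 4.0)

if __name__ == "__main__":
    print(ad_k_n_match(OBJECTS, QUERY, k=2, n=2))
```

For the 2-2-match of this query the sketch finds object 3 first and then object 2, after consuming five attributes in ascending order of difference (0.2, 0.5, 0.8, 1.0, 1.5). The heap also holds one look-ahead attribute per cursor; whether those count towards the retrieval cost depends on how the cost model is applied.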
Example: finding the 2-2-match of Q = (3.0, 7.0, 4.0)
Attributes sorted per dimension (object ID: value), with the query position marked:
  d1 (System 1: Color): 1: 0.4, 2: 2.8, [Q: 3.0], 5: 3.5, 3: 6.5, 4: 9.0
  d2 (System 2: Shape): 1: 1.0, 5: 1.5, 2: 5.5, [Q: 7.0], 3: 7.8, 4: 9.0
  d3 (System 3: Texture): 1: 1.0, 2: 2.0, [Q: 4.0], 3: 5.0, 5: 8.0, 4: 9.0
Auxiliary structures
• Next attribute to retrieve, g[2d], as (object ID, difference) per dimension and direction
  –Initially: d1: (2, 0.2), (5, 0.5); d2: (2, 1.5), (3, 0.8); d3: (2, 2.0), (3, 1.0)
  –Entries that replace retrieved ones: d1: (1, 2.6), (3, 3.5); d2: (4, 2.0); d3: (5, 4.0)
• Number of appearances, appear[c], for objects 1 to 5: initially 0, 0, 0, 0, 0; finally 0, 2, 2, 0, 1
• Answer set S: {3, 2}

The AD Algorithm: Extensions
• The AD algorithm for the frequent k-n-match query
  –The frequent k-n-match query
    • Given an integer range [n0, n1], find the k-n0-match, k-(n0+1)-match, ..., k-n1-match of the query: S0, S1, ..., Si.
    • Find the k objects that appear most frequently in S0, S1, ..., Si.
  –Retrieves the same number of attributes as processing a k-n1-match query.
• Disk based solutions for the (frequent) k-n-match query
  –Disk based AD algorithm
    • Sort each dimension and store them sequentially on the disk
    • When reaching the end of a disk page, read the next page from disk
  –Existing indexing techniques
    • Tree-like structures: R-trees, k-d-trees
    • Mapping based indexing: space-filling curves, iDistance
    • Sequential scan
    • Compression based approach (VA-file)

Experiments: Effectiveness
• Searching by k-n-match
  –COIL-100 database
  –54 features extracted, such as color histograms, area moments
  k-n-match query, k = 4
  n | Images returned
  5 | 36, 42, 78, 94
  10 | 27, 35, 42, 78
  15 | 3, 38, 42, 78
  20 | 27, 38, 42, 78
  25 | 35, 40, 42, 94
  30 | 10, 35, 42, 94
  35 | 35, 42, 94, 96
  40 | 35, 42, 94, 96
  45 | 35, 42, 94, 96
  50 | 35, 42, 94, 96
  kNN query
  k | Images returned
  10 | 13, 35, 36, 40, 42, 64, 85, 88, 94, 96
• Searching by frequent k-n-match
  –UCI Machine learning repository
  –Competitors: IGrid, Human-Computer Interactive NN search (HCINN)
  Data sets (d) | IGrid | HCINN | Freq. k-n-match
  Ionosphere (34) | 80.1% | 86% | 87.5%
  Segmentation (19) | 79.9% | 83% | 87.3%
  Wdbc (30) | 87.1% | N.A. | 92.5%
  Glass (9) | 58.6% | N.A. | 67.8%
  Iris (4) | 88.9% | N.A. | 89.6%

Experiments: Efficiency
• Disk based algorithms for the frequent k-n-match query
  –Texture dataset (68,040 records); uniform dataset (100,000 records)
  –Competitors:
    • The AD algorithm
    • VA-file
    • Sequential scan

Experiments: Efficiency (continued)
• Comparison with other similarity search techniques
  –Texture dataset; synthetic dataset
  –Competitors:
    • Frequent k-n-match query using the AD algorithm
    • IGrid
    • Scan

Future Work (I)
• We now have a natural way to handle similarity search for data with categorical and numerical attributes. Investigating k-n-match performance on such mixed-type data is currently under way.
• Likewise, applying k-n-match on data with missing or uncertain attributes will be interesting.
• Query = {1, 1, 1, 1, 1, 1, 1, M, No, R}
  ID | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9 | d10
  P1 | 1.1 | 1 | 1.2 | 1.6 | 1.1 | 1.6 | 1.2 | M | Yes | R
  P2 | 1.4 | 1.4 | 1.4 | 1.5 | 1.4 | 1 | 1.2 | F | No | B
  P3 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | M | No | B
  P4 | 20 | 20 | 21 | 20 | 22 | 20 | 20 | M | Yes | G
  P5 | 19 | 21 | 20 | 20 | 20 | 21 | 18 | F | Yes | R
  P6 | 21 | 21 | 18 | 19 | 20 | 19 | 21 | F | Yes | Y
  [Table: the same data with some attribute values missing]

Future Work (II)
• In general, three things affect the result of a similarity search: noise, scaling and axes orientation. K-n-match reduces the effect of noise. The ultimate aim is to have a similarity function that is robust to noise, scaling and axes orientation.
• Eventually we will look at creating mining algorithms using k-n-match.

Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
  –Similarity Function on k-n-match
  –ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data

Motivation
[Figure: a query is posed against large data sets and results are returned]
• Ever-increasing data collection rates of modern enterprises and the need for effective, guaranteed-quality approximate answers to queries
• Concern: compress as much as possible.

Conventional Compression Method
• Try to find the optimal encoding of arbitrary strings for the input data:
  –Huffman Coding
  –Lempel-Ziv Coding (gzip)
• View the whole table as a large byte string
• Statistical or dictionary based
• Operate at the byte level

Why not just "syntactic"?
• Do not exploit the complex dependency patterns in the table
• Individual retrieval of a tuple is difficult
• Do not utilize lossy compression

Semantic compression methods
• Derive a descriptive model M
• Identify the data values which can be derived from M (within some error tolerance), which are essential for deriving, and which are the outliers
• Derived values need not be stored; only the outliers need to be

Advantages
• More Complex Analysis
  –Example: detect correlation among columns
• Fast Retrieval
  –Tuple-wise access
• Query Enhancement
  –Possible to answer queries directly from the discovered semantics
  –Compress in a way that enhances the answering of some complex queries, e.g. "Go Green: Recycle and Reuse Frequent Patterns", C. Gao, B. C. Ooi, K. L. Tan and A. K. H. Tung. ICDE'2004.
• Choose a combination of compression methods based on semantic and syntactic information

Fascicles
• Key observation
  –Often, numerous subsets of records in T have similar values for many attributes
• Compress data by storing representative values (e.g., "centroid") only once for each attribute cluster
• Lossy compression: information loss is controlled by the notion of "similar values" for attributes (user-defined)
  Protocol | Duration | Bytes | Packets
  http | 12 | 20K | 3
  http | 16 | 24K | 5
  http | 15 | 20K | 8
  http | 19 | 40K | 11
  http | 26 | 58K | 18
  ftp | 27 | 100K | 24
  ftp | 32 | 300K | 35
  ftp | 18 | 80K | 15

ItCompress: Compression Format
Original Table
  age | salary | credit | sex
  20 | 30k | poor | M
  25 | 76k | good | F
  30 | 90k | good | F
  40 | 100k | poor | M
  50 | 110k | good | F
  60 | 50k | good | M
  70 | 35k | poor | F
  75 | 15k | poor | M
Representative Rows (Patterns)
  RRid | age | salary | credit | sex
  1 | 30 | 90k | good | F
  2 | 70 | 35k | poor | M
Error Tolerance
  age | salary | credit | sex
  5 | 25k | 0 | 0
Compressed Table (one row per original row, in the same order)
  RRid | bitmap | Outlying value
  2 | 0111 | 20
  1 | 1111 |
  1 | 1111 |
  1 | 0100 | 40, poor, M
  1 | 0111 | 50
  1 | 0010 | 60, 50k, M
  2 | 1110 | F
  2 | 1111 |

Some definitions
• Error tolerance
  –Numeric attributes
    • The upper bound by which x' can differ from x
    • x ∈ [x' − ei, x' + ei]
  –Categorical attributes
    • The upper bound on the probability that the compressed value differs from the actual value
    • Given an actual value x and its error tolerance ei, the compressed value x' should satisfy: Prob(x = x') ≥ 1 − ei

Some definitions
• Coverage
  –Let R be a row in the table T, and Pi be a pattern
  –The coverage of Pi on R: cov(Pi, R) = number of attributes Xi in which R[Xi] is matched by Pi[Xi]
• Total coverage
  –Let P be a set of patterns P1, …, Pk, and let the table T contain n rows R1, …, Rn
  –totalcov(P, T) = Σ_{i=1..n} cov(Pmax(Ri), Ri), where Pmax(Ri) is the pattern in P with the largest coverage on Ri

ItCompress: basic algorithm
• First randomly choose k rows as initial patterns
• Phase 1: scan the table T
  –For each row R, compute the coverage of each pattern on it, then find Pmax(R)
  –Allocate R to the pattern that covers it most
• Phase 2: after each iteration, re-compute all patterns' attributes, always using the most frequent values
• Iterate until the total coverage does not increase

Example: the 1st iteration begins
  The original table and the error tolerance are as in the compression format example. The first two rows are chosen as the initial patterns:
  RRid | age | salary | credit | sex
  1 | 20 | 30k | poor | M
  2 | 25 | 76k | good | F

Example: Phase 1
  Rows allocated to pattern 1 (20, 30k, poor, M):
  age | salary | credit | sex
  20 | 30k | poor | M
  40 | 100k | poor | M
  60 | 50k | good | M
  70 | 35k | poor | F
  75 | 15k | poor | M
  Rows allocated to pattern 2 (25, 76k, good, F):
  age | salary | credit | sex
  25 | 76k | good | F
  30 | 90k | good | F
  50 | 110k | good | F

Example: Phase 2
  The patterns are re-computed from the rows allocated to them:
  RRid | age | salary | credit | sex
  1 | 70 | 30k | poor | M
  2 | 25 | 90k | good | F

Convergence (I)
• Phase 1:
  –When we assign the rows to the patterns that cover them most:
    • For each row, the coverage increases or stays the same
    • So the total coverage also increases or stays the same
• Phase 2:
  –When we re-compute the attribute values for the patterns:
    • For each pattern, the coverage increases or stays the same
    • So the total coverage also increases or stays the same

Convergence (II)
• In both Phase 1 and Phase 2, the total coverage either increases or stays the same, and it has an obvious upper bound (covering the whole table)
• The algorithm will converge eventually

Complexity
• Phase 1:
  –In l iterations, we need to go through the n rows in the table and match each row against the k patterns (2m comparisons)
  –The running time complexity is O(kmnl), where m is the number of attributes
• Phase 2:
  –Computing each new pattern Pi requires going through all the domain values/intervals of each attribute
  –Assuming the total number of domain values/intervals is d, the running time complexity is O(kdl)
• The total time complexity is O(kmnl + kdl)

Advantages of ItCompress
• Simplicity and Directness
  –Fascicles and SPARTAN use a two-phase process
    • Find rules/patterns
    • Compress the database using the discovered rules/patterns
  –ItCompress optimizes the compression directly without finding rules/patterns that may not be useful (a.k.a. microeconomic approach)
• Fewer constraints
  –Does not need patterns to be matched completely or rules that apply globally
• Easily tuned parameters

Performance Comparison
• Algorithms
  –ItCompress, ItCompress+gzip
  –Fascicles, Fascicles+gzip
  –SPARTAN+gzip
• Platform
  –ItCompress, Fascicles: AMD Duron 700MHz, 256MB memory
  –SPARTAN: four 700MHz Pentium CPUs, 1GB memory
• Datasets
  –Corel: 32 numeric attributes, 35000 rows, 10.5MB
  –Census: 7 numeric, 7 categorical, 676000 rows, 28.6MB
  –Forest-cover: 10 numeric, 44 categorical, 581000 rows, 75.2MB

[Figures: Effectiveness (Corel); Effectiveness (Census); Effectiveness (Forest Cover); Efficiency; Varying k; Varying Sample Ratio; Adding Noises (Census)]

Effect of Corruption
[Figure: effect of 20% corruption over attributes A1 to A12]
Findings
• ItCompress is
  –More efficient than SPARTAN
  –More effective than Fascicles
  –Insensitive to parameter setting
  –Robust to noise

Future work
• Can we perform mining on the compressed datasets using only the patterns and the bitmap?
  –Example: Building Bayesian Belief Network
• Is ItCompress a good "bootstrap" semantic compression algorithm?
[Figure: diagram relating the database, ItCompress, the compressed database, and other semantic compression algorithms]

Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
  –Similarity Function on k-n-match
  –ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data

Metric vs. Non-Metric
• Euclidean distance dominates DB queries
• Similarity in human perception
• Metric distance is not enough!

Bregman Divergence
[Figure: a convex function f(x) with the points (p, f(p)) and (q, f(q)), the tangent hyperplane h, the Bregman divergence Df(p,q), and the Euclidean distance between p and q]
Bregman Divergence
• Mathematical Interpretation
  –The distance between p and q is defined as the difference between f(p) and the first order Taylor expansion of f at q, evaluated at p:
   Df(p, q) = f(p) − [ f(q) + ⟨∇f(q), p − q⟩ ]

Bregman Divergence
• General Properties
  –Non-Negativity
    • Df(p,q) ≥ 0 for any p, q
  –Identity of Indiscernibles
    • Df(p,p) = 0 for any p
  –Symmetry and Triangle Inequality
    • Do NOT hold any more

Examples
  Distance | f(x) | Df(p,q) | Usage
  KL-Divergence | x log x | p log(p/q) | distribution, color histogram
  Itakura-Saito Distance | −log x | p/q − log(p/q) − 1 | signal, speech
  Squared Euclidean | x² | (p − q)² | Euclidean space
  Von Neumann Entropy | tr(X log X − X) | tr(X log X − X log Y − X + Y) | symmetric matrix

Why in DB system?
• Database application
  –Retrieval of similar images, speech signals, or time series
  –Optimization on matrices in machine learning
  –Efficiency is important!
• Query Types
  –Nearest Neighbor Query
  –Range Query

Euclidean Space
• How to answer the queries
  –R-Tree
  [Figure: R-Tree]

Euclidean Space
• How to answer the queries
  –VA File
  [Figure: VA File]

Our goal
• Re-use the infrastructure of existing DB systems to support Bregman divergence
  –Storage management
  –Indexing structures
  –Query processing algorithms

Basic Solution
• Extended Space
  –Convex function f(x) = x²
  Original points
  point | D1 | D2
  p | 0 | 1
  q | 0.5 | 0.5
  r | 1 | 0.8
  t | 1.5 | 0.3
  Extended points
  point | D1 | D2 | D3
  p+ | 0 | 1 | 1
  q+ | 0.5 | 0.5 | 0.5
  r+ | 1 | 0.8 | 1.64
  t+ | 1.5 | 0.3 | 3.15

Basic Solution
• After the extension
  –Index extended points with R-Tree or VA File
  –Re-use existing algorithms with lower and upper bounds on the rectangles

How to improve?
• Reformulation of Bregman divergence
• Tighter bounds are derived
• No change to index construction or query processing algorithms

A New Formulation
[Figure: hyperplanes h and h', the query vector vq, the points p and q, and the quantities Df(p,q) + Δ and D*f(p,q)]

Math. Interpretation
• Reformulation of similarity search queries
  –k-NN query: query q, data set P, divergence Df
    • Find the point p minimizing [formula]
  –Range query: query q, threshold θ, data set P
    • Return any point p such that [formula]

Naïve Bounds
• Check the corners of the bounding rectangles

Tighter Bounds
• Take the curve f(x) into consideration

Query distribution
• Distortion of rectangles
  –The difference between the maximum and minimum distances from inside the rectangle to the query

Can we improve it more?
• When building an R-Tree in Euclidean space
  –Minimize the volume/edge length of MBRs
  –Does it remain valid?

Query distribution
• Distortion of bounding rectangles
  –Invariant in Euclidean space (triangle inequality)
  –Query-dependent for Bregman divergence

Utilize Query Distribution
• Summarize the query distribution with O(d) real numbers
• Estimate the expected distortion of any bounding rectangle in O(d) time
• Allows a better index to be constructed for both R-Tree and VA File

Experiments
• Data Sets
  –KDD'99 data
    • Network data, the proportion of packets in 72 different TCP/IP connection types
  –DBLP data
    • Use the co-authorship graph to generate the probabilities of the authors being related to 8 different areas
  –Uniform Synthetic data
    • Generate synthetic data with uniform distribution
  –Clustered Synthetic data
    • Generate synthetic data with a Gaussian Mixture Model

Experiments
• Methods to compare
  Index | Basic | Improved Bounds | Query Distribution
  R-Tree | R | R-B | R-BQ
  VA File | V | V-B | V-BQ
  Linear Scan | LS | |
  BB-Tree | BBT | |

Existing Solution
• BB-Tree (L. Cayton, ICML 2008)
  –Memory-based indexing tree
  –Constructed with k-means clustering
  –Hard to update
  –Ineffective in high-dimensional space

[Figures: Experiments: Index Construction Time; Varying dimensionality; Varying dimensionality (cont.); Varying data cardinality]

Conclusion
• A general technique for similarity search on Bregman divergence
• All techniques are based on the existing infrastructure of commercial databases
• Extensive experiments to compare performance with R-Tree and VA File under different optimizations

Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
  –Similarity Function on k-n-match
  –ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data

Motivation
• Probabilistic data is ubiquitous
  –To represent data uncertainty (WSN, RFID, moving object monitoring)
  –To compress data (image processing)
• A histogram is a good way to represent probabilistic data
  –Easy to capture
  –Very useful in image representation
    • Colors
    • Textures
    • Gradient
    • Depth

Motivation
• Similarity search is important for managing probabilistic data
  –Given a threshold θ, can answer which sensors' readings are similar to those of sensor A (range query)
  –Can answer which k pictures are most similar (top-k query)
• The similarity function for probabilistic data should be carefully chosen
  –Bin-by-bin methods
    • L1 and L2 norms
    • χ² distance
  –Cross-bin methods
    • Earth Mover's Distance (EMD)
    • Quadratic form

Outline
• Motivation
• Introduction to Earth Mover's Distance (EMD)
• Related works
• Indexing the probabilistic data based on EMD
• Experimental results
• Conclusion and future work

Introduction to Earth Mover's Distance
• Bin-by-bin vs. cross-bin
  –Bin-by-bin: not good!
  –Cross-bin: good! Can handle distribution shift

Introduction to Earth Mover's Distance
• What is EMD?
  –Earth (soil)
  –Mover (moving)
  –Distance (cost)
  –Can be understood as the cost of moving soil
• See an example…

Moving Earth
[Figures: three pairs of histograms; the first two pairs are marked ≠ and the third is marked =]

The Difference?
• Difference = (amount moved)

The Difference?
• Difference = (amount moved) * (distance moved)

Linear programming
• P has m bins (clusters) and Q has n bins (clusters)
• Let f_ij be the amount moved from bin i of P to bin j of Q, and d_ij the distance moved between the two bins
• Over all movements, the work to minimize is Σ_{i=1..m} Σ_{j=1..n} d_ij * f_ij, i.e. (distance moved) * (amount moved)

Constraints
Let w_i^P be the amount of "earth" in bin i of P and w_j^Q the capacity of bin j of Q.
1. Move "earth" only from P to Q: f_ij ≥ 0
2. Cannot send more "earth" than there is: Σ_j f_ij ≤ w_i^P
3. Q cannot receive more "earth" than it can hold: Σ_i f_ij ≤ w_j^Q
4. As much "earth" as possible must be moved: Σ_i Σ_j f_ij = min(Σ_i w_i^P, Σ_j w_j^Q)

The Formal Definition of EMD
• Earth Mover's Distance (EMD)
  –The minimum amount of work needed to change one histogram into another
• Challenge of EMD
  –The computation cost is O(N^3 log N)

Related Works
• Filter-and-refine framework
  –[1] Approximation Techniques for Indexing the Earth Mover's Distance in Multimedia Databases. ICDE 2006
    • Cannot handle high dimensional histograms
  –[2] Efficient EMD-based Similarity Search in Multimedia Databases via Flexible Dimensionality Reduction. SIGMOD 2008
    • Based on a scan framework, which limits scalability
• Use a scanning scheme to process queries
  –Merit: can obtain a good access order when executing k-NN queries and thus can minimize the number of candidates
  –Demerit: needs to scan the whole dataset to obtain the order, which leads to low scalability

Related Works
• Related works
  –Based on the filter-and-refine framework
  –Based on a scanning method, with low scalability
• Our work
  –Also based on the filter-and-refine method
  –But avoids scanning the whole data set
    • Uses B+ trees
    • And thus can obtain high scalability
• Our contributions
  –To the best of our knowledge, the first paper to index high dimensional probabilistic data based on the EMD
  –Proposed algorithms for processing similarity queries based on a B+ tree filter
  –Improved the efficiency and scalability of EMD-based similarity search

Indexing the probabilistic data based on EMD
• Our intuition:
  –Primal-dual theory in linear programming
    • Primal problem (EMD): [formula]
    • Dual problem: [formula]

Indexing the probabilistic data based on EMD
• Good properties of the dual space
  –The constraints of the dual space are independent of the probabilistic data points (i.e., p and q in this example)
    • Thus, given any feasible solution (π, Ф) in the dual space, we can derive a lower bound for EMD(p, q)
    • The lower bound can help to filter out the histograms that are not hits
  –Given any feasible solution (π, Ф) in the dual space, a histogram p can be mapped to a single value, using the operation of [formula]
    • Can index histograms using a B+ tree

Indexing the probabilistic data based on EMD
• 1. Mapping Construction
  –Key and counter key
    • Key: [formula]
    • Counter key: [formula]
  –Assuming p is a histogram in DB, given a feasible solution (π, Ф), we calculate the key for each record in DB
  –We can index those keys using a B+ tree
  –For each feasible solution (π, Ф), a B+ tree can be constructed

Answering Range Query
• Range query based on B+ index
  –Given any feasible solution (π, Ф), we construct a B+ tree using the keys of the histograms
  –Given a query histogram, we calculate its counter key using the operation of [formula]
  –Given a similarity search threshold θ, we have proved that the keys of all candidate histograms can be bounded by [formula]
  –To further filter the candidates, we use L B+ trees and intersect their candidate results

Answering KNN Query
• K-NN query based on B+ index
  –Given a query q, we issue a search on each B+ tree Tl with key(q, Фl)
  –We create two cursors for each tree and let them fetch records in different directions (one left and one right)
  –Whenever a record r has been accessed by all B+ trees, it can be output as a candidate for the k-NN query

Experimental Setup
• 3 real data sets
  –RETINA1
    • An image data set consisting of 3932 feline retina scans labeled with various antibodies.
  –IRMA
    • Contains 10000 radiography images from the Image Retrieval in Medical Application (IRMA) project
  –DBLP
• With parameter setting
  [Table: parameter settings]

[Figures: Experimental Results on Query CPU Time; Experimental Results on Scalability (SIGMOD method vs. ours)]

Conclusions
• We present a new indexing scheme for general-purpose similarity search on Earth Mover's Distance
• Our index method relies on the primal-dual theory to construct mapping functions from the original probabilistic space to a one-dimensional domain
• Our B+ tree-based index framework has
  –High scalability
  –High efficiency
  –The ability to handle high dimensional data

Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
  –Similarity Function on k-n-match
  –ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data

A Microarray Dataset
[Table: 100-500 rows (Sample1, Sample2, …, SampleN-1, SampleN) by 1000-100,000 columns (Gene1, Gene2, Gene3, …), plus a Class column whose values are Cancer or ~Cancer]
• Find closed patterns which occur frequently among genes.
• Find rules that associate certain combinations of columns with the class of the rows
– Gene1, Gene10, Gene1001 -> Cancer

Challenge I
• Large number of patterns/rules
– the number of possible column combinations is extremely high
• Solution: the concept of a closed pattern
– Patterns that are found in exactly the same set of rows are grouped together and represented by their upper bound
• Example: the following patterns are found in rows 2, 3 and 4

i   ri                 Class
1   a,b,c,l,o,s        C
2   a,d,e,h,p,l,r      C
3   a,c,e,h,o,q,t      C
4   a,e,f,h,p,r        ~C
5   b,d,f,g,l,q,s,t    ~C

Upper bound (closed pattern): aeh
Other patterns in the group: ae, ah, eh
Lower bounds: e, h
"a", however, is not part of the group (it also occurs in row 1)

Challenge II
• Most existing frequent pattern discovery algorithms search the column/item enumeration space, i.e. they systematically test various combinations of columns/items
• For datasets with 1000-100,000 columns, this search space is enormous
• Instead we adopt a novel row/sample enumeration algorithm for this purpose.
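The closed-pattern example from Challenge I can be checked mechanically. The sketch below is only an illustration on the five-row example table: it enumerates every row subset naively (exponential, no pruning), so it is not CARPENTER itself. The names `ROWS`, `supporting_rows`, `closure` and `closed_patterns_by_row_enumeration` are hypothetical helpers introduced here.

```python
from itertools import combinations

# Example table from the Challenge I slide: row id -> set of items.
ROWS = {
    1: set("abclos"),
    2: set("adehplr"),
    3: set("acehoqt"),
    4: set("aefhpr"),
    5: set("bdfglqst"),
}

def supporting_rows(items):
    """Rows that contain every item of the pattern."""
    return {rid for rid, row in ROWS.items() if set(items) <= row}

def closure(row_ids):
    """Items shared by all the given rows (the upper bound of their group)."""
    return set.intersection(*(ROWS[r] for r in row_ids))

def closed_patterns_by_row_enumeration():
    """Naively enumerate every non-empty row subset and keep its closure."""
    found = {}
    for size in range(1, len(ROWS) + 1):
        for row_ids in combinations(ROWS, size):
            pattern = frozenset(closure(row_ids))
            if pattern:  # skip row sets that share no item
                found[pattern] = supporting_rows(pattern)
    return found

patterns = closed_patterns_by_row_enumeration()
for pattern, rows in sorted(patterns.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(pattern)), sorted(rows))
```

With only 5 rows there are 31 non-empty row subsets to try, whereas the 20 distinct items would give about a million column combinations; this is the asymmetry that row enumeration exploits on microarray data.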
CARPENTER (SIGKDD'03) is the first algorithm to adopt row/sample enumeration.

Column/Item Enumeration Lattice
• Each node in the lattice represents a combination of columns/items
• An edge exists from node A to node B if A is a subset of B and A differs from B by only 1 column/item
• Search can be done breadth first
[Figure: lattice over the items of the example table, from the start node {} through single items (a, b, c, ...), pairs (a,b; a,c; a,e; b,c; ...) and triples (a,b,c; a,b,e; a,c,e; ...) up to a,b,c,e, shown beside the five-row example table]

Column/Item Enumeration Lattice
• Each node in the lattice represents a combination of columns/items
• An edge exists from node A to node B if A is a subset of B and A differs from B by only 1 column/item
• Search can be done depth first
• Keep an edge from parent to child only if the parent is a prefix of the child
[Figure: the same lattice, restricted to prefix edges]

General Framework for Column/Item Enumeration
Approaches: read-based, write-based, point-based (entries below are listed in that column order, separated by semicolons)
• Association Mining: Apriori [AgSr94], DIC; Eclat, MaxClique [Zaki01], FPGrowth [HaPe00]; Hmine
• Sequential Pattern Discovery: GSP [AgSr96]; SPADE [Zaki98, Zaki01], PrefixSpan [PHPC01]
• Iceberg Cube: Apriori [AgSr94]; BUC [BeRa99], H-Cubing [HPDW01]

A Multidimensional View
[Figure: three axes. Types of data or knowledge: associative pattern, sequential pattern, iceberg cube, others. Pruning method: constraints, other interest measures, compression method, closed/max pattern. Lattice traversal/main operations: read, write, point.]

Sample/Row Enumeration Algorithms
• To avoid searching the large column/item enumeration space, our mining algorithms search for patterns/rules in
the sample/row enumeration space
• Our algorithms do not fit into the column/item enumeration framework
• They are not YAARMA (Yet Another Association Rules Mining Algorithm)
• Column/item enumeration algorithms simply do not scale for microarray datasets

Existing Row/Sample Enumeration Algorithms
• CARPENTER (SIGKDD'03)
– Finds closed patterns using row enumeration
• FARMER (SIGMOD'04)
– Finds interesting rule groups and builds classifiers based on them
• COBBLER (SSDBM'04)
– Combines row and column enumeration for tables with a large number of rows and columns
• Topk-IRG (SIGMOD'05)
– Finds top-k covering rules for each sample and builds a classifier directly
• Efficiently Finding Lower Bound Rules (TKDE 2010)
– Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao. What is Unequal among the Equals? Ranking Equivalent Rules from Gene Expression Data. Accepted in TKDE

Concepts of CARPENTER
Example table:

i   ri                 Class
1   a,b,c,l,o,s        C
2   a,d,e,h,p,l,r      C
3   a,c,e,h,o,q,t      C
4   a,e,f,h,p,r        ~C
5   b,d,f,g,l,q,s,t    ~C

Transposed table, TT (for each item ij, the rows R(ij) that contain it, split by class):

ij   C       ~C
a    1,2,3   4
b    1       5
c    1,3
d    2       5
e    2,3     4
f            4,5
g            5
h    2,3     4
l    1,2     5
o    1,3
p    2       4
q    3       5
r    2       4
s    1       5
t    3       5

TT|{2,3} (the items contained in both row 2 and row 3):

ij   C       ~C
a    1,2,3   4
e    2,3     4
h    2,3     4

Row Enumeration
[Figure: row enumeration tree. Each node is a set of rows labeled with the items common to all of them, among others: 1 {abclos}, 2 {adehplr}, 3 {acehoqt}, 4 {aefhpr}, 5 {bdfglqst}; 12 {al}, 13 {aco}, 14 {a}, 15 {bls}, 23 {aeh}, 24 {aehpr}, 25 {dl}, 34 {aeh}, 45 {f}; 123 {a}, 124 {a}, 125 {l}, 134 {a}, 234 {aeh}; 1234 {a}; 12345 {}. The conditional transposed tables TT|{1}, TT|{12} and TT|{123} are shown beside the corresponding nodes.]
Pruning Method 1
• Removing rows that appear in all tuples of the transposed table will not affect the results

TT|{2,3}:
ij   C       ~C
a    1,2,3   4
e    2,3     4
h    2,3     4

[Figure: node r2 r3 {aeh} and its child r2 r3 r4 {aeh}]
r4 has 100% support in the conditional table of "r2 r3", therefore branch "r2 r3 r4" will be pruned.

Pruning Method 2
• If a rule has been discovered before, we can prune the enumeration below this node
– Because all rules below this node have been discovered before
– For example, at node 34, if we find that {aeh} has already been found, we can prune off all branches below it

TT|{3,4}:
ij   C       ~C
a    1,2,3   4
e    2,3     4
h    2,3     4

[Figure: the row enumeration tree of the running example, with node 34 {aeh} highlighted]

Pruning Method 3: Minimum Support
• Example: From TT|{1}, we can see that the support of every possible pattern below node {1} will be at most 5 rows.

TT|{1}:
ij   C       ~C
a    1,2,3   4
b    1       5
c    1,3
l    1,2     5
o    1,3
s    1       5

From CARPENTER to FARMER
• What if classes exist? What more can we do?
• Pruning with Interestingness Measure
– Minimum confidence
– Minimum chi-square
• Generate lower bounds for classification/prediction

Interesting Rule Groups
• Concept of a rule group/equivalent class
– rules supported by exactly the same set of rows are grouped together
• Example: the following rules are derived from rows 2, 3 and 4 with 66% confidence

i   ri                 Class
1   a,b,c,l,o,s        C
2   a,d,e,h,p,l,r      C
3   a,c,e,h,o,q,t      C
4   a,e,f,h,p,r        ~C
5   b,d,f,g,l,q,s,t    ~C

Upper bound: aeh --> C (66%)
Other rules in the group: ae --> C (66%), ah --> C (66%), eh --> C (66%)
Lower bounds: e --> C (66%), h --> C (66%)
a --> C, however, is not in the group

Pruning by Interestingness Measure
• In addition, find only interesting rule groups (IRGs) based on some measures:
– minconf: the rules in the rule group can predict the class on the RHS with high confidence
– minchi: there is high correlation between the LHS and RHS of the rules based on the chi-square test
• Other measures like lift, entropy gain, conviction etc.
can be handled similarly

Ordering of Rows: All Class C before ~C
[Figure: the transposed table TT and the row enumeration tree of the running example (nodes such as 1 {abclos}, 12 {al}, 123 {a}, 23 {aeh}, 34 {aeh}, 12345 {}), with the conditional tables TT|{1}, TT|{12} and TT|{123}; rows of class C are ordered before rows of class ~C]

Pruning Method: Minimum Confidence
• Example: In TT|{2,3} below, the confidence of all rules below node {2,3} is at most 4/5

TT|{2,3}:
ij   C         ~C
a    1,2,3,6   4,5
e    2,3,7     4,9
h    2,3       4

Pruning Method: Minimum chi-square
Same as in computing maximum confidence (using the same TT|{2,3}):

         C          ~C         Total
A        max=5      min=1      Computed
~A       Computed   Computed   Computed
Total    Constant   Constant   Constant

Finding Lower Bound, MineLB
– Example: an upper bound rule with antecedent A = abcde and two rows (r1: abcf) and (r2: cdeg)
– Initialize the lower bounds to {a, b, c, d, e}
– Add "abcf": new lower bounds {d, e}
• Candidate lower bounds: ad, ae, bd, be, cd, ce
• Removed since d and e are still lower bounds
– Add "cdeg": new lower bounds {ad, bd, ae, be}
• Candidate lower bounds: ad, ae, bd, be
• Kept since no lower bound overrides them
[Figure: lattice from a, b, c, d, e up to abcde, with abc and cde marked]

Implementation
• In general, CARPENTER and FARMER can be implemented in many ways:
– FP-tree
– Vertical format
• For our case, we assume the dataset can be
fit into main memory and use a pointer-based algorithm similar to BUC
[Table: the transposed table TT of the running example]

Experimental studies
• Efficiency of FARMER
– On five real-life datasets
• lung cancer (LC), breast cancer (BC), prostate cancer (PC), ALL-AML leukemia (ALL), Colon Tumor (CT)
– Varying minsup, minconf, minchi
– Benchmark against
• CHARM [ZaHs02] ICDM'02
• Bayardo's algorithm (ColumnE) [BaAg99] SIGKDD'99
• Usefulness of IRGs
– Classification

Example results: Prostate
[Figure: FARMER, ColumnE and CHARM versus minimum support (3 to 9), log-scale vertical axis from 1 to 100000; plot not preserved]

Example results: Prostate
[Figure: FARMER with minsup=1 and FARMER with minsup=1, minchi=10 versus minimum confidence (0% to 99%), vertical axis from 0 to 1200; plot not preserved]

Top k Covering Rule Groups
• Rank rule groups (upper bound) according to
– Confidence
– Support
• Top k Covering Rule Groups for row r
– the k highest ranking rule groups that have row r as support and support > minimum support
• Top k Covering Rule Groups = TopKRGS for each row

Usefulness of Rule Groups
• Rules for every row
• Top-1 covering rule groups are sufficient to build a CBA classifier
• No min confidence threshold, only min support
• #TopKRGS = k x #rows

Top-k covering rule groups

row   class   items
1     C1      a,b,c
2     C1      a,b,c,d
3     C1      c,d,e
4     C2      c,d,e

• For each row, we find the most significant k rule groups:
– based on confidence first
– then support
• Given minsup=1, Top-1
– row 1: abc -> C1 (sup = 2, conf = 100%)
– row 2: abc -> C1
• abcd -> C1 (sup = 1, conf = 100%)
– row 3: cd -> C1 (sup = 2, conf = 66.7%)
• If minconf = 80%, ?
– row 4: cde -> C2 (sup = 1, conf = 50%)

Main advantages of Top-k covering rule groups
• The number is bounded by the product of k and the number of samples
• Treats each sample equally: provides a complete description for each row (small)
• The minimum confidence parameter is replaced by k
• Sufficient to build classifiers while avoiding excessive computation

Top-k pruning
• At node X, the maximal set of rows covered by rules to be discovered below X: rows containing X and rows ordered after X
– minconf: the minimum confidence of the discovered TopkRGs for all rows in the above set
– minsup: the corresponding minsup
• Pruning
– If the estimated upper bound of confidence below X < minconf: prune
– If same confidence and smaller support: prune
• Optimizations

Classification based on association rules
• Step 1: Generate the complete set of association rules for each class (minimum support and minimum confidence)
– The CBA algorithm adopts an Apriori-like algorithm and fails at this step on microarray data
• Step 2: Sort the set of generated rules
• Step 3: Select a subset of rules from the sorted rule set to form the classifier

Features of RCBT classifiers

Problem: To discover, store, retrieve and sort a large number of rules
RCBT: Mine only those rules to be used for classification, e.g. the Top-1 rule group is sufficient to build a CBA classifier

Problem: Default class not convincing for biologists
RCBT: Main classifier + some back-up classifiers

Problem: Rules with the same discriminating ability, how to integrate?
RCBT: A subset of lower bound rules, integrated using a score considering both confidence and support.
Upper bound rules: specific
Lower bound rules: general

Experimental studies
• Datasets: 4 real-life datasets
• Efficiency of Top-k Rule mining
– Benchmark: Farmer, Charm, Closet+
• Classification Methods:
– CBA (built using top-1 rule group)
– RCBT (our proposed method)
– IRG Classifier
– Decision trees (single, bagging, boosting)
– SVM

Runtime vs. Minimum support on ALL-AML dataset
[Figure: Runtime(s), log scale from 0.01 to 10000, versus Minimum Support (17 to 25) for FARMER, FARMER(minconf=0.9), FARMER+prefix(minconf=0.9), TOP1 and TOP100; plot not preserved]

Scalability with k
[Figure: Runtime(s), log scale from 0.1 to 100, versus k (100 to 1000) for the PC and ALL datasets; plot not preserved]

Biological meaning: Prostate Cancer Data
[Figure: Frequency of Occurrence (0 to 1800) versus Gene Rank (0 to 1600); labeled genes W72186, AF017418, AI635895, X14487, AB014519, M61916, Y13323; plot not preserved]

Classification results
[Two result tables not preserved]

References
• Anthony K. H. Tung, Rui Zhang, Nick Koudas, Beng Chin Ooi. "Similarity Search: A Matching Based Approach". VLDB'06.
• H. V. Jagadish, Raymond T. Ng, Beng Chin Ooi, Anthony K. H. Tung. "ItCompress: An Iterative Semantic Compression Algorithm". International Conference on Data Engineering (ICDE'2004), Boston, 2004.
• Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, Anthony K. H. Tung. "Similarity Search on Bregman Divergence: Towards Non-Metric Indexing".
In the Proceedings of the 35th International Conference on Very Large Data Bases (VLDB), Lyon, France, August 24-28, 2009.
• Jia Xu, Zhenjie Zhang, Anthony K. H. Tung, Ge Yu. "Efficient and Effective Similarity Search over Probabilistic Data Based on Earth Mover's Distance". To appear in VLDB 2010; a preliminary version appeared as Technical Report TRA5-10, National University of Singapore.
• Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". In Proceedings of SIGMOD'05, Baltimore, Maryland, 2005.
• Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao. "What is Unequal among the Equals? Ranking Equivalent Rules from Gene Expression Data". Accepted in TKDE.

Optional References:
• Feng Pan, Gao Cong, Anthony K. H. Tung, Jiong Yang, Mohammed Zaki. "CARPENTER: Finding Closed Patterns in Long Biological Datasets". In Proceedings of KDD'03, Washington, DC, USA, August 24-27, 2003.
• Gao Cong, Anthony K. H. Tung, Xin Xu, Feng Pan, Jiong Yang. "FARMER: Finding Interesting Rule Groups in Microarray Datasets". In SIGMOD'04, June 13-18, 2004, Maison de la Chimie, Paris, France.
• Feng Pan, Anthony K. H. Tung, Gao Cong, Xin Xu. "COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery". SSDBM 2004, Santorini Island, Greece.
• Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Feng Pan. "Mining Frequent Closed Patterns in Microarray Data". In IEEE International Conference on Data Mining (ICDM), 2004.
• Xin Xu, Ying Lu, Anthony K. H. Tung, Wei Wang. "Mining Shifting-and-Scaling Co-Regulation Patterns on Gene Expression Profiles". ICDE 2006.

Chapter 3
Similarity Search on Sequences

Searching and Mining Complex Structures
Similarity Search on Sequences
Anthony K. H.
Tung(鄧锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
Social Network Link: http://www.renren.com/profile.do?id=313870900

Types of sequences
• Symbolic vs numeric
– We only touch on discrete symbols here. Sequences of numbers are called time series and are a huge topic by themselves!
• Single dimension vs multi-dimensional
– Example: Yueguo Chen, Shouxu Jiang, Beng Chin Ooi, Anthony K. H. Tung. "Querying Complex Spatial-Temporal Sequences in Human Motion Databases". Accepted and to appear in the 24th IEEE International Conference on Data Engineering (ICDE) 2008
• Single long sequence vs multiple sequences

Outline
• Searching based on a disk based suffix tree
• Approximate Matching Using Inverted List (Vgrams)
• Approximate Matching Based on B+ Tree (BED Tree)

Suffix
Suffixes of acacag$:
1. acacag$
2. cacag$
3. acag$
4. cag$
5. ag$
6. g$
7. $

Suffix Trie
E.g.
consider the string S = acacag$
Suffix trie: a trie of all possible suffixes of S

     Suffix
1    acacag$
2    cacag$
3    acag$
4    cag$
5    ag$
6    g$
7    $

[Figure: suffix trie of acacag$, with one character per edge and leaves numbered 1 to 7 by suffix start position]

Suffix Tree (I)
Suffix tree for S = acacag$: merge nodes with only one child

position   1 2 3 4 5 6 7
S =        a c a c a g $

[Figure: suffix tree of acacag$ with annotations: "ca" is an edge label; the path-label of node v is "aca", denoted as α(v); an edge ending at a leaf is a leaf edge]

Suffix Tree (II)
• A suffix tree has exactly n leaves and at most 2n edges
• The label of each edge can be represented using 2 indices
• Thus, a suffix tree can be represented using O(n log n) bits
[Figure: the same suffix tree with edge labels replaced by index pairs such as 1,1 (a), 2,3 (ca), 4,7 (cag$), 6,7 (g$) and 7,7 ($)]
Note: The end index of every leaf edge should be 7, the last index of S. Thus, for leaf edges, we only need to store the start index.

Generalized suffix tree
Build a suffix tree for two or more strings
E.g.
S1 = acgat#, S2 = cgt$
[Figure: generalized suffix tree of acgat# and cgt$; each leaf is labeled with the start position of its suffix in S1 or S2]

Straightforward construction of suffix tree
Consider S = s1s2…sn where sn = $
Algorithm:
  Initialize the tree with only a root
  For i = n to 1
    Insert S[i..n] into the tree
Time: O(n^2)

Example of construction
S = acca$
[Figure: the initial tree and the trees I5, I4, I3, I2, I1 obtained after inserting the suffixes starting at positions 5, 4, 3, 2 and 1]

Construction of generalized suffix tree
S' = c#
[Figure: starting from the tree I1 for acca$, the trees J2 and J1 obtained after inserting the suffixes of c# starting at positions 2 and 1]

Property of suffix tree
Fact: For any internal node v in the suffix tree, if the path label of v is α(v) = ap, then there exists another node w in the suffix tree such that α(w) = p.
Proof: Skipped.
Definition of Suffix Link: For any internal node v, define its suffix link sl(v) = w.

Suffix Link example
S = acacag$
[Figure: suffix tree of acacag$ with suffix links drawn between internal nodes]

Can we construct a suffix tree in O(n) time? Yes.
We can construct it in O(n) time and O(n) space
• Weiner's algorithm [1973]
– Linear time for constant size alphabet, but much space
• McCreight's algorithm [JACM 1976]
– Linear time for constant size alphabet, more space-economical than Weiner's
• Ukkonen's algorithm [Algorithmica, 1995]
– Online algorithm, linear time for constant size alphabet, less space
• Farach's algorithm [FOCS 1997]
– Linear time for general alphabet
• Hon, Sadakane, and Sung's algorithm [FOCS 2003]
– O(n) bit space, O(n log^ε n) time for 0 < ε < 1
– O(n) bit space, O(n) time for suffix array construction
• But they are all in-memory algorithms that do not guarantee locality of processing

Trellis Algorithm
• A novel disk-based suffix tree construction algorithm designed specifically for DNA sequences
• Scales gracefully for very large genome sequences (e.g. the human genome)
• Unlike existing algorithms, Trellis exhibits no data skew problem
• Trellis recovers suffix links quickly
• Trellis has fast construction and query time
• Trellis is a 4-step algorithm

Trellis: Algorithm Overview
1. Variable-length prefixes: e.g. AA, ACA, ACC, …
2. Prefixed suffix sub-trees: the sequence S is split into partitions R0, R1, …, Rr-1 with suffix trees TR0, TR1, …, TRr-1, which are divided by prefix into sub-trees TR0,P0, …, TR1,Pm-1, … and stored on disk
3. Tree merging: for each prefix Pi, the sub-trees TR0,Pi, …, TRr-1,Pi are merged into TPi
4. Suffix link recovery (optional)

1. Variable-length Prefix Creation
Goal: Separate the complete suffix tree by prefixes of suffixes, such that each subtree can reside entirely in the available memory
Main idea: Expand prefixes only as needed
[Figure: Frequency of Length-2 Prefixes for Human Genome; the 16 prefixes range from under 50,000,000 (CG) to over 250,000,000 (AA, TT) occurrences]

2. Suffix Tree Partitioning
[Figure: overview diagram, steps 1 and 2]
• Use Ukkonen's method because of its efficiency: O(n) time and space
• Discard suffix links when storing the subtrees on disk
• Store enough information so that a subtree can be rebuilt quickly, e.g. edge starting index, edge length, node parent, etc.

3. Suffix Tree Merging
[Figure: overview diagram, steps 1 to 3: the sub-trees TR0,Pi, …, TRr-1,Pi on disk are merged into TPi]

Merge Algorithm
• Case 1: No common prefix
[Figure: the root of T1 has edges A and G, the root of T2 has edges T and C; the merged root has edges T, A, C and G]
• Case 2: Has common prefix
[Figure: T1 has an edge CAAT and T2 has an edge CAGGC; the merged tree has an edge CA leading to a node with edges AT and GGC]

4. Suffix Link Recovery
• Some internal nodes have suffix links from Ukkonen's algorithm in Step #1
• Some internal nodes are created in the merging step and do not have suffix links
• Discard all suffix link information from step #1 and store the suffix trees on disk (keeping it does not help speed this step up, so discard it to simplify)
• Should suffix links be required, use the suffix link recovery algorithm to rebuild them

4. Suffix Link Recovery (cont.)
For each prefixed suffix tree, recursively call this function from the tree's root.
x: an internal node
L: the edge label between x and parent(x)

RECOVER(x, L)
  if (x == root)
    sl(x) = x;
  else {
    1. p = parent(x);
    2. q = sl(p); // get the suffix link of p, and load the prefix tree for q from disk if not in memory
    3.
       Skip/count using L to locate sl(x) under q;
  }
  for (each internal child y of x)
    RECOVER(y, edge-label(x, y));

Experimental Results
Construction Time: Trellis vs TOP-Q and DynaCluster
[Figure: Time (mins), log scale from 1 to 1000, versus Sequence Length (0 to 120 Mbp) for TOP-Q, DynaCluster and Trellis; plot not preserved]
• Memory: 512 MB
• TOP-Q and DynaCluster parameters were set as recommended in their papers

Construction and Link Recovery Time: Trellis vs TDD
[Figure: Time (mins), 0 to 400, versus Sequence Length (200 to 1000 Mbp) for TDD, Trellis, Link Recovery and Total Trellis; plot not preserved]
• Memory: 512 MB

Human genome suffix tree (size ~3 Gbp, using 2 GB of memory)
• Trellis
– Without links: 4.2 hr
– With links: 5.9 hr
• TDD: 12.8 hr

Experimental Results (cont.)
Disk Space Usage
Disk-based Suffix Tree Size: Trellis vs TDD
[Figure: Size (GB), 0 to 30, versus Sequence Length (200 to 1000 Mbp) for Trellis and TDD; plot not preserved]
• On average, Trellis uses about 27 bytes per character indexed while TDD uses about 9.7 bytes.
• For the human genome, TDD uses about 19.3 bytes/char because it requires a 64-bit environment to index larger sequences. Trellis remains at 27 bytes/char for the human genome.
• Human genome: Trellis 72 GB, TDD 54 GB
• Disk-space vs query time tradeoff

Experimental Results (cont.)
Query time (without suffix links): Trellis vs TDD
[Figure: Query Times on the Human Genome Suffix Tree; Query Time (0.000 to 0.200 secs) for Query Lengths from 40 to 8000 bp, Trellis vs TDD; plot not preserved]
TDD
• smaller suffix trees
• edge length must be determined by examining all children nodes
• each internal node only has a pointer to its first child, i.e. children must be linearly scanned during a query search
Trellis
• larger suffix trees
• edge length stored locally with its respective node
• all children locations stored locally, so each child can be accessed in constant time, i.e. no linear scan needed
Hence, faster query time!
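The RECOVER procedure from the suffix link recovery slide can be tried out on a small in-memory example. The sketch below is a simplification, not the Trellis implementation: it uses an uncompressed suffix trie (one character per edge) built by the straightforward O(n^2) construction, so the skip/count step reduces to following a single child pointer, and nothing is loaded from disk. The names `Node`, `build_suffix_trie`, `recover`, `path_label` and `find` are hypothetical helpers.

```python
class Node:
    def __init__(self, parent=None, char=""):
        self.parent = parent
        self.char = char          # single-character edge label from the parent
        self.children = {}
        self.suffix_link = None

def build_suffix_trie(s):
    """Straightforward construction: insert S[i..n] for i = n down to 1."""
    root = Node()
    for i in range(len(s) - 1, -1, -1):
        node = root
        for ch in s[i:]:
            if ch not in node.children:
                node.children[ch] = Node(node, ch)
            node = node.children[ch]
    return root

def recover(x, root):
    """Set sl(x) from sl(parent(x)), then recurse into the children."""
    if x is root:
        x.suffix_link = root
    else:
        p = x.parent
        q = p.suffix_link
        # A child of the root has a one-character path label, so dropping
        # that character leaves the empty string, i.e. the root.
        x.suffix_link = root if p is root else q.children[x.char]
    for y in x.children.values():
        recover(y, root)

def path_label(node):
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))

def find(root, label):
    node = root
    for ch in label:
        node = node.children[ch]
    return node

root = build_suffix_trie("acacag$")
recover(root, root)
v = find(root, "aca")
print(path_label(v), "->", path_label(v.suffix_link))
```

For S = acacag$ the node with path label "aca" receives a suffix link to the node with path label "ca", matching the definition α(v) = ap, α(sl(v)) = p. The recursion works because sl(parent(x)) is always set before x is visited.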
Experimental Results (cont.)
[Figure: a query of length 100 taken at S[150]; nodes v and sf(v) with path labels xα and α]
• Uses suffix links to move across the tree to search for the next query
• Mimics the behavior of exact match anchor search during a genome alignment

Experimental Results (cont.)
Query time (with suffix links)
Trellis: Without Suffix Links vs With Suffix Links
[Figure: Query Times on the Human Genome Suffix Tree; Query Time (0.000 to 0.050 secs) for Query Lengths from 40 to 8000 bp, with and without suffix links; plot not preserved]

Summary
• Trellis builds a disk-based suffix tree based on
– A partitioning method via variable-length prefixes
– A suffix subtree merging algorithm
• Trellis is both time and space efficient
• Trellis quickly recovers suffix links
• Faster than existing leading methods in both construction and query time

Outline
• Searching based on a disk based suffix tree
• Approximate Matching Using Inverted List (Vgrams)
• Approximate Matching Based on B+ Tree (BED Tree)

Example 1: a movie database
Tom: Find movies starring Samuel Jackson

Star             Title                                          Year   Genre
Keanu Reeves     The Matrix                                     1999   Sci-Fi
Samuel Jackson   Star Wars: Episode III - Revenge of the Sith   2005   Sci-Fi
Schwarzenegger   The Terminator                                 1984   Sci-Fi
Samuel Jackson   Goodfellas                                     1990   Drama
…                …                                              …      …

How about Schwarrzenger?
The user doesn't know the exact spelling!
[Table: the movie table of Example 1]

Relax Condition
Find movies with a star "similar to" Schwarrzenger.
[Table: the movie table of Example 1]

Edit Distance
Given two strings A and B, edit A to B with the minimum number of edit operations:
• Replace a letter with another letter
• Insert a letter
• Delete a letter
E.g.
A = interestings      _i__nterestings
B = bioinformatics    bioinformatic_s
                      101101101100110
Edit distance = 9

Edit Distance Computation
Instead of minimizing the number of edit operations, we can associate a cost function with the operations and minimize the total cost. Such a cost is called the edit distance.
For the previous example, the cost function is as follows:

A = _i__nterestings
B = bioinformatic_s
    101101101100110
Edit distance = 9

     _   A   C   G   T
_        1   1   1   1
A    1   0   1   1   1
C    1   1   0   1   1
G    1   1   1   0   1
T    1   1   1   1   0

Needleman-Wunsch algorithm (I)
Consider two strings S[1..n] and T[1..m].
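The unit-cost edit distance from the Edit Distance slides can be computed with a standard dynamic program over prefixes. The sketch below is a minimal illustration (it minimizes a cost, whereas the Needleman-Wunsch formulation maximizes a score); `edit_distance` and `alignment_cost` are hypothetical helper names. `alignment_cost` simply counts the mismatching columns of a given alignment, which is how the 0/1 row under the slide's alignment is obtained.

```python
def edit_distance(a, b):
    """Minimum number of replace/insert/delete operations turning a into b."""
    prev = list(range(len(b) + 1))            # distance from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        cur = [i]                             # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j - 1] + (ca != cb),     # match or replace
                prev[j] + 1,                  # delete ca
                cur[j - 1] + 1,               # insert cb
            ))
        prev = cur
    return prev[-1]

def alignment_cost(top, bottom):
    """Unit cost of a given alignment: number of columns that differ."""
    assert len(top) == len(bottom)
    return sum(x != y for x, y in zip(top, bottom))

print(alignment_cost("_i__nterestings", "bioinformatic_s"))
print(edit_distance("stick", "shtick"))
```

Any concrete alignment gives an upper bound on the edit distance, so the value returned by `edit_distance("interestings", "bioinformatics")` can never exceed the cost of the alignment shown on the slide.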
Define V(i, j) to be the score of the optimal alignment between S[1..i] and T[1..j]
Basis:
V(0, 0) = 0
V(0, j) = V(0, j-1) + δ(_, T[j])    (insert j times)
V(i, 0) = V(i-1, 0) + δ(S[i], _)    (delete i times)

Needleman-Wunsch algorithm (II)
Recurrence: for i > 0, j > 0

\[ V(i,j) = \max \begin{cases} V(i-1,j-1) + \delta(S[i],T[j]) & \text{match/mismatch} \\ V(i-1,j) + \delta(S[i],\_) & \text{delete} \\ V(i,j-1) + \delta(\_,T[j]) & \text{insert} \end{cases} \]

In the alignment, the last pair must be either a match/mismatch, a delete, or an insert:

match/mismatch   delete    insert
xxx…xx           xxx…xx    xxx…x_
yyy…yy           yyy…y_    yyy…yy

Example (I)
S = ACAATCC (rows), T = AGCATGC (columns). Only the basis is filled in: V(0, j) = -j and V(i, 0) = -i.

Example (II)
The row for S[1] = A is filled in (2, 1, 0, -1, -2, -3, -4) and the row for S[2] = C is in progress (1, 1, then V(2, 3) = ?).

Example (III)
The completed table (the values are consistent with δ = +2 for a match and -1 for a mismatch or a gap):

      _    A    G    C    A    T    G    C
_     0   -1   -2   -3   -4   -5   -6   -7
A    -1    2    1    0   -1   -2   -3   -4
C    -2    1    1    3    2    1    0   -1
A    -3    0    0    2    5    4    3    2
A    -4   -1   -1    1    4    4    3    2
T    -5   -2   -2    0    3    6    5    4
C    -6   -3   -3    0    2    5    5    7
C    -7   -4   -4   -1    1    4    4    7

"q-grams" of strings
2-grams of "universal": un, ni, iv, ve, er, rs, sa, al

q-gram inverted lists

id   string
0    rich
1    stick
2    stich
3    stuck
4    static

2-gram   string ids
at       4
ch       0, 2
ck       1, 3
ic       0, 1, 2, 4
ri       0
st       1, 2, 3, 4
ta       4
ti       1, 2, 4
tu       3
uc       3

Searching using inverted lists
Query: "shtick", ED(shtick, ?) ≤ 1
2-grams of the query: sh, ht, ti, ic, ck
# of common grams >= 3
[Table: the strings and 2-gram inverted lists above]

2-grams -> 3-grams?
Query: "shtick", ED(shtick, ?) ≤ 1
3-grams of the query: sht, hti, tic, ick
# of common grams >= 1

id   string
0    rich
1    stick
2    stich
3    stuck
4    static

3-gram   string ids
ati      4
ich      0, 2
ick      1
ric      0
sta      4
sti      1, 2
stu      3
tat      4
tic      1, 2, 4
tuc      3
uck      3

Observation 1: dilemma of choosing "q"
Increasing "q" causes:
• Longer grams, hence shorter lists
• Smaller # of common grams of similar strings
[Table: the strings and 2-gram inverted lists of the running example]

Observation 2: skew distributions of gram frequencies
• DBLP: 276,699 article titles
• Popular 5-grams: ation (>114K times), tions, ystem, catio

VGRAM: Main idea
• Grams with variable lengths (between qmin and qmax)
– zebra: ze (123)
– corrasion: co (5213), cor (859), corr (171)
• Advantages
– Reduced index size ☺
– Reduced running time ☺
– Adoptable by many algorithms ☺

Challenges
• Generating variable-length grams?
• Constructing a high-quality gram dictionary?
• Relationship between string similarity and their gram-set similarity?
• Adopting VGRAM in existing algorithms?

Challenge 1: String -> Variable-length grams?
• Fixed-length 2-grams: universal
• Variable-length grams: universal, using the [2,4]-gram dictionary {ni, ivr, sal, uni, vers}

Representing gram dictionary as a trie
(figure: trie storing the grams ni, ivr, sal, uni, vers)

Challenge 2: Constructing gram dictionary
Step 1: Collecting frequencies of grams with length in [qmin, qmax]

gram   string ids
st     0, 1, 3
sti    0, 1
stu    3
stic   0, 1
stuc   3

The result is a gram trie with frequencies.

Step 2: Selecting grams
• Prune the trie using a frequency threshold T (e.g., T = 2)
• The pruned trie is the final gram dictionary of [2,4]-grams

Challenge 3: Edit operation’s effect on grams
• Fixed length q (example string: universal): k operations could affect k * q grams

Deletion affects variable-length grams
• A deletion at position i can affect only grams that overlap the window [i - qmax + 1, i + qmax - 1]
• Grams entirely to the left or right of this window are not affected

Grams affected by a deletion
• Example: a deletion in universal with the [2,4]-grams ni, ivr, sal, uni, vers: which grams inside the window are affected?
• The check uses two tries: a trie of grams and a trie of reversed grams

# of grams affected by each operation

position:  _ u _ n _ i _ v _ e _ r _ s _ a _ l _
# grams:   0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0

(letter positions: deletion/substitution; gap positions "_": insertion)

Max # of grams affected by k operations
• Vector of s = <2,4,6,8,9>
• With 2 edit operations, at most 4 grams can be affected
• Called the NAG vector (# of affected grams)
• Precomputed and stored

Summary of VGRAM index
(figure only)

Challenge 4: adopting VGRAM
• Easily adoptable by many algorithms
• Basic interfaces:
  • String s → grams
  • Strings s1, s2 such that ed(s1,s2) <= k → min # of their common grams

Lower bound on # of common grams
• Fixed length (q): if ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
• Variable lengths: # of grams of s1 - NAG(s1,k)

Example: algorithm using inverted lists
Query: “shtick”, ED(shtick, ?) ≤ 1; the query's variable-length grams are sh, ht, tick

2-grams (lower bound = 3)     2-4 grams (lower bound = 1)
…                             …
ck    1 3                     ck    1 3
ic    0 1 2 4                 ic    1 4
…                             ich   0 2
ti    1 2 4                   …
…                             tic   2 4
                              tick  1
                              …

(strings: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static)

PartEnum + VGRAM
• PartEnum, fixed q-grams: ed(s1,s2) <= k ⇒ hamming(grams(s1), grams(s2)) <= k * q
• VGRAM: ed(s1,s2) <= k ⇒ hamming(VG(s1), VG(s2)) <= NAG(s1,k) + NAG(s2,k)

PartEnum + VGRAM (naïve)
• Join of R and S, with Bm(R) = max(NAG(r,k)) over r in R and Bm(S) = max(NAG(s,k)) over s in S
• Both are using the same gram dictionary.
• Use Bm(R) + Bm(S) as the new hamming bound.
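The fixed-length lower bound above can be tried out on the five example strings. The following is a minimal sketch, not the VGRAM implementation: it uses fixed-length q-grams without padding, counts common grams as multisets, and verifies candidates with a plain edit-distance routine. The function names (`qgrams`, `build_index`, `search`) are illustrative choices, not names from the slides.

```python
from collections import Counter, defaultdict

def qgrams(s, q):
    """Overlapping q-grams of s, without padding characters."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q):
    """Inverted lists: gram -> {string id: occurrences of the gram in that string}."""
    index = defaultdict(dict)
    for sid, s in enumerate(strings):
        for gram, n in Counter(qgrams(s, q)).items():
            index[gram][sid] = n
    return index

def edit_distance(a, b):
    """Classic dynamic program with unit costs, kept to one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search(strings, index, query, q, k):
    """Count filter followed by verification; returns (bound, candidates, answers)."""
    bound = (len(query) - q + 1) - k * q
    if bound <= 0:
        # The filter cannot prune anything, so every string is a candidate.
        candidates = list(range(len(strings)))
    else:
        common = defaultdict(int)
        for gram, n in Counter(qgrams(query, q)).items():
            for sid, m in index.get(gram, {}).items():
                common[sid] += min(n, m)  # multiset intersection size
        candidates = sorted(sid for sid, c in common.items() if c >= bound)
    answers = [sid for sid in candidates if edit_distance(query, strings[sid]) <= k]
    return bound, candidates, answers

strings = ["rich", "stick", "stich", "stuck", "static"]
for q in (2, 3):
    print(q, search(strings, build_index(strings, q), "shtick", q, 1))
```

With q = 2 the bound is 3 and only "stick" survives the filter; with q = 3 the bound drops to 1 and "stick", "stich" and "static" all have to be verified, which is the dilemma of choosing q described earlier.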
PartEnum + VGRAM (optimization)
• Group R into R1, R2, R3 based on the NAG(r,k) values, with bounds Bm(R1), Bm(R2), Bm(R3); Bm(S) = max(NAG(s,k))
• Join(R1,S) using Bm(R1) + Bm(S)
• Similarly, Join(R2,S), Join(R3,S)
• Local bounds are tighter → better signatures are generated
• Grouping S is also possible.

Outline
• Searching based on a disk based suffix tree
• Approximate Matching Using Inverted List (Vgrams)
• Approximate Matching Based on B+ Tree (BED Tree)

Approximate String Search
• Information Retrieval: web search query with string “Posgre SQL” instead of “Postgre SQL”
• Data Cleaning: is “13 Computing Road” the same as “#13 Comput’ng Rd”?
• Bioinformatics: find all protein sequences similar to “ACBCEEACCDECAAB”

Edit Distance
• Edit distance on strings

  13 Computing Drive
    → (3 deletions)    13 Computing Dr
    → (1 replacement)  13 Comput’ng Dr
    → (1 insertion)    #13 Comput’ng Dr
  Edit distance: 5

• Normalized edit distance

\[
\frac{ED(s_1, s_2)}{MaxLength(s_1, s_2)} = \frac{5}{18}
\]

Existing Solution: Q-Gram (Q = 3)
• Postgre: ##P #Po Pos ost stg tgr gre re# e##
• Posgre: ##P #Po Pos osg sgr gre re# e##
• Observation: if ED(s1,s2) = d, they agree on at least min(|s1|,|s2|) + Q - 1 - d*(Q+1) grams

Existing Solution: Inverted List
(figure: inverted lists from the grams ##P, #Po, Pos, osg, sgr, gre, re$, e$$ to the strings Postgre and Posgre)

Limitations of the Inverted List Method
• Limited queries supported

                 Range Query   Join Query   Top-K Query   Top-K Join
Edit Distance    Y             Y            N             N
Normalized ED    N             N            N             N

• Uncontrollable memory consumption
• Concurrency protocol

Our Contributions: Bed-Tree
• Wide support on different queries and distances

                 Range Query   Join Query   Top-K Query   Top-K Join
Edit Distance    Y             Y            Y             Y
Normalized ED    Y             Y            Y             Y

• Adjustable buffer size and low I/O cost
• Highly concurrent
• Easy to implement
• Competitive performance

Basic Index Framework (Bed-Tree Framework)
• Map all strings to a 1D domain
• Index construction follows the standard B+ tree
• Query (e.g., Posgre): estimate the minimal distance to the query and prune B+ tree nodes
• Refine the result by exact edit distance (result: Postgre)

String Order Properties
• P1: Comparability. Given two strings s1 and s2, we know the order of s1 and s2 under the specified string order.
• P2: Lower Bounding. Given an interval [L,U] on the string order, we know a lower bound on the edit distance to the query string. (Query: Posgre. Are there candidates in the sub-tree?)
• P3: Pairwise Lower Bounding. Given two intervals [L,U] and [L’,U’], we know a lower bound on the edit distance between s1 from [L,U] and s2 from [L’,U’]. (Are there potential join results?)
• P4: Length Bounding. Given an interval [L,U] on the string order, we know the minimal length of the strings in the interval.

Properties vs. supported queries and distances

                 Range Query   Join Query   Top-K Query   Top-K Join
Edit Distance    P1, P2        P1, P3       P1, P2        P1, P3
Normalized ED    P1, P2, P4    P1, P3, P4   P1, P2, P4    P1, P3, P4

Dictionary Order
• All strings are ordered alphabetically, satisfying P1, P2 and P3
• Insertion of Postgre: it falls between “pose” and “powder” (node keys in the figure: pose, powder, sit)
• Search for Posgre with ED = 1: not pruning anything! (node keys in the figure: pose, powder, sit; power, put, sad)
• Pruning happens only when a long prefix exists

Gram Counting Order
• Example string: Jim Gray
• Hash all grams to 4 buckets
• Count the grams in each bucket and write the counts in binary
• Transform the count vector to a bit string with z-order encoding
• Order the strings by this signature

Gram Counting Order: Lower Bounding
• Query: Jim Gary
• Interval “11011011” to “11011101”, common prefix “11011???”
• Signature: (4,1,2,2)
• Minimal edit distance: 1

Gram Location Order
• Extension of Gram Counting Order
• Includes positional information of the grams (examples: Jim Gray, Grace Hopper)
• Allows better estimation of mismatched grams
• Harder to encode

Experiment Settings
• Five index schemes
  • Bed-Tree: BD, BGC, BGL
  • Inverted List: Flamingo, Mismatch
• Default setting: Q = 2, Bucket = 4, Page Size = 4KB

Empirical Observations
• How good is Bed-Tree?
  • With a small threshold, inverted lists are better
  • When the threshold increases, Bed-Tree is not worse
• Which string order is better?
  • Gram counting order is generally better
  • Gram location order: tradeoff between gram content information and position information

Conclusion
• A new B+ tree index scheme
• All similarity queries supported
• Both edit distance and normalized distance
• General transaction and concurrency protocol
• Competitive efficiencies

References
• Benjarath Phoophakdee, Mohammed J. Zaki. "Genome-scale disk-based suffix tree indexing". SIGMOD Conference 2007: 833-844.
• Chen Li, Bin Wang, and Xiaochun Yang. "VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams". VLDB 2007.
• Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava. "B^{ed}-Tree: An All-Purpose Tree Index for String Similarity Search on Edit Distance". SIGMOD 2010.

Chapter 4: Similarity Search on Trees
Searching and Mining Complex Structures
Anthony K. H. Tung (鄧锦浩)
School of Computing, National University of Singapore
www.comp.nus.edu.sg/~atung
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
Social Network Link: http://www.renren.com/profile.do?id=313870900

Outline
• Importance of Trees
• Distance between Trees
• Fast Edit Distance Approximation for Trees

Importance of Trees
• Between sequences and graphs
• Equivalent to an acyclic graph
• Represent hierarchical structures
• Examples: XML documents, programs, RNA structure

Types of Trees
• Is there a root?
• Are the nodes labeled?
• Are the children of a node ordered?

Distance Measure
• Many ways to define distance
  • Convert to standard types and adopt the distance metric there
  • How many operations to transform one tree to another? (Edit distance)
  • Inverse of similarity: dist(S, T) = maxSim – sim(S,T)
• Relationship between different definitions?
Operations on Trees
• Relabel
• Delete
• Insert

Remarks on Edit Distance
• Ordered trees are tractable: the approach is based on dynamic programming
• NP-hard for unordered trees: the approach is to impose restrictions so that DP can be used

Edit Script
• Edit script(S, T): sequence of operations to transform S to T
• Example (tree figures omitted): the operations shown are relabel f → a, delete c, relabel e → d, and insert c

Edit Distance Mapping
• Edit distance mapping(S, T): alternative representation of edit operations
  • relabel: v → w
  • delete: v → $
  • insert: $ → w
• (figure: the mapping corresponding to the script)

Edit Distance for Ordered Trees
Generalize the problem to forests. Let v and w be the rightmost roots of the forests S and T, let tree(v) be the subtree rooted at v, and let S(v) be the forest of subtrees below v (tree(v) with v removed); T(w) is defined in the same way.

C(φ, φ) = 0
C(S, φ) = C(S – v, φ) + cost(v → $)
C(φ, T) = C(φ, T – w) + cost($ → w)
C(S, T) = minimum of
1. C(S – v, T) + cost(v → $)   [deleting v]
2. C(S, T – w) + cost($ → w)   [inserting w]
3. C(S – tree(v), T – tree(w)) + C(S(v), T(w)) + cost(v → w)   [relabel v → w]

Illustration of Case 3
(figure: S splits into S – tree(v), the root v, and the forest S(v) below v; T splits into T – tree(w), the root w, and T(w))

Algorithm Complexity
• Number of subproblems bounded by O(|S|^2 |T|^2)
• Zhang and Shasha, 1989 showed that the number of relevant subproblems is O(|S||T| min(SD, SL) min(TD, TL)) and the space is O(|S||T|), where SD and SL (TD and TL) are the depth and the number of leaves of S (of T)
• Further improvements required decomposition of a rooted tree into disjoint paths

Decomposition into Paths
• Concept of heavy and light nodes/edges (Harel and Tarjan, 1984)
• Root is light, child with max size is heavy
• Removal of light edges partitions T into disjoint heavy paths
• Important property: light depth(v) ≤ log|T| + O(1)
• Complexity can be reduced to O(|S|^2 |T| log|T|)

Unordered Edit Distance
• NP-hard
• Special cases (in P)
  • T is a sequence
  • Number of leaves in T is logarithmic
• Impose additional constraints
  • Disjoint subtrees map to disjoint subtrees

Tree Inclusion
• Is there a sequence of deletion operations on S which can transform it to T?
• Special case of edit distance which only allows deletions

Complexity of Tree Inclusion
• Ordered trees
  • Concept of embeddings (restriction of mappings)
  • O(|S||T|) using the algorithm of Kilpelainen and Mannila
• Unordered trees
  • NP-complete (what did you expect?)
  • Special cases

Related Problems on Trees
• Tree Alignment (covered in the survey paper)
• Robinson-Foulds distance for leaf-labeled trees, where edge = bipartition of leaves
• Tree Pattern Matching
• Maximum Agreement Subtree
• Largest Common Subtree
• Smallest Common Supertree
• Many are generalizations of problems on strings

Summary of Tree Distance
• Edit distance
  • Concept of edit mapping
  • Dynamic programming for ordered trees
  • Constrained edit distance for unordered trees
• Tree inclusion
  • Special case of edit distance
  • Specialized algorithms are more efficient
  • Useful for determining embedded trees

Similarity Measurement
• Edit distance EDist(T1, T2)
• Edit operation e with cost γ(e): relabel a → b, delete b → λ, insert λ → b
• Edit script s_i = (e_i1, e_i2, …, e_ik): T1 → T2, with cost(s_i) = ∑_j γ(e_ij)
• EDist(T1, T2) = min_i cost(s_i); with unit costs, EDist(T1, T2) = min(k)
• Computational complexity:

\[
O\bigl(|T_1| \times |T_2| \times \min(depth(T_1), leaves(T_1)) \times \min(depth(T_2), leaves(T_2))\bigr)
\]

Edit Operation Mapping
• The mapping M(T1, T2) induced by the edit operations is
  • One-to-one
  • Sibling order preserving
  • Ancestor order preserving
• (figure: trees T1 and T2 with the mapping M(T1, T2))

Observation
• Edit operations do not change many sibling relationships
• Example (figure): after deleting c (c → λ), the sibling relations change as (b,c) → (b,f) and (c,d) → (i,d)
• Node: varying number of children vs. at most 2 siblings

Binary Tree Representation
• Left-child, right-sibling representation
• Normalized binary tree: missing children are padded with ε
• Binary branch vector: one count per distinct binary branch (node, left child, right child)
• (figure: T1 and T2, their normalized binary trees with (preorder, postorder) positions, and their binary branch vectors)

\[
BBDist(T_1, T_2) = \sum_{i=1}^{|\Gamma|} |b_{1i} - b_{2i}| = 8
\]

• BBDist satisfies the triangular inequality

One Edit Operation Effect
• (figure: inserting or deleting a node v under v’ with siblings w1 … wl+m+1, shown in the original tree and in the binary tree representation)
• Each node appears in at most two binary branches

Theorem 1
• 1 insertion/deletion incurs at most 5 difference on BBDist
• 1 relabeling incurs at most 4 difference on BBDist
• For T, T’ with EDist(T, T’) = k = ki + kd + kr: BBDist(T, T’) <= 4kr + 5ki + 5kd <= 5k
• Hence 1/5 BBDist is a lower bound of the edit distance

Positional Binary Branch
• (figure: trees T1, T2, T’2 and the binary trees B(T1), B(T2), with an example that incurs 0 difference on BBDist)
• Positional binary branch: PosBiB(T(u))
• PosBiB(T1(e)) = (BiB(e,ε,ε), 8, 7) ≠ PosBiB(T2(e)) = (BiB(e,ε,ε), 6, 3)
• Positional Binary Branch Distance

Computational Complexity
• D: dataset; |D|: dataset size
• Vector construction: traverse the data trees once; time and space

\[
O\Bigl(\sum_{i=1}^{|D|} |T_i|\Bigr)
\]

• Optimistic bound computation:
  • time: each binary search O(|T_i| + |T_q|), in total

\[
O\Bigl(\sum_{i=1}^{|D|} (|T_i| + |T_q|) \log(\min(|T_i|, |T_q|))\Bigr)
\]

  • space:

\[
O\Bigl(\sum_{i=1}^{|D|} (|T_i| + |T_q|)\Bigr)
\]
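The binary branch distance and the lower bound of Theorem 1 can be sketched in a few lines. This is a minimal illustration, assuming that a tree is given as a (label, children) pair and that a binary branch is the triple (node label, first child label, next sibling label) of the left-child, right-sibling representation; the helper names and the two small trees are hypothetical and are not the trees from the figures.

```python
from collections import Counter

EPS = "ε"  # marks a missing first child or next sibling

def binary_branches(tree):
    """Count the binary branches (label, first child, next sibling) of a tree."""
    counts = Counter()

    def visit(node, next_sibling):
        label, children = node
        first_child = children[0][0] if children else EPS
        counts[(label, first_child, next_sibling)] += 1
        for i, child in enumerate(children):
            sibling = children[i + 1][0] if i + 1 < len(children) else EPS
            visit(child, sibling)

    visit(tree, EPS)
    return counts

def bb_dist(t1, t2):
    """L1 distance between the binary branch count vectors of two trees."""
    c1, c2 = binary_branches(t1), binary_branches(t2)
    return sum(abs(c1[b] - c2[b]) for b in set(c1) | set(c2))

def edit_distance_lower_bound(t1, t2):
    """ceil(BBDist / 5), a lower bound on the unit-cost edit distance (Theorem 1)."""
    return -(-bb_dist(t1, t2) // 5)

# Hypothetical trees: t2 is t1 with the leaf c relabeled to d (edit distance 1).
t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", []), ("d", [])])
print(bb_dist(t1, t2), edit_distance_lower_bound(t1, t2))  # 4 1
```

The single relabeling changes two binary branches in each tree, so BBDist is 4, which matches the "at most 4" case of Theorem 1.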
Generalized Study
• Extend the sliding window to q levels
• The image vector gives multiple-level binary branch profiles
• BDist_q(T,T’) <= [4*(q-1)+1] * EDist(T,T’)
• (figure: a q-level window over the binary tree representation around nodes v’, v, w1 … wl+m+1)

Query Processing Strategy
• Filter-and-refine framework
  • Lower bound distances filter out most objects
  • The lower bound is much cheaper to compute than the real distance
  • The lower bound distance is a close approximation of the real distance
• Remaining objects are validated by the real distance

Experimental Settings
• Compare with histogram methods [KKSS04]
  • Lower bound: feature vector distance (Leaf Distance Height histogram vector, Degree histogram vector, Label histogram vector)
• Synthetic dataset: tree size, fanout, label, decay factor
• Real dataset: dblp XML document
• Performance measures
  • Percentage of data accessed:

\[
\frac{|\text{false positive}| + |\text{true positive}|}{|\text{dataset}|} \times 100\%
\]

  • CPU time consumed
  • Space requirement

Sensitivity to the Data Properties
[Figure: sensitivity test, four panels. Range and KNN queries on N{}N{50,2.0}L8D0.05 (x-axis: fanout, 2 to 8; mean(|T|): 50; size(label): 8) and on N{4,0.5}N{}L8D0.05 (x-axis: tree size, 25 to 125; mean(fanout): 4; size(label): 8). Y-axes: % of accessed data and CPU cost (seconds). Series: BiBranch %, Histo %, Result % (range queries only), BiBranch, Sequ.]

Sensitivity test (cont.)
[Figure: two panels. Range and KNN queries on N{4,0.5}N{50,2.0}L{}D0.05 (x-axis: label number, 8 to 64; mean(|T|): 50; mean(fanout): 4). Y-axes: % of accessed data and CPU cost (seconds). Series: BiBranch %, Histo %, Result % (range queries only), BiBranch, Sequ.]

Queries with Different Parameters
• Dblp data (avg. distance: 5.031)
• Range queries (range 1 to 10) and KNN (k: 5-20)
[Figure: two panels, "Range: DBLP" and "KNN: DBLP". Y-axes: % of accessed data and CPU cost (seconds). Series: BiBranch %, Histo %, Result % (range queries only), BiBranch, Sequ.]

Pruning Power of Different Level
• Data distribution according to distances
  • Edit distance
  • Histogram distance
  • Binary branch distance: 2, 3, 4 level
[Figure: DBLP, data distribution against distance (1 to 12). Series: Edit, Histo, BiBranch(2), BiBranch(3), BiBranch(4).]

Citations on the Paper
• Surprisingly, the paper attracts citations and questions from software engineering!
• Expect more impact along the software mining direction soon.
• L. Jiang, G. Misherghi, Z. Su, S. Glondu. DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones. Proceedings of the 29th International Conference on Software …, 2007. "Detecting code clones has many software engineering applications. Existing approaches either do not scale to large code bases or are not robust against minor code modifications. In this paper, we present an efficient ..."
• A. J. M. Molina, T. Shinohara. Fast Approximate Matching of Programs for Protecting Libre/Open Source Software by Using Spatial …. Source Code Analysis and Manipulation, 2007. SCAM 2007. "To encourage open source/libre software development, it is desirable to have tools that can help to identify open source license violations. This paper describes the implementation of a tool that matches open source programs ..."

References
• Philip Bille. A survey on tree edit distance and related problems. Theoretical Computer Science, Volume 337, Issue 1-3 (June 2005).
• Rui Yang, Panos Kalnis, Anthony K. H. Tung. Similarity Evaluation on Tree-structured Data. SIGMOD 2005.
• Optional reference: J. P. Vert. "A tree kernel to analyze phylogenetic profiles". Bioinformatics, 2002, Oxford Univ Press.

Chapter 5: Graph Similarity Search
Searching and Mining Complex Structures
Anthony K. H. Tung (鄧锦浩)
School of Computing, National University of Singapore
www.comp.nus.edu.sg/~atung
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
Social Network Link: http://www.renren.com/profile.do?id=313870900

Outline
• Introduction
• Foundation
• State of the Art on Graph Matching
  • Exact Graph Matching
  • Error-Tolerant Graph Matching
• Search Graph Databases
  • Graph Indexing Methods
• Our Works
  • Star Decomposition
  • Sorted Index For Graph Similarity Search

Smart Graphs
• Examples: chemical compound, protein structure, program flow, coil, image, fingerprint, letter, shape

Motivation
• Why graph?
  • Graph is ubiquitous
  • Graph is a general model
  • Graph has diversity
  • Graph problem is complex and challenging
• Why graph search?
  • Manifold application areas
    • 2D and 3D image analysis
    • Video analysis
    • Document processing
    • Biological and biomedical applications

Graph Search
• Definition
  • Given a graph database D and a query graph Q, find all graphs in D supporting the users’ requirements:
    • The same as Q
    • Containing Q or contained by Q
    • Similar to Q
    • Similar to a subgraph of Q
• Challenge
  • How to efficiently compare two graphs?
  • How to reduce the number of pairwise graph comparisons?

How to efficiently compare two graphs?
• The graph matching problem
  • Graph matching is the process of finding a correspondence between the vertices and the edges of two graphs that satisfies some (more or less stringent) constraints ensuring that similar substructures in one graph are mapped to similar substructures in the other.

How to reduce the number of pairwise graph comparisons?
• Scalability issue
  • A full database scan
  • Complex graph matching between a pair of graphs
• Index mechanisms are needed

Categories of Matching
(figure only)

Exact Graph Matching
• Graph Isomorphism
  • Two graphs G1 = (V1,E1) and G2 = (V2,E2) are isomorphic if there is a bijective function f: V1 → V2 such that for all u, v ∈ V1: {u,v} ∈ E1 ↔ {f(u),f(v)} ∈ E2
• Induced Subgraph
  • A subset of the vertices of a graph together with all edges whose endpoints are both in this subset
• Subgraph Isomorphism
  • An isomorphism holds between one of the two graphs and an induced subgraph of the other

Graph Similarity Measure
• Graph Edit Distance
  • The minimum amount of distortion that is needed to transform one graph into another
  • The edit operations ei can be deletions, insertions, and substitutions of vertices and edges
  • (figure: G1 is transformed into G2 step by step)
• Graph Edit Distance (GED)
  • Given two attributed graphs G1 = (V1,E1, Σ, l1) and G2 = (V2,E2, Σ, l2), the GED between them is defined as

\[
GED(G_1, G_2) = \min_{(e_1, \dots, e_k) \in T(G_1, G_2)} \sum_{i=1}^{k} c(e_i)
\]

  • where T(G1,G2) denotes the set of edit paths transforming G1 into G2, and c denotes the edit cost function measuring the cost c(ei) of edit operation ei
• GED provides a general dissimilarity measure for graphs
• Most works on inexact graph matching focus on the GED computation problem

Exact Matching Algorithms
• Tree search based algorithms
  • Ullmann’s algorithm
  • VF and VF2 algorithm
• Other algorithms
  • Nauty algorithm

Tree Search based Algorithms
• Basic Idea
  • A partial match (initially empty) is iteratively expanded by adding new pairs of matched vertices
  • The pair is selected using some necessary conditions, usually also some heuristic condition to prune unfruitful search paths
  • The algorithm ends when it finds a complete matching, or no further vertex pairs may be added (backtracking)
  • For attributed graphs, the attributes of vertices and edges can be used to constrain the desired matching

The Backtracking Algorithm
• Depth-First Search (DFS):
  • progresses by expanding the first child node of the search tree
  • going deeper and deeper until a goal node is found, or until it hits a node that has no children
  • (figure: search tree with nodes numbered in DFS visiting order)
• Branch and Bound (B&B):
  • BFS (breadth-first search)-like search for the optimal solution
  • Branch: a set of solution candidates is split into two or more smaller sets
  • Bound: a procedure that computes upper and lower bounds
  • (figure: search tree with nodes numbered in BFS visiting order)

Tree Search based Algorithms
• Ullmann’s Algorithm (DFS)
  • A refinement procedure based on a matrix of possible future matched vertex pairs to prune unfruitful matches
  • The simple enumeration algorithm for the isomorphisms between a graph G1 and a subgraph of another graph G2 with the adjacency matrices A1 and A2
  • An M’ matrix with |V1| rows and |V2| columns can be used to permute the rows and columns of A2 to produce a further matrix P:

\[
P = M' (M' A_2)^T
\]

  • If \( (a^1_{i,j} = 1) \Rightarrow (p_{i,j} = 1) \) for all i, j, then M’ specifies an isomorphism between G1 and a subgraph of G2.

• Example for the permutation matrix
  • The elements of M’ are 1’s and 0’s, such that each row contains exactly one 1 and each column contains at most one 1

\[
M' = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix},
\qquad
A_2 = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}
\]

\[
P = M' (M' A_2)^T
= M' \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \end{bmatrix}^T
= M' \begin{bmatrix} 0 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}
= \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}
\]

• Construction of another matrix M0 with the same size as M’

\[
m^0_{i,j} =
\begin{cases}
1 & \text{if } \deg(V_{2j}) \ge \deg(V_{1i}) \\
0 & \text{otherwise}
\end{cases}
\qquad m^0_{i,j} \in \{0, 1\}
\]

  • Generation of all M’ by setting all but one of the 1’s in each row of M0 to 0
  • A subgraph isomorphism has been found if \( (a^1_{i,j} = 1) \Rightarrow (p_{i,j} = 1) \)
  • For the example graphs:

\[
A_1 = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix},
\qquad
A_2 = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix},
\qquad
M^0 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}
\]

• An example of the search tree (matrices written row by row)
  • Root: M0 = [1111; 1111; 0100]
  • Level 1 (row 1 fixed): [1000; 1111; 0100], [0100; 1111; 0100], [0010; 1111; 0100], [0001; 1111; 0100]
  • Leaves (candidate M’): [1000; 0010; 0100], [1000; 0001; 0100], [0010; 1000; 0100], [0010; 0001; 0100], [0001; 1000; 0100], [0001; 0010; 0100]
  • For each leaf, P = M’(M’A2)^T is compared with A1; for M’ = [1000; 0010; 0100], P = [001; 001; 110] = A1, so M’ is a subgraph isomorphism
  • (figure: the vertex correspondences between G1 and G2 for each leaf)

• Ullmann’s Algorithm
  • One of the most widely used algorithms
• VF or VF2
  • VF defines a heuristic based on the analysis of vertices adjacent to the ones already considered in
the partial mapping
  • VF2 reduces the memory requirement from O(n^2) to O(n)
• Other methods: Nauty Algorithm
  • Constructs the automorphism group of each of the input graphs and derives a canonical labeling. The isomorphism can be checked by verifying the equality of the adjacency matrices of the canonical forms.

Exact Graph Matching
• Summary
  • The matching problems are all NP-complete except for graph isomorphism, which is in NP but is not known to be either in P or NP-complete.
  • Exact isomorphism is very seldom used. Subgraph isomorphism can be effectively used in many contexts.
  • Exact graph matching has exponential time complexity in the worst case.
  • Ullmann’s algorithm, the VF2 algorithm and the Nauty algorithm are the most used algorithms. Most modified algorithms adopt some conditions to prune unfruitful partial matchings.

Error-Tolerant Graph Matching
• GED Computation
  • Optimal algorithms
    • Exact GED computation requires isomorphism testing
    • Tree search based algorithms (A* based algorithms)
  • Suboptimal algorithms
    • Heuristic algorithms
    • Formulated as a BLP problem

A* Algorithm
• A tree search based algorithm
  • Similar to isomorphism testing
  • The difference is that the vertices of the source graph can potentially be mapped to any vertex of the target graph
• The search tree is constructed dynamically
  • by creating successor vertices linked by edges to the current vertex
• A heuristic function is usually used to
  • determine the vertex for expansion

Exact GED Computation
• Summary
  • The complexity is exponential in the number of vertices of the involved graphs.
  • For graphs with unique vertex labels the complexity is linear.
  • Exact graph edit distance is feasible for small graphs only.
  • Several suboptimal methods have been proposed to speed up the computation and make GED applicable to large graphs.
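To make the exponential cost of exact GED concrete, here is a minimal brute-force sketch. It is not the A* algorithm from the slides: it simply enumerates every vertex mapping. It assumes unit costs for vertex and edge insertions, deletions, and label substitutions, undirected unlabeled edges, and vertex labels that are never `None`; the function name `ged_bruteforce` is a hypothetical choice.

```python
from itertools import permutations

def ged_bruteforce(labels1, edges1, labels2, edges2):
    """Exact unit-cost GED by enumerating all vertex mappings (tiny graphs only)."""
    n1, n2 = len(labels1), len(labels2)
    n = n1 + n2
    # Pad both graphs with dummy vertices (None) so that any vertex may be
    # deleted (mapped to a dummy) or inserted (a dummy mapped to it).
    lab1 = list(labels1) + [None] * n2
    lab2 = list(labels2) + [None] * n1
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    best = None
    for perm in permutations(range(n)):
        # Vertex costs: substitution, deletion or insertion each cost 1.
        cost = sum(1 for i in range(n) if lab1[i] != lab2[perm[i]])
        # Edge costs: an edge present on one side only is deleted or inserted.
        for i in range(n):
            for j in range(i + 1, n):
                in1 = frozenset((i, j)) in e1
                in2 = frozenset((perm[i], perm[j])) in e2
                if in1 != in2:
                    cost += 1
        if best is None or cost < best:
            best = cost
    return best

# Triangle vs. path on three equally labeled vertices: delete one edge.
triangle = (["C", "C", "C"], [(0, 1), (1, 2), (0, 2)])
path = (["C", "C", "C"], [(0, 1), (1, 2)])
print(ged_bruteforce(*triangle, *path))  # 1
```

The loop visits (|V1| + |V2|)! mappings, so even graphs with five or six vertices each are already slow, which is why the suboptimal methods that follow matter in practice.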
Bipartite Matching for GED
• A Heuristic Algorithm
  • A suboptimal procedure for the GED computation based on the Hungarian algorithm (i.e., Munkres' algorithm).
  • The Hungarian algorithm is used as a tree search heuristic.
  • Much faster than the exact computation and the other suboptimal methods
  • Applicable to larger graphs

Bipartite Matching for GED
• Assignment Problem
  • Find an optimal assignment of n elements in a set S1 = {u1, …, un} to n elements in a set S2 = {v1, …, vn}
  • Let c_{ij} be the cost of the assignment (ui → vj)
  • The optimal assignment is a permutation P = (p1, …, pn) of the integers 1, …, n that minimizes $\sum_{i=1}^{n} c_{i p_i}$

[Figure: bipartite graph between S1 and S2 with edge costs c11, c12, c13, …]

Bipartite Matching for GED
• Assignment Problem
  • Given the n × n matrix M_c = (c_{ij}) of the assignment costs
  • This problem can be formulated as finding a set of n independent elements of M_c (no two in the same row or column) with minimum sum

\[ M_c = \begin{bmatrix}1&5&4\\5&7&6\\5&8&8\end{bmatrix} \]

  • The Hungarian algorithm finds the minimum cost assignment in O(n^3) time.
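To make the assignment problem concrete, here is a small Python sketch that finds the minimum-cost permutation by exhaustive search. This is deliberately not the Hungarian algorithm: it runs in O(n!) time and is only meant to show what is being minimized. The function name `min_cost_assignment` and the 3 × 3 cost values are illustrative assumptions.

```python
from itertools import permutations

def min_cost_assignment(cost):
    """Return (P, total) where P[i] is the column assigned to row i
    and total = sum_i cost[i][P[i]] is minimal (exhaustive search)."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return best, sum(cost[i][best[i]] for i in range(n))

# Illustrative cost matrix: rows are elements of S1, columns elements of S2.
cost = [[1, 5, 4],
        [5, 7, 6],
        [5, 8, 8]]

assignment, total = min_cost_assignment(cost)
# assignment == (0, 2, 1): u1 -> v1, u2 -> v3, u3 -> v2, with total cost 15
```

A polynomial-time solver such as the Hungarian algorithm returns the same optimum; the exhaustive version is only practical for very small n.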
Bipartite Matching for GED
• Main Idea
  • Construct a vertex cost matrix Mcv and an edge cost matrix Mce
  • For each open vertex v in the search tree, run the Hungarian algorithm on Mcv and Mce
  • The accumulated minimum cost of both assignments serves as a lower bound for the future costs to reach a leaf node
  • h(P) = Hungarian(Mcv) + Hungarian(Mce) is the tree search heuristic
  • Returns a suboptimal solution as an upper bound of GED

Suboptimal Algorithms
• Binary Linear Programming (BLP)
  • Use the adjacency matrix representation to formulate a BLP
  • Compute GED between G0 and G1
  • Edit grid

Binary Linear Programming
• Isomorphisms of G0 on the edit grid
• State vectors

Binary Linear Programming
• Definition:
• Objective Function:

Binary Linear Programming
• Lower Bound: linear program (O(n^7))
• Upper Bound: assignment problem (O(n^3))

Summary
• The complexity of exact GED computation is exponential, which is unacceptable for large graphs.
• Suboptimal methods solve the graph matching problem by quickly returning a suboptimal solution and can be applied to larger graphs.
• An important application of the graph matching problem is searching a graph database.

Outline
• Introduction
• Foundation
• State of the Art on Graph Matching
  • Exact Graph Matching
  • Error-Tolerant Graph Matching
• Search Graph Databases
  • Graph Indexing Methods
• Our Works
  • Star Decomposition
  • Sorted Index For Graph Similarity Search

Graph Search Problem
• Query a graph database
  • Given a graph database D and a query graph Q, find all graphs in D that satisfy the user's requirements.
  • Full graph search (all match)
  • Subgraph search (partial match or containment search)
  • Similarity full graph search (based on GED)
  • Similarity subgraph search (based on GED)

Scalability Issue
• On-line searching algorithm
  [Figure: all 100,000 database graphs are checked to produce the answer]
• A full sequential scan
  • I/O costs
  • Subgraph isomorphism testing (GED computation)
• An indexing mechanism is needed

Indexing Graphs
• Indexing is crucial
  [Figure: without an index, all 100,000 graphs are checked; with an index, filtering reduces the 100,000 graphs to about 100 candidates, which are then checked to produce the answer]

Indexing Strategy
• Filter-and-refine framework based on features
  Step 1. Index Construction
    Enumerate smaller units (features) in the database, build an index between units and graphs
  Step 2. Query Processing
    Enumerate smaller units in the query graph
    Use the index to first filter out non-candidates
    Prune the answers by exact checking

Indexing Strategy
• Feature-based indexing methods
  • Break the database graphs into smaller units like paths, trees, and subgraphs, and use them as filtering features
  • Build an inverted index between the smaller units and the database graphs
  • Filter graphs based on the number of smaller units or their locality information

Feature-based Indexing Systems

  System      Smaller units   Query
  GraphGrep   path            Contain (containment search)
  SING        path            Contain
  gIndex      graph           Contain + Edge relaxation
  FGIndex     graph           Contain
  TREE∆       tree+graph      Contain
  Treepi      tree            Contain
  κ-AT        tree            Full similarity search
  CTree       -               Contain + Edge relaxation

Path-based Algorithms
[Figures from http://ews.uiuc.edu/~xyan/tutorial/kdd06_graph.htm]

Path-based Algorithms: problem
[Figure from http://ews.uiuc.edu/~xyan/tutorial/kdd06_graph.htm]

Feature-based Methods: limitation
• Problem:
  • For similarity search, filtering is done by inferring the edit distance bound through the smaller units that exactly match the query structure
    • A rough bound
    • Not effective for large graphs (because features that may be rare in small graphs are likely to be found in enormous graphs just by chance)

Graph Similarity Search Problem
• Definition
  • Given a graph database D and a query structure Q, similarity search is to find all the graphs in D that are similar to Q based on GED.
• Two challenges in the filter-and-refine framework:
  • How to efficiently compute more effective edit distance bounds between two graphs for filtering?
  • How to reduce the number of pairwise graph dissimilarity computations to speed up the graph search?
Our Solutions
• Work 1: Star decomposition
  • Break each graph into a multiset of stars
  • Propose new effective and efficient lower and upper GED bounds by finding a mapping between the star sets of two graphs using the Hungarian algorithm
• Work 2: Sorted index for graph similarity search
  • Propose a novel indexing and query processing framework
  • Deploy a filtering strategy based on TA and CA methods to reduce the number of pairwise graph dissimilarity computations

Comparing Stars: On Approximating Graph Edit Distance
Zhiping Zeng, Anthony K.H. Tung, Jianyong Wang, Jianhua Feng, Lizhu Zhou

Star Decomposition
• Star structure
  • A star structure s is an attributed, single-level, rooted tree which can be represented by a 3-tuple s = (r, L, l), where r is the root vertex, L is the set of leaves and l is a labeling function.
• Star representation for graph
  • A graph can be broken into a multiset of star structures
  [Figure: a graph G1 with vertex labels a, b, c, d and its decomposition into stars]

Star Decomposition
• Star edit distance
  • Given two star structures s1 and s2,
    • λ(s1, s2) = T(r1, r2) + d(L1, L2)
    • where T(r1, r2) = 0 if l(r1) = l(r2); otherwise T(r1, r2) = 1
    • d(L1, L2) = ||L1| − |L2|| + M(L1, L2)
    • M(L1, L2) = max{|Ψ_{L1}|, |Ψ_{L2}|} − |Ψ_{L1} ∩ Ψ_{L2}|, where Ψ_L is the multiset of labels of the leaves in L
  Example: given s1 with root a and leaves b, c, c, and s2 with root d and leaves c, c,
    T(r1, r2) = 1, as the root labels a and d differ;
    d(L1, L2) = |3 − 2| + (3 − 2) = 2;
    λ(s1, s2) = 1 + 2 = 3.
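The star edit distance above can be sketched in a few lines of Python. This is a minimal illustration that assumes a star is given as a root label plus a list of leaf labels; the function name `star_edit_distance` is made up for this sketch, and `collections.Counter` stands in for the multiset of leaf labels.

```python
from collections import Counter

def star_edit_distance(root1, leaves1, root2, leaves2):
    """lambda(s1, s2) = T(r1, r2) + d(L1, L2) for two labeled stars."""
    # T(r1, r2): 0 if the root labels agree, 1 otherwise.
    t = 0 if root1 == root2 else 1
    # Multisets of leaf labels; '&' keeps the minimum count per label.
    psi1, psi2 = Counter(leaves1), Counter(leaves2)
    common = sum((psi1 & psi2).values())
    # M(L1, L2) = max(|Psi1|, |Psi2|) - |Psi1 ∩ Psi2|
    m = max(len(leaves1), len(leaves2)) - common
    # d(L1, L2) = ||L1| - |L2|| + M(L1, L2)
    d = abs(len(leaves1) - len(leaves2)) + m
    return t + d

# The slide's example: s1 = root a with leaves b, c, c; s2 = root d with leaves c, c.
example = star_edit_distance("a", ["b", "c", "c"], "d", ["c", "c"])
# example == 3
```

Using a multiset rather than a set matters here: with plain sets, the two c leaves would be counted once and the intersection size would be wrong.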
Star Decomposition
• Mapping distance
  • Given two multisets of star structures S(G1) and S(G2) from two graphs G1 and G2 with the same cardinality, and assume P: S(G1) → S(G2) is a bijection. The mapping distance between G1 and G2 is
\[ \mu(G_1, G_2) = \min_{P} \sum_{s \in S(G_1)} \lambda(s, P(s)) \]
  • This problem can be formulated as the assignment problem. Given a distance cost matrix between two star multisets, the mapping distance can be computed using the Hungarian algorithm.

A Simple Example
[Figure: a worked example of the mapping distance between two graphs]

Bounds of GED
• Lower Bound
  • Let G1 and G2 be two graphs, then the mapping distance μ(G1, G2) between them satisfies
\[ \mu(G_1, G_2) \le \max\{4, [\min\{\delta(G_1), \delta(G_2)\} + 1]\} \cdot \lambda(G_1, G_2) \]
  • Based on the above Lemma, μ provides a lower bound Lm of λ, i.e.,
\[ L_m(G_1, G_2) = \frac{\mu(G_1, G_2)}{\max\{4, [\min\{\delta(G_1), \delta(G_2)\} + 1]\}} \le \lambda(G_1, G_2) \]
  Constructing the cost matrix takes Θ(n^3), and running the Hungarian algorithm takes O(n^3).

Bounds of GED
• Upper bound
  • The first upper bound τ comes naturally during the computation of μ
  • The output of the Hungarian algorithm when computing μ leads to a mapping P' from V(G1) to V(G2)
  • Recall the BLP method: the exact GED is computed as $\lambda(G_1, G_2) = \min_{P} C(G_1, G_2, P)$
  • Therefore, $\tau(G_1, G_2) = C(G_1, G_2, P')$ is a natural upper bound
  The mapping P' might not be optimal, so τ(G1, G2) ≥ λ(G1, G2). C(G1, G2, P') is solved in Θ(n^2) time, therefore τ can be computed in Θ(n^3) time.

Bounds of GED
• Refined upper bound ρ: main idea
  • Given any two vertices v1 and v2 in G1 and their corresponding mappings f(v1) and f(v2) in G2 (assuming f is the mapping function corresponding to P'), we swap f(v1) and f(v2) if this reduces the edit distance.
  [Figure: vertex mappings between G1 and G2 before and after a swap]
  The new mappings obtained might lead to better or worse bounds. Refining to obtain a better bound ρ takes O(n^6).

Filtering Strategy
• Integrating all the GED bounds into a filter-and-refine framework
  • Filtering features: Lm ≤ λ ≤ ρ ≤ τ.
  • Filtering orders: bounds with lower computation complexity are deployed first.

Full Graph Similarity Search
• Problem
  • Given a graph database D and a query structure Q, find all the graphs Gi in D with λ(Q, Gi) ≤ d (d is a threshold).
• AppFULL algorithm:
  • if Lm(Q, Gi) > d, Gi can be safely filtered;
  • if τ(Q, Gi) ≤ d, Gi can be reported as a result directly;
  • if ρ(Q, Gi) ≤ d, Gi can be reported as a result directly;
  • otherwise, λ(Q, Gi) must be computed.

Subgraph Exact Search
• Lemma
  • Given two graphs G1 and G2, if no vertex relabelling is allowed in the edit operations, μ'(G1, G2) ≤ 4 · λ'(G1, G2), where μ' and λ' are computed without vertex relabelling.
  • (This Lemma can be used in subgraph search, because if a graph is subgraph-isomorphic to another graph, no vertex relabelling happens.)
• AppSUB algorithm:
  • Filtering based on the lower bound.

Experimental Results
• Compare with the exact algorithm
  1,000 graphs were generated, D = 1k, T = 10, V = 4. 10 seed graphs are randomly selected to form D; a seed has 10 vertices. There are 6 query groups of 10 graphs each. Graphs in the same group have the same number of vertices.
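The AppFULL steps listed earlier in this part can be sketched as a filtering cascade. This is a hedged illustration, not the authors' implementation: the function name `app_full`, the callables passed in, and the bound values in the table are all hypothetical, chosen only so that each branch is exercised while respecting Lm ≤ λ ≤ ρ ≤ τ.

```python
def app_full(graphs, d, lower, tau, rho, exact):
    """Return the graphs whose edit distance to the query is at most d,
    calling the expensive exact() only when no bound decides."""
    answers = []
    for g in graphs:
        if lower(g) > d:          # Lm > d: safely filtered
            continue
        # 'or' short-circuits, so the cheaper bounds are tried first
        # and exact() runs only if tau and rho both exceed d.
        if tau(g) <= d or rho(g) <= d or exact(g) <= d:
            answers.append(g)
    return answers

# Hypothetical precomputed values for five database graphs.
bounds = {
    "g1": {"Lm": 5, "tau": 9, "rho": 8, "ged": 6},  # filtered by Lm
    "g2": {"Lm": 1, "tau": 2, "rho": 2, "ged": 2},  # accepted by tau
    "g3": {"Lm": 1, "tau": 5, "rho": 3, "ged": 2},  # accepted by rho
    "g4": {"Lm": 2, "tau": 6, "rho": 5, "ged": 4},  # exact check, rejected
    "g5": {"Lm": 2, "tau": 6, "rho": 4, "ged": 3},  # exact check, accepted
}
exact_calls = []

def exact(g):
    exact_calls.append(g)  # record how often the costly step runs
    return bounds[g]["ged"]

answers = app_full(bounds, 3,
                   lambda g: bounds[g]["Lm"],
                   lambda g: bounds[g]["tau"],
                   lambda g: bounds[g]["rho"],
                   exact)
```

In this toy run only two of the five graphs reach the exact computation, which is the point of ordering the bounds by cost.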
Experimental Results
• Compare with the BLP method [figure]
• Scalability over real datasets [figure]
• Scalability over synthetic datasets [figure]
• Performance of AppFULL [figure]
• Performance of AppSUB [figure]

SEGOS: SEarch similar Graphs On Star index
Xiaoli Wang, Xiaofeng Ding, Anthony K.H. Tung, Shanshan Ying, Hai Jin

Our Solutions
• Work 1: Scalability issue
  • A full database scan
  • An index mechanism is needed
• Existing indexing methods: Filtering power
  • Rough bounds with poor filtering power
• Work 2: Sorted index for graph similarity search
  • Propose a novel indexing and query processing framework
  • Deploy a filtering strategy based on TA and CA methods
  • All existing lower and upper GED bounds can be directly integrated into our filtering framework

TA Method on the Top-k Query
• The database model used in TA (N objects, M attributes)

  Object ID   A1     A2
  a           0.9    0.85
  b           0.8    0.7
  c           0.72   0.2
  d           0.6    0.9
  ...         ...    ...

  Sorted L1: (a, 0.9), (b, 0.8), (c, 0.72), ..., (d, 0.6)
  Sorted L2: (d, 0.9), (a, 0.85), (b, 0.7), ..., (c, 0.2)
TA Method on the Top-k Query
• A simple query
  • Find the top-2 objects for the query 'A1 & A2'
  • For this query, the TA method combines the scores of A1 and A2 with an aggregation function such as sum(A1, A2)
  Aggregation function: a function that gives objects an overall score based on their attribute scores. Examples: sum, min. The function must be monotone.

Monotonicity in TA (Halting Condition)
• Main idea
  • How do we know that the scores of seen objects are higher than the scores of unseen objects?
  • Predict the maximum possible score of unseen objects:

  L1: a: 0.9, b: 0.8, c: 0.72 (seen) | ..., f: 0.6, ..., d: 0.6 (possibly unseen)
  L2: d: 0.9, a: 0.85, b: 0.7 (seen) | ..., f: 0.65, ..., c: 0.2 (possibly unseen)
  Threshold value: ω = sum(0.72, 0.7) = 1.42

A Top-2 Query Example
• Given 2 sorted lists for attributes A1 and A2:

  L1: (a, 0.9), (b, 0.8), (c, 0.72), ..., (d, 0.6)
  L2: (d, 0.9), (a, 0.85), (b, 0.7), ..., (c, 0.2)

• Step 1
  • Sorted access in parallel to every sorted list
  • For each object seen:
    • get all scores by random access
    • determine sum(A1, A2)
    • if it is amongst the 2 highest seen, keep it in the buffer

  Buffer after the first round (a seen in L1, d seen in L2):

  ID   A1    A2     sum(A1,A2)
  a    0.9   0.85   1.75
  d    0.6   0.9    1.5

• Step 2
  • Determine the threshold value from the scores last seen under sorted access: ω = sum(L1, L2)
  • If 2 objects have an overall score ≥ ω, stop
  • else go to the next entry position in the sorted lists and go to step 1

  After the first round: ω = sum(0.9, 0.9) = 1.8. The buffered score 1.5 is below ω, so continue.

• Step 1 (again)
  • b is seen in L1 and a in L2. Random access gives b: 0.8 + 0.7 = 1.5, which does not exceed the score of d, so the buffer still holds a (1.75) and d (1.5).

• Step 2 (again)
  • ω = sum(0.8, 0.85) = 1.65. The buffered score 1.5 is still below ω, so continue.

• Situation at stopping
  • After the third round, ω = sum(0.72, 0.7) = 1.42 < 1.5, so the top-2 result is a (1.75) and d (1.5).

TA-based Filtering Strategy for Graph Search Problem
• Main idea
  • Each graph is broken into a multiset of stars
  • Each distinct star generated from the database graphs can be seen as an index attribute in the TA database model
  • Each entry in the sorted lists contains the graph identity (denoted by gi) and its score (denoted by λ) in that star attribute; the score is defined as the star edit distance between a star of gi and the index star
  • Halting condition: given m sorted lists, if the aggregated value ω = sum(λ1, …, λm) > d (d is the threshold on the graph mapping distance derived from the edit distance threshold), TA stops.
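The top-2 walk-through above can be reproduced with a short Python sketch of the threshold algorithm. It assumes every object appears in every list and that sum is the aggregation function; scores are stored as integer hundredths (90 for 0.9) purely to avoid floating-point ties, and only the four objects a, b, c, d from the slides are included. The function name `threshold_algorithm` is illustrative.

```python
def threshold_algorithm(lists, k):
    """lists: one list per attribute of (object_id, score) pairs,
    sorted by descending score. Returns (top_k, rounds_used)."""
    lookup = [dict(lst) for lst in lists]       # random access tables
    totals = {}                                 # aggregated score per seen object
    for depth in range(len(lists[0])):
        # Step 1: sorted access to each list, then random access for new objects.
        for lst in lists:
            obj = lst[depth][0]
            if obj not in totals:
                totals[obj] = sum(table[obj] for table in lookup)
        # Step 2: threshold = aggregate of the scores last seen under sorted access.
        threshold = sum(lst[depth][1] for lst in lists)
        # sorted() is stable, so ties keep the order in which objects were seen.
        top = sorted(totals.items(), key=lambda kv: -kv[1])[:k]
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth + 1
    return sorted(totals.items(), key=lambda kv: -kv[1])[:k], len(lists[0])

L1 = [("a", 90), ("b", 80), ("c", 72), ("d", 60)]
L2 = [("d", 90), ("a", 85), ("b", 70), ("c", 20)]

top2, rounds = threshold_algorithm([L1, L2], k=2)
# top2 == [("a", 175), ("d", 150)] after 3 rounds (threshold 142 <= 150)
```

The thresholds over the three rounds are 180, 165 and 142, matching ω = 1.8, 1.65 and 1.42 in the slides.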
TA-based Filtering Strategy for Graph Search Problem
• Challenges:
  • How do we know that the mapping distances of unseen graphs exceed the distance threshold (so that these graphs can be safely filtered out)?
  • Predict the minimum possible mapping distance for unseen graphs:

  L1: g1: 0, g2: 1, g3: 2 (seen) | ..., g6: 5, ..., g4: 6 (possibly unseen)
  L2: g4: 0, g1: 1, g2: 3 (seen) | ..., g6: 5, ..., g3: 9 (possibly unseen)
  Threshold value: ω = sum(2, 3) = 5 > d (= 4)

TA-based Filtering Strategy for Graph Search Problem
• A graph database with a query example
  [Figure: example database graphs, a query graph, and the resulting sorted lists L1, L2 and L3]

Requirement
• An index structure
  • Convenient for score-sorted lists construction
• Efficient star search algorithm
  • Quickly return similar stars to a query star
• Sorted properties for the halting condition of TA
  • The mapping distance of any unseen graph gi satisfies
    • μ(q, gi) ≥ ω = sum(λ1, …, λm) > d (= τ∗δ')
    • q is the query graph, τ is the distance threshold, and
    • δ' = max{4, [min{δ(q), δ(D')} + 1]}
    • where D' is the set of all unseen graphs.

Recall: the mapping distance in our previous work satisfies
  • μ(q, gi) ≤ max{4, [min{δ(q), δ(gi)} + 1]} · λ(q, gi)
  • We denote δ(q, gi) = max{4, [min{δ(q), δ(gi)} + 1]}, then δ(q, gi) ≤ δ'.
  • If μ(q, gi) > τ∗δ', then λ(q, gi) ≥ μ(q, gi)/δ(q, gi) > τ∗δ'/δ(q, gi) ≥ τ, and this graph can be safely filtered out.
Build Inverted Index Structures based on the Star Decomposition
• The upper-level index
  • Build an inverted index between stars and graphs
  • Used to quickly return graph lists
• The lower-level index
  • Build an inverted index between labels and stars
  • Used to construct the sorted lists
    • for top-k star search based on TA
    • filtering strategy

Build Inverted Index Structures based on the Star Decomposition
[Figure: the two-level inverted index]

Top-k Star Search Algorithm
• Construct sorted lists [figure]

Graph Score-sorted Lists
• Construct lists based on the top-k results [figure]

TA-based Graph Range Query
• Definition
  • Given a graph database D and a query q, find all gi ∈ D that are similar to q with λ(q, gi) ≤ τ, where τ is the distance threshold.
• Steps: given m sorted lists for a query graph q
  1. Perform sorted retrieval in a round-robin schedule on each sorted list. For a retrieved graph gi, if Lm(q, gi) > τ, filter out the graph; if Um(q, gi) ≤ τ, add the graph to the answer set.
  2. For each sorted list SLj, let χj be the corresponding distance last seen under sorted access. If ω = sum(χ1, …, χm) > τ∗δ', then halt. Otherwise, go to step 1.
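A small Python sketch of the halting test in step 2 above, using the definition δ' = max{4, [min{δ(q), δ(D')} + 1]} from the Requirement slide. The function and parameter names are illustrative, and the degree values in the example are assumptions picked so that τ∗δ' equals the d = 4 used in the slides' illustration.

```python
def delta_prime(delta_q, delta_unseen):
    """delta' = max{4, min{delta(q), delta(D')} + 1}, where delta_unseen
    stands for delta(D') over the graphs not yet seen."""
    return max(4, min(delta_q, delta_unseen) + 1)

def ta_can_halt(last_seen, tau, delta_q, delta_unseen):
    """True when omega = sum of the distances last seen under sorted
    access exceeds tau * delta', so every unseen graph can be filtered."""
    omega = sum(last_seen)
    return omega > tau * delta_prime(delta_q, delta_unseen)

# Assumed values: tau = 1 and both degree terms equal to 3 give delta' = 4, d = 4.
halt_late = ta_can_halt([2, 3], tau=1, delta_q=3, delta_unseen=3)   # omega = 5 > 4
halt_early = ta_can_halt([1, 1], tau=1, delta_q=3, delta_unseen=3)  # omega = 2
```

The strict inequality matters: with ω exactly equal to τ∗δ', an unseen graph could still have edit distance τ and must not be discarded.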
CA-based Filtering Strategy
• The difference between TA and CA
  • TA computes the mapping distance between two graphs whenever a new graph is retrieved through sorted access
  • CA defers this work: only at each depth h of the sorted scan does it process the seen but unprocessed graphs. It first uses estimated mapping distance bounds to filter graphs, and then uses the incremental Hungarian algorithm to compute partial mapping distances for further filtering

CA-based Filtering Strategy
• Suppose l(g) = {l1, …, ly} ⊆ {1, 2, …, m} is the set of known lists of g seen below q. Let χ(g) be the multiset of distances of the distinct stars of g last seen in the known lists.
  • The lower bound, denoted by Lμ(q, g), is obtained by substituting the missing lists j ∈ {1, 2, …, m} \ l(g) with χj (the distance last seen under the j-th list) in ζ(q, g)
  • The upper bound, denoted by Uμ(q, g), is computed as Uμ(q, g) = t′(χ(g)) + χ ∗ (|g| − |χ(g)|)
• Theorem: Let g1 and g2 be two graphs; the bounds obtained as above satisfy
  ζ(g1, g2) ≤ Lμ(g1, g2) ≤ μ(g1, g2) ≤ Uμ(g1, g2)

CA-based Filtering Strategy
• Dynamic Hungarian for partial mapping distance
  • Given m sorted lists for q, suppose S′(g) ⊆ S(g) is the multiset of stars in g seen below the lists. Then we have
    μ(S(q), S′(g)) ≤ μ(q, g)

CA-based Graph Range Query
• Steps: given m sorted lists for a query graph q
  • Perform sorted retrieval in a round-robin schedule on each sorted list. At each depth h of the lists:
    • Maintain the lowest values χ1, …, χm encountered in the lists. Maintain a distance accumulator ζ(q, gi) and a multiset of retrieved stars S′(gi) ⊆ S(gi) for each gi seen under the lists.
    • For each gi that is retrieved but unprocessed: if ζ(q, gi) > τ∗δgi, filter it out; if Lμ(q, gi) > τ∗δgi, filter it out; if Uμ(q, gi) ≤ τ∗δgi, add the graph to the candidate set. Otherwise, if μ(S(q), S′(gi)) > τ∗δgi, filter out the graph.
      Finally, run the Dynamic Hungarian to obtain Lm(q, gi) and Um(q, gi) for filtering.
  • When a new distance is updated, compute a new ω. If ω = t′(χ) > τ∗δ′, then halt. Otherwise, go to step 1.

Experimental Results
• Sensitivity test [figure]
• Index construction [figure]
• Comparison with other works, varying distance thresholds [figure]
• Comparison with other works, varying dataset sizes [figure]

References
• D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition.
• P. Foggia, C. Sansone, and M. Vento. A performance comparison of five algorithms for graph isomorphism. In 3rd IAPR-TC15 Workshop on Graph-based Representations in Pattern Recognition, 2001.
• K. Riesen, M. Neuhaus, and H. Bunke. Bipartite graph matching for computing the edit distance of graphs. In GBRPR, 2007.
• P. Hart, N. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. SSC, 1966.
• D. Justice. A binary linear programming formulation of the graph edit distance. IEEE TPAMI, 2006.
• R. Giugno and D. Shasha. GraphGrep: a fast and universal method for querying graphs. In ICPR, 2002.
• R. D. Natale, A. Ferro, R. Giugno, M. Mongiovì, A. Pulvirenti, and D. Shasha. SING: subgraph search in non-homogeneous graphs. BMC Bioinformatics, 2010.
• X. Yan, P. S. Yu, and J. Han. Graph indexing: a frequent structure-based approach. In SIGMOD, 2005.
• J. Cheng, Y. Ke, W. Ng, and A. Lu. FG-index: towards verification-free query processing on graph databases. In SIGMOD, 2007.
• D. W. Williams, J. Huan, and W. Wang. Graph database indexing using structured graph decomposition. In ICDE, 2007.
• S. Zhang, M. Hu, and J. Yang. TreePi: a novel graph indexing method. In ICDE, 2007.
• P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: tree + delta >= graph. In VLDB, 2007.
• G. Wang, B. Wang, X. Yang, and G. Yu. Efficiently indexing large sparse graphs for similarity search. IEEE TKDE, 2010.

Mining and Searching Complex Structures
Chapter 6 Massive Graph Mining
Anthony K. H. Tung(鄧锦浩)
School of Computing, National University of Singapore
www.comp.nus.edu.sg/~atung
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
Social Network Link: http://www.renren.com/profile.do?id=313870900

Graph applications: everywhere
And often, they are huge and messy.
[Figures: social network, bio pathway, co-authorship network]

Knowledge: NOWHERE
Unless we manage to find where it hides. Too many clues is like no clue.

Roadmap
Part 1 (1.5 hrs)
• Graph Mining Primer
• Recent advances in Massive Graph Mining
Part 2 (1.5 hrs)
• CSV: cohesive subgraph mining
• DNgraph mining: a triangle based approach

Roadmap
• Graph Mining Primer
  • Data mining vs. graph mining
  • Massive graph mining domains
  • Types of graph patterns
  • Properties of large graph structure
• Recent advances in Massive Graph Mining
• CSV: cohesive subgraph mining
• DNgraph mining: a triangle based approach

From Data Mining to Graph Mining
• Data Mining
  • Classification
  • Clustering
  • Association rule learning
• Graph Mining
  • Captures more complicated entity relationships.
  • Output: patterns, which are smaller subgraphs with interpretable meanings.
Massive graph mining domains
• Financial data analysis
• Bioinformatics networks
• User profiling for customized search
• Identifying financial crime

Financial data analysis
• In the stock market, correlations among stocks help in making profits.
• Mining stock correlation graphs supports predicting stocks' price changes for estimating future returns, allocating portfolios and controlling risks.
[Figure: stock correlations in tabular form, and the stock correlation patterns mined from them (highly correlated stock sets)]

Bioinformatics network
• Protein-protein interaction
  • The fundamental activity of living cells.
  • A dense graph pattern indicates that these proteins have similar functionalities.
[Figure: one representation of an assembled NEDD9 network]

User profiling for customized search
• The Internet Movie Database (IMDB)
  • Registered users can comment on movies of their interest.
  • Mining the comment-sharing network provides insight into users' interests and thus further facilitates customized search.
[Figure: movie-centric view of the IMDB review network]

Identify financial crime
• Large classes of financial crimes, such as money laundering, follow certain transactional patterns.
[Figures: geospatial information of suspects; a money laundering pattern]

Dense Graph Patterns
• Clique/Quasi-Clique
  • A clique represents the highest level of internal interactions.
  • A quasi-clique is an "almost" clique with few missing edges.
• High Degree Patterns
  • Concern the average vertex degree, where the degree of a vertex is the number of edges incident to it.

Dense graph patterns (cont.)
• Dense Bipartite Patterns
  [Figure: bipartite graph of pathways and genes for the AML/ALL dataset]
• Heavy Patterns
  [Figure: weighted, directed graph of an online citation network, by Rosvall & Bergstrom]

Properties of large graph structure
• Static
  • Power law degree distributions.
  • Small world phenomenon.
  • Communities and clusters.
• Dynamic
  • Shrinking diameters of enlarging graphs
  • Densification along time

Power law
[Figure: power law degree distribution]

Large graph: properties and laws (cont.)
• Dynamic
  • Shrinking diameters of enlarging graphs.
  • Densification along time
[Figure]

Roadmap
• Graph Mining Primer
  • Large graph: properties and laws
• Approaches in Graph Mining
  • Pattern based mining algorithms
• Practical techniques in Massive Graph Mining
  • Graph summarization with randomized sampling
  • Connectivity based traversal
  • MapReduce based
• CSV: cohesive subgraph mining
• DNgraph mining: a triangle based approach

Pattern based Mining algorithms
• Greedy methods
  • SUBDUE (PWKDD04), GBI (JAI94)
• Apriori-based approaches (detail in next few slides)
  • AGM, FSG, gSpan
• Inductive logic programming (ILP) oriented solutions
  • WARMR, FARMER
• Kernel based solutions
  • Kernels for graph classification

Apriori Paradigm Recall
• Search in breadth-first manner
• Use a lattice structure to count candidate subgraph sets efficiently.
[Figure: a search lattice for item set mining]

Apriori-based Graph Mining
• Performance bottleneck: candidate subgraph generation.
• Solution:
  1. Build a lexicographic order among graphs.
  2. Search using a depth-first strategy.
• Very effective in mining large collections of small to medium size graphs.
Graph summarization with randomized sampling
• Efficient Aggregation for Graph Summarization, SIGMOD 2008
• Graph Summarization with Bounded Error, SIGMOD 2008
• Mining graph patterns efficiently via randomized summaries, VLDB 2009

Efficient Aggregation for Graph Summarization
• As graph size increases, graph summarization becomes crucial for visualizing the whole graph.
• Criteria for an efficient summarization solution:
  • Able to produce meaningful summaries for real applications.
  • Scalable to large graphs.
• The choice: graph aggregation.

Graph aggregation
1. Summarization based on user-selected node attributes and relationships.
2. Produce summaries with controllable resolutions, with "drill-down" and "roll-up" abilities to navigate.
Two aggregation operations are proposed:
• SNAP addresses 1.
• k-SNAP addresses 2.

Operation SNAP
• Group nodes by user-selected node attributes and relationships.
• Nodes in each group are homogeneous (in terms of attributes and relationships).
• Goal: minimum number of groups.

How does SNAP work?
• Top-down approach.
• Initial step: use the user-selected attributes to group nodes.
• Iterative step: if a group is not homogeneous w.r.t. relationships, split the group based on its relationships with other groups.

SNAP limitations
• Homogeneity requirement for relationships: real data contain noise and uncertainty.
• Users have no control over the resolution of summaries.
• The SNAP operation can result in a large number of small groups.

Operation k-SNAP
• The entities inside a group are not necessarily homogeneous in terms of relationships with other groups.
• Users can control the resolution by specifying k (the number of groups).
• Varying k provides "drill-down" and "roll-up" abilities.
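The top-down SNAP procedure described above (group by attributes, then split groups that are not homogeneous with respect to relationships) can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes a single undirected relationship type, it splits every non-homogeneous group at once by the set of groups its members are connected to, and the names `snap`, `attrs`, and `adj` are hypothetical.

```python
def snap(attrs, adj):
    """Group nodes so that members share an attribute value and are
    connected to exactly the same set of groups.

    attrs: dict node -> attribute value (user-selected attribute)
    adj:   dict node -> set of neighbouring nodes (one relationship type)
    """
    group_of = dict(attrs)  # initial step: group by attribute value
    while True:
        # Signature of a node: its current group plus the groups it touches.
        sig = {
            v: (group_of[v], frozenset(group_of[u] for u in adj.get(v, ())))
            for v in attrs
        }
        ids = {}
        refined = {v: ids.setdefault(sig[v], len(ids)) for v in attrs}
        if len(ids) == len(set(group_of.values())):
            break  # every group is homogeneous: nothing left to split
        group_of = refined  # iterative step: split by relationships
    groups = {}
    for v, g in group_of.items():
        groups.setdefault(g, []).append(v)
    return sorted(sorted(members) for members in groups.values())


# Toy data: a, b, c share department "CS"; d, e share department "EE".
dept = {"a": "CS", "b": "CS", "c": "CS", "d": "EE", "e": "EE"}
links = {"a": {"d"}, "b": {"d"}, "d": {"a", "b"}}
print(snap(dept, links))
```

In the toy data, c and e have no links, so the attribute groups are split into nodes that participate in the CS-EE relationship and nodes that do not. This also shows the limitation noted above: a single noisy edge is enough to force an extra group.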
Assessing the quality of a summarization
• Determined by the sum of noisy relations, Δ.
• When the relationship between two groups is strong (>50%), count the missing participants.
• When the relationship between two groups is weak (<=50%), count the extra participants.

k-SNAP goal
• Find the summary of size k with the best quality, i.e. minimal Δ.
• Finding the exact solution that minimizes Δ is NP-complete.

Heuristics
• Top down: split a group into 2 at each iteration. Choose the group with the worst quality and split it.
• Bottom up: merge 2 groups into 1 at each iteration. Choose groups with the same attribute values and similar neighbors or similar participation ratios.

Major results
• Double-blind review's effect on LP authors.
• k-SNAP: top down vs. bottom up.
[Figures: experimental plots]

Graph Summarization with Bounded Error
• Large graph data needs compression.
• Compression can reduce the size to 1/10 (web graph).

Graph compression vs. clustering
• Compression: uses URLs and node labels for compression; the result lacks meaning.
• Clustering: works for generic graphs; gives no compression for space saving.
Solution: MDL based compression for graphs.

Intuition
• Many nodes have similar neighborhoods, e.g. communities in social networks, link-copying in webpages.
• Collapse such nodes into supernodes (clusters) and the edges into superedges.
  • A bipartite subgraph becomes two supernodes and a superedge.
  • A clique becomes a supernode with a "self-edge".
[Figure: nodes d, e, f, g and a, b, c collapsed into a summary with supernodes X = {d,e,f,g} and Y = {a,b,c}]

How to choose vertex sets to compress
MDL based compression
• S is a high-level summary graph.
• C is a set of edge corrections.
• Minimize the cost of S + C.
• Novel approximate representation: reconstructs the graph with bounded error (ε); results in better compression.
• Example: the original graph on nodes a to j costs 14 edges. With supernodes X = {d,e,f,g} and Y = {a,b,c} and corrections +(a,h), +(c,i), +(c,j), -(a,d), the cost is 5 (1 superedge + 4 corrections).

Compress (cont.)
• Summary S(V_S, E_S):
  • Each supernode v represents a set of nodes A_v.
  • Each superedge (u,v) represents all pairs of edges π_uv = A_u × A_v.
• Corrections C: {(a,b); a and b are nodes of G}. In the example, C = {+(a,h), +(c,i), +(c,j), -(a,d)}.
• Supernodes are key; superedges and corrections are easy:
  • E_uv: the actual edges of G between A_u and A_v.
  • Cost with (u,v) = 1 + |π_uv − E_uv|
  • Cost without (u,v) = |E_uv|
  • Choosing the minimum decides whether superedge (u,v) is in S.

Reconstruct
Reconstructing the graph from R:
• For all superedges (u,v) in S, insert all pairs of edges π_uv.
• For all positive corrections +(a,b), insert edge (a,b).
• For all negative corrections -(a,b), delete edge (a,b).

Approximate representation R_ε
[Figure: example summary with X = {d,e,f,g} and Y = {a,b}]
• Recreating the input graph exactly is not always necessary.
• A reasonable approximation is enough to compute communities, anomalous traffic patterns, etc.
• Use the approximation leeway to get further cost reduction. Example corrections: C = {-(a,d), -(a,f)}.

Generic neighbor query
• Given node v, find its neighbors N_v in G.
• The approximate neighbor set N'_v estimates N_v with ε-accuracy.
• Bounded error: error(v) = |N'_v − N_v| + |N_v − N'_v| < ε |N_v|
• The number of neighbors added or deleted is at most an ε-fraction of the true neighbors.

Intuition for computing R_ε
• If correction (a,d) is deleted, it adds error for both a and d.
• From the exact representation R for G, remove the maximum number of corrections such that the ε-error guarantees still hold.
• For ε = 0.5, we can remove one correction of a.

Main results: cost reduction
• Reduces the cost down to 40%.
• Cost of GREEDY is 20% lower than RANDOMIZED.
• RANDOMIZED is 60% faster than GREEDY.

Comparison with other schemes
• The techniques give much better compression.

Approximate representation
• Cost reduces linearly as ε is increased.
• With ε = 0.1, 10% cost reduction over R.

Mining graph patterns efficiently via randomized summaries
Motivation
• In a graph with a large number of identically labeled vertices, graph isomorphism becomes a bottleneck.
• How to avoid enumerating identical patterns?
[Figure: 3 (triangular) × 4 (square) = 12 (total)]

Solution framework
• Summarization -> Mining -> Verification
• Raw DB -> Summarized DB -> Raw DB

Reduce false positives
• Technique 1: merge vertices that are far away from each other, measured by
  • the length of the shortest path
  • the probability of a random walk
• Technique 2: merge vertices whose neighborhoods overlap, measured by
  • Cosine, Chi^2, Lift, Coherence
• Technique 3: go back to the raw database to do verification. It is then guaranteed that there are no false positives.
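Of the neighborhood-overlap measures listed under Technique 2, cosine is the simplest to write down. The sketch below shows one common set-based form of it; the function name `neighbor_cosine`, the adjacency-dictionary input, and the choice to return 0 for isolated vertices are assumptions for illustration, and the slides do not specify which exact variant the paper uses.

```python
import math


def neighbor_cosine(adj, u, v):
    """Cosine overlap of two neighbourhoods: |N(u) & N(v)| / sqrt(|N(u)| * |N(v)|)."""
    nu = adj.get(u, set())
    nv = adj.get(v, set())
    if not nu or not nv:
        return 0.0  # assumed convention for vertices without neighbours
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))


# Hypothetical adjacency lists: u and v share two of their three neighbours.
adj = {
    "u": {"x", "y", "z"},
    "v": {"y", "z", "w"},
    "t": {"q"},
}
print(neighbor_cosine(adj, "u", "v"), neighbor_cosine(adj, "u", "t"))
```

Two vertices with identical neighborhoods score 1 and two vertices with disjoint neighborhoods score 0, so the value can be used to rank candidate vertex pairs for merging.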
Summarization may cause false positives
[Figure: merging vertices labeled a and b creates false embeddings, which lead to false positives]

Summarization: reduce false negatives
[Figure: merging vertices can miss embeddings, which leads to false negatives]
• Technique 1: for a raw database with frequency threshold min_sup, adopt a lower frequency threshold, pseudo min_sup, for the summarized database.
• Technique 2: iterate the mining steps T times and combine the results generated each time.
• It is NOT guaranteed that there are no false negatives, but their probability is bounded.

Connectivity based traversal
• CSV: Cohesive Subgraph Mining, SIGMOD 2008 (discussed in detail in Part II)
• Progressive Clustering of Networks using Structure-Connected Order of Traversal, ICDE 2010

Progressive clustering of networks using structure-connected order of traversal
SCAN algorithm
• Similar to DBSCAN: connectivity-based
• Average O(n) time
• Uses a structural similarity measure, minimum cluster size mu, and minimum similarity epsilon
• Finds outliers and hubs
Problems
• No automated way to find a good epsilon
• Must rerun the algorithm for each possible epsilon
• Epsilon is a global threshold
  • No hierarchical clusters
  • No variation in cluster subtlety
Solution
• Structure-Connected Order of Traversal (SCOT)
  • Contains all possible epsilon-clusterings
• Efficient method to find a global epsilon
• New Contiguous Subinterval Heap structure (ContigHeap)
• New Progressive Mean Heap Clustering (ProClust)
  • Epsilon-free
  • Hierarchical
• Refinement by gap constraint (GapMerge)
[Figure: original network and its SCOT plot]

Optimal global epsilon
• The SCAN paper only contains a supervised sampling method.
  • Sample points, find k-NN similarities, sort, plot, and find the knee visually.
  • O(nd log n) time, in addition to clustering time.
• Our solution:
  • The knee hypothesis implies an approximately concave plot.
  • The optimal epsilon minimizes the obtuse angle between segments.
  • Modified histogram and binary search: O(n) time.
  • Uses the already computed SCOT result.

ContigHeap
• BuildContigHeap produces a heap containing all contiguous subintervals from the SCOT output in O(n) time, and integrates with SCOT.
[Figure: example]

GapMerge: gap constraint refinement
• Merges chained clusters, i.e. heap branches with single children.
• Does not merge across pruned heap nodes (local maxima boundary).
• The gap constraint prevents clusters whose left or right boundaries differ by more than mu from being merged.
  • Such clusters are not redundant relative to the minimum interesting cluster size.
Steps
1. Identify chains that meet the gap constraint.
2. When a node has more than one child or violates the gap constraint, begin a new chain.
3. Within each chain, calculate the significance of each cluster in both up and down directions.
4. Begin with the most redundant node; merge nodes in the direction of least significance.
5. After each merge, recalculate significances.
6. Continue until the chain contains one node, or no merging is possible under the gap constraint.

MapReduce based approach
• PEGASUS: A Peta-Scale Graph Mining System, ICDM 2009
• Pregel: A System for Large-Scale Graph Processing, SIGMOD 2010

PEGASUS: A Peta-Scale Graph Mining System
• Deals with real graphs such as the Yahoo! Web graph, with up to 6.7 billion edges.
• A Hadoop based graph mining package.
• Targets primitive matrix operations such as generalized iterative matrix-vector multiplication (GIM-V).
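To make the GIM-V primitive concrete before the slides that define it, the sketch below expresses one matrix-vector step through three user-supplied functions and instantiates it for PageRank. It is a single-machine illustration only, not the Hadoop implementation: the function names `gim_v` and `pagerank`, the edge-list and dictionary data layout, and the assumption that every vertex has at least one outgoing link are all choices made here for the example.

```python
from collections import defaultdict


def gim_v(edges, vec, combine2, combine_all, assign):
    """One generalized matrix-vector step.

    edges: iterable of (i, j, m_ij) for the non-zero matrix entries
    vec:   dict id -> current vector value
    """
    partial = defaultdict(list)
    for i, j, m_ij in edges:
        partial[i].append(combine2(m_ij, vec[j]))  # stage 1: combine2
    # stage 2: combineAll per row, then assign the new value
    return {i: assign(vec[i], combine_all(partial[i])) for i in vec}


def pagerank(out_links, c=0.85, tol=1e-10, max_iter=200):
    """PageRank via GIM-V; assumes every vertex has at least one out-link."""
    n = len(out_links)
    # m_ij = 1 / outdeg(j) when j links to i (transposed, normalized adjacency)
    edges = [(i, j, 1.0 / len(targets))
             for j, targets in out_links.items() for i in targets]
    p = {v: 1.0 / n for v in out_links}
    for _ in range(max_iter):
        new_p = gim_v(
            edges, p,
            combine2=lambda m, v: c * m * v,
            combine_all=lambda xs: (1 - c) / n + sum(xs),
            assign=lambda old, new: new,
        )
        if max(abs(new_p[v] - p[v]) for v in p) < tol:
            return new_p
        p = new_p
    return p


ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
print(ranks)
```

Swapping the three functions, for example using `min` in place of the sum, is how the same loop is meant to cover other tasks such as connected components.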
Motivation
• Many graph mining tasks require matrix multiplication: PageRank, Random Walk with Restart (RWR), diameter estimation, connected components, ...
• MapReduce provides a simplified programming concept for large data processing.
  • Details of data distribution, replication, and load balancing are taken care of.
  • Provides a similar programming structure, i.e. functional programming.

GIM-V: Generalized Iterative Matrix-Vector multiplication
• Intuition: matrix-vector multiplication M × v = v', where $v'_i = \sum_{j=1}^{n} m_{i,j} v_j$.
• The computation consists of three operations:
  • combine2: multiply m_{i,j} and v_j.
  • combineAll: sum the n multiplication results for row i.
  • assign: overwrite the previous value of v_i with the new result.
• The operator ×_G is matrix multiplication expressed by the above 3 steps.
• ×_G is carried out iteratively until convergence.

×_G and SQL
• The matrix multiplication operation can be expressed by an SQL query.
• If we view the graph as two tables, an edge table E(sid, did, val) and a vector table V(id, val), then ×_G becomes:
    SELECT E.sid, combineAll_{E.sid}(combine2(E.val, V.val))
    FROM E, V
    WHERE E.did = V.id
    GROUP BY E.sid

Generalize ×_G
• Vary the definitions of the three steps to generalize ×_G.
• PageRank: $p = (cE^T + (1 - c)U)p$, where E is the row-normalized adjacency matrix, U is a matrix with all elements equal to 1/n, and c is the damping factor (0.85).
  • combine2 = $c \times m_{i,j} \times v_j$
  • combineAll = $\frac{1 - c}{n} + \sum_{j=1}^{n} x_j$

Generalize (cont.)
By altering the three functions, GIM-V adapts to
• Random Walk with Restart
• Diameter estimation
• Connected components

GIM-V: how
• Stage 1: combine2. V: key = id, value = vval; E: key = idsrc.
• Stage 2: combineAll and assign.
• Bottleneck: shuffling and disk I/O.

GIM-V block multiplication (BL)
Advantages
• Saves on sorting
• Data compression
• Clustered edges

Block advantage (cont.)
• Clustered edges [Figure]

GIM-V DI: diagonal block iteration
• Intuition: increase the multiplication done inside an iteration to reduce the number of iterations.
• How: reach local convergence within a block first, before iterating further.
• Compare GIM-V BL and DI. [Figure]

Main results
• Scalability: GIM-V BL DI is ~5 times faster than GIM-V Base.

Main results (cont.)
• Evolution of LinkedIn: the distribution of connected components is stable after a 'gelling' point in 2003.
• Bimodal structure of radius.

Pregel: A System for Large-Scale Graph Processing
• A scalable and fault-tolerant platform with an API that is sufficiently flexible to express arbitrary graph algorithms.
• Model of computing: vertex-centric, synchronized iterative model.

Graph algorithm implementation in Pregel
• Graph data stay on their respective machines; only messages are passed, NO graph state is passed.

Pregel C++ API
• Compute(): executed at each active vertex in every superstep.
  • Query information about the current vertex and its edges.
  • Send messages to other vertices.
  • Inspect or modify the value associated with its vertex/out-edges.
  • State updates are visible immediately.
  • No data races on concurrent value access from different vertices.
• Limiting the graph state managed by the framework to a single value per vertex or edge simplifies the main computation cycle, graph distribution, and failure recovery.

Pregel C++ API (cont.)
• Message passing
  • No guaranteed order, but messages will be delivered, without duplication.
• Combiners
  • Combine several messages to reduce overhead.
• Aggregators
  • Mechanism for global communication, monitoring, and data.
  • A number of predefined aggregators, such as min, max, or sum operations.
• Topology mutation
  • Change the graph topology; resolve conflicts when individual vertices send conflicting messages.
• Input and output
  • Readers and writers

Pregel implementation
• Designed for the Google cluster architecture
  • Each cluster consists of thousands of commodity PCs.
• Persistent data
  • Stored in files on distributed storage systems such as GFS or BigTable.
• Temporary data
  • Stored as buffered messages on local disk.

Pregel: assigning load
• Divide graph vertices into partitions and assign them to different machines.
  • Controllable by users; default method: hash.
• In the absence of faults:
  • One master, many workers on a cluster of machines.
  • The master assigns load jobs and I/O, and instructs workers on supersteps.
• Fault tolerance:
  • Uses checkpoints; the master pings workers.
  • Confined recovery (under development): workers log outgoing messages.

Graph applications
• PageRank
• Shortest path
• Bipartite matching
• Semi-clustering

Pregel: main result
[Figure: experimental plots]

Reference (partial)
• J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. KDD, 2005.
• L. B. Holder, D. J. Cook, and S. Djoko. Substructure Discovery in the SUBDUE System. PWKDD, 1994.
• Y. Tian, R. A. Hankins, and J. M. Patel. Efficient Aggregation for Graph Summarization. SIGMOD 2008.
• S. Navlakha, R. Rastogi, and N. Shrivastava. Graph Summarization with Bounded Error. SIGMOD 2008.
• C. Chen, C. X. Lin, M. Fredrikson, M. Christodorescu, X. Yan, and J. Han. Mining graph patterns efficiently via randomized summaries. VLDB 2009.
• D. Bortner and J. Han. Progressive Clustering of Networks using Structure-Connected Order of Traversal. ICDE 2010.
• U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A Peta-Scale Graph Mining System. ICDM 2009.
• K. Yoshida, H. Motoda, and N. Indurkhya. Graph based induction as a unified learning framework. Applied Intelligence, volume 4, 1994.
• A. Inokuchi, T. Washio, and H. Motoda. Complete mining of frequent patterns from graphs: Mining graph data. Mach. Learn., 50(3):321-354, 2003.

Reference (cont.)
• M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM, pages 313-320, 2001.
• X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. ICDM 2002.
• WARMR: L. Dehaspe and H. Toivonen. Discovery of frequent datalog patterns. Data Mining and Knowledge Discovery, 3(7-36), 1999.
• FARMAR: S. Nijssen and J. Kok. Fast association rules for multiple relations. Data Mining and Knowledge Discovery, 3(7-36), 1999.

Roadmap
• Part I (1.5 hrs)
  • Graph mining primer
  • Recent advances in massive graph mining
• Part II (1.5 hrs)
  • CSV: cohesive subgraph mining
  • DNgraph mining: a triangle based approach

CSV
1. Cohesive sub-graph mining, with visualization
2. Existing approaches
3. CSV provides an effective visual solution
  • Algorithm principle
  • Connectivity estimation
4. Experimental study

Existing solutions
1. Current state-of-the-art methods to abstract information from huge graphs (information: yes; structure: no)
  1. Graph partition algorithms
    • Spectral clustering [Ng01]: high computational cost
    • METIS [Karypis96]: favors balanced patterns
  2. Graph pattern mining algorithms
    • CODENSE [Hu05], CLAN [Zeng06]: exponential running time
2. Graph layout tools (information: no; structure: yes)
  • Osprey [Breitkreutz03], VisANT [Mellor04]: do not have mining capability
We want structured information.

CSV: general approach
• Separate the vertices in the graph into VISITED and UNVISITED.
• Start: pick a vertex and add it to VISITED.
• Repeat until UNVISITED is empty:
  • Among all vertices in UNVISITED, pick the vertex V most highly connected to VISITED.
  • Plot V's connectivity.
  • Add V to VISITED.
But how do we measure connectivity?

Connectivity measurement
• Connectivity measurement is closely related to clique (fully connected sub-graph) size.
• The connectivity between two vertices in a graph (ηmax) is defined to be the size of the biggest clique in the graph such that both are members of the clique.
• The "connectivity" of a vertex (ζmax) is similarly defined as the size of the biggest clique it can participate in.
[Figure: example graphs on vertices a to e with ηmax(a, d) = 0, ηmax(a, c) = 4, ζmax(a) = 5]

CSV: step by step
From graph to plot. [Figure: synthetic graph on vertices A to J, a heap of unvisited neighbors, and the connectivity plot built so far; vertices are marked as unvisited, visiting, or visited]
• Start from A and explore A's neighbor B. Calculate ζmax(A) = 2 and output it.
• Mark A visited. From B, explore B's immediate neighbors C, F, H. Calculate ηmax(A, B) = 2 and output it.
• Mark B visited and choose the most closely connected vertex, C, as the next vertex to visit.
• From C, explore C's immediate neighbors D, F, G, H, and update ηmax when necessary. Calculate ηmax(B, C) = 4 and output it.
• Visit every vertex accordingly to produce a plot. Peaks represent cohesive sub-graphs.
[Figure: final plot with vertex order A B C H F G D E I J; the peak marks the cohesive sub-graph]

Important theorem
[Figure: theorem statement]

Connectivity computation is a hard problem
• When graphs are huge, exact computation of connectivity is prohibitive.
• Direct computation is costly:
  • The exact algorithm relies on clique detection (NP-hard).
  • Even approximation is hard.
• Solution part 1: spatial mapping
  • Pick k pivots.
  • Map the graph into a k-dimensional space based on each vertex's shortest distance to the pivots.
  • A clique will map into the same grid cell.
[Figure: graph on vertices A to J mapped onto a grid with pivot axes P0 and P1]

Connectivity computation
• Solution part 2: approximate upper bound for ζmax(v) and ηmax(v, v')
  • Each vertex in a clique of size k must have
    • degree at least k-1
    • k-1 neighbors with degree at least k-1
  • For each vertex v, find its immediate neighbors in the same grid cell and construct a sub-graph.
  • Iteratively readjust the estimate.
• Example: estimate ηmax(a, f).
  • Locate the immediate neighborhood of a and f: {a, b, c, d, e, f, g}.
  • After sorting the degree array in descending order, we have 6(a), 6(f), 5(d), 4(b), 4(c), 4(e), 3(g).
  • Test candidate clique sizes: 7? 6? 5?

Experimental study on real datasets
• DBLP: co-authorship graphs.
  • DBLP: 2819 vertices, 54990 edges.
  • Peaks in the DBLP CSV plot represent different research groups, e.g. two groups of German researchers.

SMD: stock market data
• Peaks in the SMD CSV plot represent highly cohesive stocks.
[Figure: two partial cliques joined by a bridging vertex]

DIP: Database of Interacting Proteins
[Figure: CSV plot for DIP with peaks labeled by protein names, including LSM proteins (LSM2 to LSM8), PRP proteins (PRP4, PRP6, PRP8, PRP31, PRP40), and the proteins CFT1, CFT2, YSH1, PTA1, MPE1, PFS2, FIP1, YTH1]
• Supporting evidence from C. G. Noble, B. Beuth, and I. A. Taylor, Structure of a nucleotide-bound Clp1-Pcf11 polyadenylation factor, Nucleic Acids Res. 2007 January; 35(1): 87-99: "CPF is also required in both the cleavage and polyadenylation reactions. It contains a core of eight subunits Cft1, Cft2, Ysh1, Pta1, Mpe1, Pfs2, Fip1 and Yth1."

Experimental study: CSV as a pre-selection step
How?
• Apply CSV to identify potential cohesive sub-graphs first.
• Run the exact algorithm CLAN on these candidates.
Result
• Gets the same exact cohesive sub-graphs as running CLAN alone.
• Saves 28-84% of the time compared to running CLAN alone.

DNgraph mining: a triangle based approach
• Mining dense patterns out of an extremely large graph
  • When the graph is extremely large, it is difficult even to mine dense patterns.
• An iterative improvement mining approach is more desirable.
  • Users are able to obtain the most updated results on demand.
• Dense patterns have a strong connection with the triangles inside a graph.
• This has already been observed and explained by the preferential attachment property of large-scale graphs.

DNgraph mining: a triangle based approach (cont.)
• What makes a pattern dense?
  Intuitively:
  • A collection of vertices with high relevance.
  • They share a large number of common neighbors.
• With that we propose the definition of DNgraph:
  • A DNgraph is the largest sub-graph sharing the most neighbors.
  • Require each connected vertex pair to share at least λ neighbors.
[Figure: example graph on vertices A to F and A' with λ(G) = 3, λ(G_A') = 0]

Compare DNgraph with other dense pattern definitions
• Two interesting patterns: a 4-clique and a Turán graph T(14, 4) [14 vertices, 4 groups, fully connected between groups].
• If mining quasi-cliques, we may end up discovering 1 pattern, as in (d).
• If searching for closed cliques, we may only find (e).

DNgraph mining: challenge
• Finding common neighbors for every pair of connected vertices is expensive.
  • Requires O(E) join operations.
  • Needs random disk access.
  • In fact, finding a DN-graph is an NP problem.
• Solution
  • Use the triangles that two vertices participate in to approximate the common neighbor size.
  • Iteratively refine the approximation following the locality of graph edges.

DNgraph mining: how
1. Initially, count the number of triangles each edge participates in.
  • Sort vertices and their neighbors in descending order of their degrees.
  • Scan the graph to get the number of triangles for every vertex.
  • The number of triangles sets the initial value of λ.
2. Next, iteratively refine λ for every vertex.
  • Using streams of triangles.
  • Iteratively refine λcur.

Triangle counting: how?
1. Sort vertices and their neighbors in descending order of their degrees.

  Before sorting        After sorting
  a: b d e              e: d b a c g f
  b: a c d e            d: e a c g h
  c: b d e              b: e d a c
  d: a c e g h          a: e d b
  e: a b c d g f        c: e d b
  f: e g                g: e d f
  g: d e f              f: e g
  h: d                  h: d

Triangle counting (cont.)
2. Join neighborhoods to get the triangle count for every edge.
  • The two vertices exhibit locality, due to reordering and the preferential attachment property of large graphs.
  • Example: joining the neighbor lists of e (d b a c g f) and d (e a c g h) gives 3 common neighbors, i.e. 3 triangles for edge (e, d).
3. Use that count as the initial λ value for every edge/vertex.
  • A vertex's λ value is the maximal λ value among the edges it participates in.
  • Example: λcur(e) = 3, λcur(d) = 3.
[Figure: example graph on vertices a to h]

Triangle stream
• Follow the same order of visiting the graph as during triangle counting.
• Triangles are not materialized, saving storage.
[Figure: triangles over edge (a, b) with third vertices n1, n2, ..., nx, and λ = k]

Iteratively refine λ
• For every vertex v, when its triangles arrive, bound λcur(v) using the λcur of the two other vertices.
• Repeat until the λcur of all vertices has converged.
[Figure: example graph with λcur(e) = 3 and λcur(b) = 3]

DNgraph: experiment
• Large-scale graph: Flickr dataset with 1,715,255 vertices and 22,613,982 edges.
• 1 iteration requires 1 hour on a workstation with a Quad-Core AMD Opteron(tm) processor 8356, 128GB RAM and 700GB hard disk.
• Converges in 66 iterations; almost stable after 35 iterations.

Advantages
• Abstraction within the triangulation algorithm: the abstraction ensures our approach's extensibility to different input settings.
• Iteratively refined results: the estimation of the common neighborhood improves with every iteration, so users are able to obtain the most updated results on demand.
• Pre-collection of statistics to support effective buffer management.
• The process can be easily mapped to key->value pairs for further distributed processing.

Reference (partial)
[Hu05] H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou. Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics, 21(1):213-221, 2005.
[Ng01] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, volume 14, 2001.
[Karypis96] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (CDROM), page 35, Washington, DC, USA, 1996. IEEE Computer Society.
[Breitkreutz03] B. J. Breitkreutz, C. Stark, and M. Tyers. Osprey: a network visualization system. Genome Biology, 4, 2003.
[Mellor04] Z. Hu, J. Mellor, J. Wu, and C. DeLisi. VisANT: an online visualization and analysis tool for biological interaction data. BMC Bioinformatics, 5:17-24, 2004.
[Zeng06] J. Wang, Z. Zeng, and L. Zhou. CLAN: An algorithm for mining closed cliques from large dense graph databases. Proceedings of the International Conference on Data Engineering, page 73, 2006.
[Turan41] P. Turán. On an extremal problem in graph theory. Mat. Fiz. Lapok, 48:436-452, 1941.
[Ankerst99] M. Ankerst, M. Breunig, H. P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 49-60, Philadelphia, PA, June 1999.
[DNgraph10] On Triangle based DNgraph Mining. NUS technical report TRB4/10.