VLDB Database School (China) 2010
August 3-7, 2010, Shenyang
Lecture Notes
Part 1
Mining and Searching Complex
Structures
Anthony K.H. Tung(邓锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung
Mining and Searching Complex Structures
Contents
Chapter 1: Introduction
Chapter 2: High Dimensional Data
Chapter 3: Similarity Search on Sequences
Chapter 4: Similarity Search on Trees
Chapter 5: Graph Similarity Search
Chapter 6: Massive Graph Mining
Mining and Searching Complex Structures Chapter 1 Introduction
Mining and Searching Complex
Structures
Introduction
Anthony K. H. Tung(鄧锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
Social Network Link: http://www.renren.com/profile.do?id=313870900
What is data mining?
Really nothing different from what scientists had been doing for
[Figure: Real World -> collect data and verify or construct model of real world -> generate useful data model; a correct model earns a Nobel Prize]
[Figure: Feed in data -> output most likely model based on some statistical measure]
What’s new?
Systematically and efficiently test many statistical models
Components of data mining
Structure of model
geneA=high and geneB=low ===> cancer
geneA, geneB and geneC exhibit strong correlation
Statistical Score for the model
Accuracy of rule 1 is 90%
Similarity function: Is a group of records sufficiently similar to
support a certain model or hypothesis?
Search method for the correct model parameters
Given 200 genes, there could be 2^200 rules. Which rule gives the
best predictive power?
Database access method
Given 1 million records, how to quickly find relevant records to
compute the accuracy of a rule?
The Apriori Algorithm
• Bottom-up, breadth-first search
• Only reads are performed on the database
• Store candidates in memory to simulate the lattice search
• Iteratively follow the two steps:
–generate candidates
–count and get actual frequent items
[Figure: itemset lattice over items a, b, c, e, searched bottom-up from {}]
a,b,c,e
a,b,c   a,b,e   a,c,e   b,c,e
a,b   a,c   a,e   b,c   b,e   c,e
a   b   c   e
{} (start)
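The two iterated steps can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function name `apriori`, the toy `transactions` and the `min_support` value are all assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that occurs in the database.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count step: one read-only pass over the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Generate step: join frequent k-itemsets into (k+1)-candidates and
        # prune candidates that have an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical toy database over the items a, b, c, e of the lattice.
transactions = [{"a", "b", "c", "e"}, {"a", "b", "c"}, {"a", "c", "e"}, {"b", "e"}]
result = apriori(transactions, min_support=2)
```

With this toy data the candidate a,b,c,e is never counted, because its subset a,b,e is already known to be infrequent.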
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in 4
steps:
–Step 1: Partition objects into k nonempty subsets
–Step 2: Compute seed points as the centroids of the clusters of the
current partition. The centroid is the center (mean point) of the
cluster.
–Step 3: Assign each object to the cluster with the nearest seed point.
–Step 4: Go back to Step 2; stop when no assignment changes.
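The four steps can be sketched as follows. This is a minimal pure-Python sketch under stated assumptions: the round-robin initial partition, the `max_iter` cap, the empty-cluster guard and the sample `points` are choices made for the example, and it assumes there are at least k points.

```python
def kmeans(points, k, max_iter=100):
    # Step 1: an arbitrary initial partition into k nonempty subsets
    # (round-robin; assumes len(points) >= k).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: the centroid of each cluster is the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties out
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
        # Step 3: assign each point to the nearest centroid (squared Euclidean).
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical sample: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, k=2)
```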
The K-Means Clustering Method
• Example
[Figure: four scatter-plot panels, each on 0-10 by 0-10 axes, illustrating the example]
Training Dataset (Decision Tree)
Outlook Temp Humid Wind PlayTennis
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
Selecting the Next Attribute
S=[9+,5-], E=0.940
Humidity
High: [3+, 4-], E=0.985
Normal: [6+, 1-], E=0.592
Gain(S,Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
Wind
Weak: [6+, 2-], E=0.811
Strong: [3+, 3-], E=1.0
Gain(S,Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
Selecting the Next Attribute
S=[9+,5-], E=0.940
Outlook
Sunny: [2+, 3-], E=0.971
Overcast: [4+, 0-], E=0.0
Rain: [3+, 2-], E=0.971
Gain(S,Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
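The entropy and gain figures on these slides can be reproduced from the PlayTennis training table. The sketch below is illustrative only; the names `ROWS`, `ATTRIBUTES`, `entropy` and `gain` are assumptions, and the results differ from the slide values in the third decimal because the slides work with rounded entropies.

```python
from collections import Counter
from math import log2

# (Outlook, Temp, Humid, Wind, PlayTennis), copied from the training table.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRIBUTES = {"Outlook": 0, "Temp": 1, "Humid": 2, "Wind": 3}

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attribute):
    """Information gain of splitting rows on the given attribute."""
    col = ATTRIBUTES[attribute]
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder

gains = {a: gain(ROWS, a) for a in ATTRIBUTES}
best = max(gains, key=gains.get)  # the attribute a greedy learner splits on first
```

Calling `gain` again on the subset of rows with Outlook = Sunny gives the second-level choice in the same way.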
ID3 Algorithm
Outlook splits [D1,D2,…,D14], [9+,5-]:
Sunny: Ssunny=[D1,D2,D8,D9,D11], [2+,3-] -> ?
Overcast: [D3,D7,D12,D13], [4+,0-] -> Yes
Rain: [D4,D5,D6,D10,D14], [3+,2-] -> ?
Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(Ssunny, Temp.) = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
Decision Tree for PlayTennis
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
Can we fit what we learn into the framework?
                             Apriori                     K-means                  ID3
task                         rule/pattern discovery      clustering               classification
structure of the model       association rules           clusters                 decision tree
or pattern
search space                 lattice of all possible     choice of any k points   all possible combinations
                             combinations of items,      as centers,              of decision trees,
                             size = 2^m                  size = infinity          size = potentially infinity
score function               support, confidence         square error             accuracy, information gain
search/optimization method   breadth first with pruning  gradient descent         greedy
data management technique    TBD                         TBD                      TBD
Components of data mining(II)
Models Enumeration
Algorithm
Statistical Score Function
Similarity/Search Function
Database Access Method
Database
Background knowledge
• We assume you have some basic knowledge of data
mining; some of the slides here will be very useful for this
purpose
• Association Rule Mining
http://www.comp.nus.edu.sg/~atung/Renmin56.pdf
• Classification and Regression
http://www.comp.nus.edu.sg/~atung/Renmin67.pdf
• Clustering
http://www.comp.nus.edu.sg/~atung/Renmin78.pdf
IT Trend
Processors are cheap and will become cheaper(multi-core processor,
graphic cards)
Storage will be cheap but might not be fast
Bandwidth will be growing
What can we do with this?
Play more realistic games!
Not exactly a joke since any technologies that speed up games can speed up scientific
simulation
Smarter (more intensive) computation
Can store more personal semantic/ontology
People can collaborate more over the Internet (Flickr, Wikipedia) to make
things more intelligent
The AI dream now has the support of much better hardware
Essentially, data mining can be made much simpler for the man
on the street
Data mining should be human-centered, not machine-centered
2010-7-31
What is complex data?
What is “simple” data?
A regular table with a small number of attributes (of the same type) and no missing values.
What are complex data?
High dimensional data: Lots of attributes with different data types with missing values
Test1 Gene1 Progress comments
Pos 2.0 Fever ……
Neg -0.3 Unconscious
N.A 5.7
Sequences/time series, Trees, Graphs
Why complex data?
They come naturally in many applications and bring research nearer to the real
world
Lots of challenges, which mean more fun!
Some fundamental challenges:
How do you compare complex objects effectively and efficiently?
How do you find special subsets in the data that are interesting?
What new types of models and score functions must you use?
How do you handle noise and error?
[Figure: a mixed-type table with missing values (Test1, Gene1, Progress comments) and two labeled trees T1 and T2]
Personalized Semantic for Personal Data Management
everyone will own terabytes of data soon
improve query/search interface by mining and extracting personalized semantics like
entities and their relationship etc. by comparing them against high quality tagged databases
[Figure: three-layer architecture.
Query interfaces: query by documents, query by audio/music, query by video, query by photographs/images
High-quality data sources (e.g. Wikipedia) feeding a semantic layer of entities: singers, authors, actors/actresses, songs, papers, places, movies
Personal data: documents, audio/music, video, photographs/images, webpages/blogs/bookmarks]
Integrated Approach to Mining Software Engineering Data
software engineering data: code base, change history, bug reports, runtime trace
integrated into a data warehouse to support decision making and mining,
Example: Which code module should I modify to create a new function? Which
module needs maintenance?
[Figure: layered architecture, top to bottom.
Software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance, …
Data mining techniques: association/patterns, classification, clustering, …
Data Warehouse
Software engineering data: code bases, change history, program states, structural entities, bug reports/nl, …]
WikiScience
Web 2.0: Facebook for scientists
Collaborative platform for scientists to build scientific models/hypotheses and share
data, applications
[Figure: a centralized hybrid Model C, constructed by the system, linking Model A and Model B. Model A has a supporting dataset tagged to it (“This is my model of the solar system, based on my supporting dataset”). Model B has supporting articles tagged to it (“Based on some articles, I make some changes to Model A to create Model B”).]
Hey, why not Cloud Computing, Map/Reduce?
• These are platforms for scaling up services to a large
number of users on a large amount of data
• But what exactly do you want to scale up?
• Services that provide useful and semantically
correct information to the users
• We have too many scalable data mining
algorithms that find nothing or too many things
• Let’s focus on finding useful things first
(assuming we have lots of processing power) and
then try to scale it up
Schedule of the Course
Date/Time Content
Lesson 1 Introduction
Lesson 2 Mining and Search High Dimensional Data I
Lesson 3 Mining and Search High Dimensional Data II
Lesson 4 Mining and Search High Dimensional Data III
Lesson 5 Similarity Search for Sequences and Trees I
Lesson 6 Similarity Search for Sequences and Trees II
Lesson 7 Similarity Search for Graph I
Lesson 8 Similarity Search for Graph II
Lesson 9 Similarity Search for Graph III
Lesson 10 Mining Massive Graph I
Lesson 11 Mining Massive Graph II
Lesson 12 Mining Massive Graph III
Focus of the course
• Techniques that can handle high dimensional, complex
structures
–Providing semantics to similarity search
–Shotgun and Assembly: Column/Feature Wise Processing using
Inverted Index
–Row-wise Enumeration
–Using local properties to infer global properties
• Throughout the course, please try to think of how these
techniques are applicable across different types of complex
structures
Databases Queries
To start off, we will consider something very basic called
ranking queries, since we need ranking in any similarity search
(usually from most similar to most dissimilar)
In relational database, SQL returns all results at one go
How many tuples can be fitted in one screen?
How many tuples can you remember?
Options:
Summarize the results
Display representative tuples
How to select representative tuples?
Retrieve Relevant Information
Search videos related to Shanghai Expo
Too many results: as long as you click “next”, there are 20
more new results
Are we interested in all results?
No, only most relevant ones
Search engines have to rank the results, and the ranking is what they
make money from
Question: How to Select a Small Result Set
Selecting the most representative or most interesting results
is not trivial
Find an apartment with rental cheaper than 1000, the
cheaper the better
The result tuples can be sorted in the ascending order of rental prices,
those in front are more favorable
Find an apartment with rental cheaper than 1000 near NEU,
the lower the better, the nearer the better
Apartment with lower rent may not be near, nearer one may not be
cheap
Order by prices? Order by distances?
Top-k Queries
Define a scoring function, which maps a tuple to a real
number, as a score
The higher the score is, the more favorable the tuple is
Define an integer k
Answer: k objects with highest scores
Different scoring function may give different top-k result
             Price   Distance to NEU
Apartment A  $800    500 meters
Apartment B  $1200   200 meters
Given k = 1, price and distance are both costs, so here the lower
aggregate wins: if the score function is defined as the sum of price and
distance, the first tuple is better (1300 vs. 1400); if it is defined as
the product, the second tuple is better (240,000 vs. 400,000)
Brute Force Top-k
Compute scores for each result tuple
Sort the tuples according to the descending order of the
scores
Select the first k tuples
What if the number of tuples is unlimited? Search engines
can give unlimited number of results
Even if the number of tuples is limited, it is too slow to
compute score for each tuple
We have to do it efficiently
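The brute-force procedure is short enough to write down directly. The sketch below is illustrative; the function name, the dictionary layout of `apartments` and the choice to negate the two cost attributes (so that a higher score stays more favorable) are assumptions made for the example.

```python
def brute_force_top_k(tuples, score, k):
    """Score every tuple, sort by descending score, keep the first k."""
    return sorted(tuples, key=score, reverse=True)[:k]

apartments = [
    {"name": "A", "price": 800, "distance": 500},
    {"name": "B", "price": 1200, "distance": 200},
]

# Price and distance are costs, so each aggregate is negated to make
# "higher score" mean "more favorable".
by_sum = brute_force_top_k(apartments, lambda t: -(t["price"] + t["distance"]), 1)
by_product = brute_force_top_k(apartments, lambda t: -(t["price"] * t["distance"]), 1)
```

Every tuple is scored and sorted here, which is exactly the cost the following algorithms try to avoid.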
Outline
Two well-known top-k algorithms
Fagin's Algorithm (FA)
The Threshold Algorithm (TA)
Take random access into consideration
No Random Access Algorithm (NRA)
The Combined Algorithm (CA)
Monotonicity
A score function f is monotone if f(x1,x2,...,xm)≤f(y1,y2,...,ym)
whenever xi≤yi for every i
Select top-3 students with highest total score in mathematics,
physics and computer science:
select name, math+phys+comp as score
from student
order by score desc limit 3
sum(x.math,x.phys,x.comp)≤sum(y.math,y.phys,y.comp) if
x.math≤y.math and x.phys≤y.phys and x.comp≤y.comp
Sorted Lists
We shall think of a database consisting of m sorted lists L1,
L2, … Lm
Lmath Lphys Lcomp
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Fagin's Algorithm (I)
Do sequential access on all lists in parallel until at least k objects have been seen on every list
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Sequential accesses are stopped when 3 students have been seen on all three lists, i.e.
Ann, Hugh and Kurt
Fagin's Algorithm (II)
For each object that has been seen, do random accesses on
other lists to compute its score
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Random accesses need to be done for Ben, Carl, Jane and
Ryan
Fagin's Algorithm (III)
Select the k objects with highest score as top-k result
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
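The three phases of FA can be sketched as below, using sum as the monotone score function. This is an illustration under stated assumptions: the slides show only the top five rows of each list, so the values marked "assumed" were invented to complete the table, and the names `SCORES`, `sorted_lists` and `fagin` are hypothetical. The sketch assumes k does not exceed the number of objects.

```python
# (math, phys, comp) per student. Values marked "assumed" are NOT on the
# slides; they only complete the lists so random access can be simulated.
SCORES = {
    "Ann": (98, 92, 95),
    "Ben": (96, 88, 93),   # phys assumed
    "Kurt": (93, 91, 96),
    "Hugh": (91, 97, 92),
    "Carl": (90, 85, 86),  # phys and comp assumed
    "Ryan": (89, 94, 90),  # math and comp assumed
    "Jane": (88, 89, 95),  # math assumed
}

def sorted_lists(scores):
    """One list of object names per attribute, in descending attribute order."""
    m = len(next(iter(scores.values())))
    return [sorted(scores, key=lambda o, i=i: (-scores[o][i], o)) for i in range(m)]

def fagin(scores, k, f=sum):
    lists = sorted_lists(scores)
    m = len(lists)
    seen = {}  # object -> set of list indexes where it was seen sequentially
    depth = 0
    # Phase 1: sequential access until k objects have been seen on every list.
    while sum(1 for s in seen.values() if len(s) == m) < k:
        for i in range(m):
            seen.setdefault(lists[i][depth], set()).add(i)
        depth += 1
    # Phase 2: random access for every attribute still missing.
    random_accesses = sum(m - len(s) for s in seen.values())
    totals = {obj: f(scores[obj]) for obj in seen}
    # Phase 3: the k seen objects with the highest scores.
    top = sorted(totals, key=lambda o: (-totals[o], o))[:k]
    return top, depth, random_accesses

top, depth, random_accesses = fagin(SCORES, k=3)
```

On this data the sequential phase stops at depth 5, and random accesses are needed only for Ben, Carl, Jane and Ryan.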
Why is FA correct? (I)
There are at least k objects seen on all attributes when
sequential access is stopped
By monotonicity, those objects that are not seen do not have
higher score than the above k objects
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Why is FA correct? (II)
For each object that has been seen, either all of its attributes have
been seen, or random accesses are performed to obtain the remaining
attributes
The k objects with highest scores are therefore the top-k
result
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
The Threshold Algorithm (I)
Do sequential access on all lists. If an object is seen, do
random access to the other lists to compute its score
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Random accesses on Ann, Hugh and Kurt first, then on Ben
and Ryan
The Threshold Algorithm (II)
Remember the k objects with highest scores, together with
their scores
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Score (Ann) = 285
Score (Hugh) = 280
Score (Kurt) = 280
The Threshold Algorithm (III)
• Let the threshold value τ be the score function applied to the last-seen
values on all sorted lists
• Halt as soon as at least k objects have scores of at least τ
Ann 98 Hugh 97 Kurt 96    τ(1) = 291
Ben 96 Ryan 94 Ann 95     τ(2) = 285
Kurt 93 Ann 92 Jane 95    τ(3) = 280
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
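TA can be sketched in the same style, again with sum as the score function. As before this is only an illustration: the values marked "assumed" are not on the slides and were invented so that random access can be simulated, and the names `SCORES`, `sorted_lists` and `threshold_algorithm` are hypothetical.

```python
# (math, phys, comp) per student. Values marked "assumed" are NOT on the slides.
SCORES = {
    "Ann": (98, 92, 95),
    "Ben": (96, 88, 93),   # phys assumed
    "Kurt": (93, 91, 96),
    "Hugh": (91, 97, 92),
    "Carl": (90, 85, 86),  # phys and comp assumed
    "Ryan": (89, 94, 90),  # math and comp assumed
    "Jane": (88, 89, 95),  # math assumed
}

def sorted_lists(scores):
    """One list of object names per attribute, in descending attribute order."""
    m = len(next(iter(scores.values())))
    return [sorted(scores, key=lambda o, i=i: (-scores[o][i], o)) for i in range(m)]

def threshold_algorithm(scores, k, f=sum):
    lists = sorted_lists(scores)
    m = len(lists)
    buffer = {}  # at most k objects with their full scores
    taus = []
    top = []
    for depth in range(len(scores)):
        last_seen = []
        for i in range(m):
            obj = lists[i][depth]
            last_seen.append(scores[obj][i])
            if obj not in buffer:
                # m-1 random accesses give the full score of a newly seen object.
                buffer[obj] = f(scores[obj])
        # Remember only the k best objects and their scores.
        top = sorted(buffer.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        buffer = dict(top)
        tau = f(last_seen)
        taus.append(tau)
        if len(top) == k and all(score >= tau for _, score in top):
            break
    return [obj for obj, _ in top], taus

top, taus = threshold_algorithm(SCORES, k=3)
```

The buffer never holds more than k objects after each round, which is the bounded-memory property contrasted with FA later in the notes.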
Why is TA correct?
• By monotonicity, those unseen objects do not have higher
score than τ
• For those that have been seen, random accesses are
performed, the k objects with highest scores are therefore the
top-k result
Ann 98 Hugh 97 Kurt 96    τ(1) = 291
Ben 96 Ryan 94 Ann 95     τ(2) = 285
Kurt 93 Ann 92 Jane 95    τ(3) = 280
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Comparing TA with FA
• Number of sequential accesses
At the time FA stops sequential accesses, τ is guaranteed to be no
higher than the scores of the k objects seen on all sorted lists, so TA stops no later than FA
• Number of random accesses
TA requires m-1 random accesses for each object
But FA is expected to perform random accesses on more objects
• Size of buffers used
Buffer used by FA can be unbounded
TA only needs to remember k objects with k scores, and the
threshold value τ
Random Access
Random accesses may be impossible
Text retrieval: sorted lists are results of search engines
Random accesses may be expensive
Sequential accesses on disk are orders of magnitude faster
than random accesses
We need to consider avoiding random accesses or using
as few of them as possible
No Random Access
Without random access, all we know are the upper bounds
Lmath Lphys Lcomp
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Carl’s scores on physics and computer science are not higher
than 89 and 92 respectively
Lower and Upper Bounds
If an object has not been seen on one attribute
Lower bound is 0
Upper bound is the last seen value
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
The lower bound of Carl’s score on physics is 0
The upper bound of Carl’s score on physics is 89
Worst and Best Scores (I)
W (R): The worst possible score of tuple R
B (R): The best possible score of tuple R
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
W (Carl) = 90
B (Carl) = 90 + 89 + 92
Worst and Best Scores (II)
W (R) ≤ Score of R ≤ B (R)
W (R) and B (R) are updated as the values of R are read by sequential
access
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Ann Hugh Kurt
W 98 97 96
B 291 291 291
Worst and Best Scores (II)
W (R) ≤ Score of R ≤ B (R)
W (R) and B (R) are updated as the values of R are read by sequential
access
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Ann Hugh Kurt Ben Ryan
W 98→193 97 96 96 94
B 291→287 291→288 291→286 285 285
Worst and Best Scores (II)
W (R) ≤ Score of R ≤ B (R)
W (R) and B (R) are updated as the values of R are read by sequential
access
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Ann Hugh Kurt Ben Ryan Jane
W 193→285 97 96→189 96 94 95
B 287→285 288→285 286→281 285→283 285→282 280
No Random Access Algorithm (I)
Maintain the last-seen values x1,x2,…,xm
For every seen object, maintain its worst possible score, its
known attributes and their values
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
xmath = 96; xphys = 94; xcomp = 95
Ann: 193: {math: 98; comp: 95}
No Random Access Algorithm (II)
Why not maintain the best possible score for each object?
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Ann Hugh Kurt Ben Ryan Jane
W 193→285 97 96→189 96 94 95
B 287→285 288→285 286→281 285→283 285→282 280
Too Frequently Updated!
No Random Access Algorithm (III)
Let M be the kth largest W value
An object R is viable if B (R) ≥ M
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Ann Hugh Kurt Ben Ryan Jane
W 285 97→188 189→280 96→189 94 95    M = 189
B 285 285→281 281→280 283→280 282→278 280→277
No Random Access Algorithm (III)
Let M be the kth largest W value
An object R is viable if B (R) ≥ M
Ann 98 Hugh 97 Kurt 96
Ben 96 Ryan 94 Ann 95
Kurt 93 Ann 92 Jane 95
Hugh 91 Kurt 91 Ben 93
Carl 90 Jane 89 Hugh 92
... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Ann Hugh Kurt Ben Ryan Jane
W 285 188→280 280 189 94 95→184    M = 280
B 285 281→280 280 280→278 278→276 277→274
No Random Access Algorithm (IV)
Let set T contain objects with W (R) ≥ M
Halt when
There are at least k objects seen on all sorted lists
No viable objects left outside set T
Ann Hugh Kurt Ben Ryan Jane
W 285 188→280 280 189 94 95→184    M = 280
B 285 281→280 280 280→278 278→276 277→274
T = {Ann, Hugh, Kurt}
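The bookkeeping of W, B, M and T can be sketched using only the five rows shown on the slides, since NRA never needs random access. This is a minimal illustration with sum as the score function and 0 as the lower bound of every attribute; the names are hypothetical, and the extra check on never-seen objects (whose best score is the sum of the last-seen values) is an assumption added to make the halting test complete.

```python
L_MATH = [("Ann", 98), ("Ben", 96), ("Kurt", 93), ("Hugh", 91), ("Carl", 90)]
L_PHYS = [("Hugh", 97), ("Ryan", 94), ("Ann", 92), ("Kurt", 91), ("Jane", 89)]
L_COMP = [("Kurt", 96), ("Ann", 95), ("Jane", 95), ("Ben", 93), ("Hugh", 92)]

def nra(lists, k):
    """lists[i] holds (object, value) pairs in descending order of value."""
    m = len(lists)
    known = {}  # object -> {list index: value read by sequential access}
    for depth in range(min(len(lst) for lst in lists)):
        last = []
        for i, lst in enumerate(lists):
            obj, value = lst[depth]
            known.setdefault(obj, {})[i] = value
            last.append(value)
        if len(known) < k:
            continue
        # W: unseen attributes count as 0. B: they take the last-seen value.
        W = {o: sum(v.values()) for o, v in known.items()}
        B = {o: sum(v.get(i, last[i]) for i in range(m)) for o, v in known.items()}
        M = sorted(W.values(), reverse=True)[k - 1]  # k-th largest W
        T = {o for o in known if W[o] >= M}
        fully_seen = sum(1 for v in known.values() if len(v) == m)
        viable_outside = any(B[o] >= M for o in known if o not in T)
        unseen_viable = sum(last) >= M  # an object not seen on any list yet
        if fully_seen >= k and not viable_outside and not unseen_viable:
            return sorted(T), depth + 1, M
    return None  # lists exhausted before the halting condition held

result = nra([L_MATH, L_PHYS, L_COMP], k=3)
```

For clarity the sketch recomputes every W and B in each round; the slides point out that B changes too frequently to be maintained eagerly in practice.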
Why is NRA correct?
W (R) ≤ Score of R ≤ B (R) always holds
If an object R is not viable, then Score of R ≤ B (R) < M, so
there are at least k objects with scores not lower than that of R
Therefore, if there is no viable object outside T and T
contains at least k objects, T is the set of top-k result
Comparing NRA with TA
• Number of sequential accesses
The number of sequential accesses of NRA is at least the last
position of top-k result on all attributes
• Number of random accesses
NRA performs none
• Size of buffers used
TA remembers k objects with k scores, and the threshold
value τ
NRA remembers all viable objects with their scores on all seen
attributes, and the last-seen value on all attributes
How deep can NRA go?
Ann 98 Hugh 97 Kurt 96
Hugh 97 Kurt 96 Ann 95
Ben 60 Ryan 60 Jane 60
Ryan 60 Ben 60 Ben 60
Carl 60 Jane 60 Carl 60
... ... ... ... ... ...
Jane 60 Carl 60 Ryan 60
Kurt 0 Ann 0 Hugh 0
The set T can be identified quickly, but their scores will only
be certain at the end of lists
If we allow a relatively small number of random accesses,
scanning the entire lists can be avoided
The Combined Algorithm (I)
CA combines TA and NRA
cR: the cost of a random access
cS: the cost of a sequential access
h = ⌊cR/cS⌋
Run NRA, but perform random accesses every h steps, as in TA
h = ∞ → never do random access, CA is then NRA
The Combined Algorithm (II)
Ann 98 Hugh 97 Kurt 96
Hugh 97 Kurt 96 Ann 95
Ben 60 Ryan 60 Jane 60
Ryan 60 Ben 60 Ben 60
Carl 60 Jane 60 Carl 60
... ... ... ... ... ...
Jane 60 Carl 60 Ryan 60
Kurt 0 Ann 0 Hugh 0
Random accesses for Ann, Hugh and Kurt quickly find out
the scores for Ann, Hugh and Kurt
The Combined Algorithm (III)
In CA, by doing random accesses, we wish to either
Confirm an object is a top-k result, or
Prune a viable object
As the number of random accesses in CA is limited, various
heuristics can be made to optimize CA in terms of total cost
Reference
• Ronald Fagin, Amnon Lotem, Moni Naor: Optimal
aggregation algorithms for middleware. J. Comput. Syst.
Sci. 66(4): 614-656 (2003)
Mining and Searching Complex Structures Chapter 2 High Dimensional Data
Mining and Searching Complex
Structures
High Dimensional Data
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
–Similarity Function on k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric
Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Sources of High Dimensional Data
• Microarray gene expression
• Text documents
• Images
• Features of Sequences, Trees and Graphs
• Audio, Video, Human Motion Database (spatio-temporal as well!)
Challenges of High Dimensional Data
• Indistinguishable
–Distance between the two nearest points and the two furthest points
could be almost the same
• Sparsity
–As a result of the above, the data distribution is very sparse,
giving no obvious indication of where the interesting
knowledge is
• Large number of combinations
–Efficiency: How do we test so many combinations?
–Effectiveness: How do we understand and interpret so many
combinations?
Similarity Search : Traditional Approach
• Objects represented by multidimensional vectors
Elevation Aspect Slope Hillshade (9am) Hillshade (noon) Hillshade (3pm) …
2596 51 3 221 232 148
…
• The traditional approach to similarity search: kNN query
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
ID d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 Dist
P1 1.1 1 1.2 1.6 1.1 1.6 1.2 1.2 1 1 0.93
P2 1.4 1.4 1.4 1.5 1.4 1 1.2 1.2 1 1 0.98
P3 1 1 1 1 1 1 2 1 2 2 1.73
P4 20 20 21 20 22 20 20 19 20 20 57.7
P5 19 21 20 20 20 21 18 20 22 20 60.5
P6 21 21 18 19 20 19 21 20 20 20 59.8
Deficiencies of the Traditional Approach
• Deficiencies
–Distance is affected by a few dimensions with high dissimilarity
–Partial similarities can not be discovered
• The traditional approach to similarity search: kNN query
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
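ID d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 Dist
P1 1.1 100 1.2 1.6 1.1 1.6 1.2 1.2 1 1 99.0
P2 1.4 1.4 1.4 1.5 1.4 100 1.2 1.2 1 1 99.0
P3 1 1 1 1 1 1 2 100 2 2 99.0
P4 20 20 21 20 22 20 20 19 20 20 57.7
P5 19 21 20 20 20 21 18 20 22 20 60.5
P6 21 21 18 19 20 19 21 20 20 20 59.8
P1: d2 changed from 1 to 100, Dist changes from 0.93 to 99.0
P2: d6 changed from 1 to 100, Dist changes from 0.98 to 99.0
P3: d8 changed from 1 to 100, Dist changes from 1.73 to 99.0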
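The effect of a single dissimilar dimension on the Euclidean distance can be checked directly. The snippet below is a small illustration using the Q, P1 and P3 rows of the table; the variable names are assumptions, and `math.dist` requires Python 3.8 or later.

```python
from math import dist

Q = (1,) * 10
P1 = (1.1, 1, 1.2, 1.6, 1.1, 1.6, 1.2, 1.2, 1, 1)
P3 = (1, 1, 1, 1, 1, 1, 2, 1, 2, 2)

# Replace one coordinate by 100: d2 for P1 and d8 for P3.
P1_changed = P1[:1] + (100,) + P1[2:]
P3_changed = P3[:7] + (100,) + P3[8:]

before = (round(dist(Q, P1), 2), round(dist(Q, P3), 2))
after = (round(dist(Q, P1_changed), 1), round(dist(Q, P3_changed), 1))
```

A single changed coordinate dominates the aggregate, even though the other nine dimensions are still close to the query.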
Thoughts
• Aggregating too many dimensional differences into a single value
results in too much information loss. Can we try to reduce that loss?
• While high dimensional data typically give us problems when it
comes to similarity search, can we turn what is against us into an
advantage?
• Our approach: Since we have so many dimensions, we can
compute more complex statistics over these dimensions to
overcome some of the “noise” introduced by scaling of
dimensions, outliers etc.
The N-Match Query : Warm-Up
• Description
–Matches between two objects in n dimensions. (n ≤ d)
–The n dimensions are chosen dynamically to make the two objects match best.
• How to define a “match”
–Exact match
–Match with tolerance δ
• The similarity search example
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
n=6
ID d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 Dist (6-match difference)
P1 1.1 100 1.2 1.6 1.1 1.6 1.2 1.2 1 1 0.2
P2 1.4 1.4 1.4 1.5 1.4 100 1.2 1.2 1 1 0.4
P3 1 1 1 1 1 1 2 100 2 2 0
P4 20 20 21 20 22 20 20 19 20 20 19
P5 19 21 20 20 20 21 18 20 22 20 19
P6 21 21 18 19 20 19 21 20 20 20 19
The N-Match Query : The Definition
• The n-match difference
Given two d-dimensional points P(p1, p2, …, pd) and Q(q1, q2, …, qd), let δi
= |pi - qi|, i=1,…,d. Sort the array {δ1 , …, δd} in increasing order and let
the sorted array be {δ1’, …, δd’}. Then δn’ is the n-match difference
between P and Q.
• The n-match query
Given a d-dimensional database DB, a query point Q and an
integer n (n≤d), find the point P ∈ DB that has the smallest
n-match difference to Q. P is called the n-match of Q.
[Figure: 2-D example with query Q at the origin and points A, B, C, D, E on 0-10 axes; 1-match = A, 2-match = B]
• The similarity search example
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
n=6
ID d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 Dist (6-match difference)
P1 1.1 100 1.2 1.6 1.1 1.6 1.2 1.2 1 1 0.2
P2 1.4 1.4 1.4 1.5 1.4 100 1.2 1.2 1 1 0.4
P3 1 1 1 1 1 1 2 100 2 2 0
P4 20 20 21 20 22 20 20 19 20 20 19
P5 19 21 20 20 20 21 18 20 22 20 19
P6 21 21 18 19 20 19 21 20 20 20 19
The slide also steps n to 7 and 8: P3's difference becomes 1 at n=7, and P1's becomes 0.6 at n=8.
The N-Match Query : Extensions
• The k-n-match query
Given a d-dimensional database DB, a query point Q, an integer k, and an
integer n, find a set S which consists of k points from DB so that for any
point P1 ∈ S and any point P2 ∈ DB-S, P1’s n-match difference is smaller
than P2’s n-match difference. S is called the k-n-match of Q.
• The frequent k-n-match query
Given a d-dimensional database DB, a query point Q, an integer
k, and an integer range [n0, n1] within [1,d], let S0, …, Si be
the answer sets of k-n0-match, …, k-n1-match, respectively;
find a set T of k points, so that for any point P1 ∈ T and any point
P2 ∈ DB-T, P1’s number of appearances in S0, …, Si is larger
than or equal to P2’s number of appearances in S0, …, Si .
[Figure: points A, B, C, D, E and the query Q in the x-y plane; 2-1-match={A,D}, 2-2-match={A,B}]
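A brute-force sketch of the two query types may help make the definitions concrete. It assumes ties are broken by object ID (the slides do not specify a tie rule), and it borrows the five 3-dimensional objects and the query (3.0, 7.0, 4.0) from the AD algorithm example in these notes; all function names are hypothetical.

```python
from collections import Counter


def n_match_difference(p, q, n):
    """n-th smallest per-dimension difference between p and q."""
    return sorted(abs(a - b) for a, b in zip(p, q))[n - 1]


def k_n_match(db, q, k, n):
    """IDs of the k points with the smallest n-match difference (ties by ID)."""
    ranked = sorted(db, key=lambda pid: (n_match_difference(db[pid], q, n), pid))
    return ranked[:k]


def frequent_k_n_match(db, q, k, n0, n1):
    """IDs of the k points appearing most often in the k-n-match answers, n in [n0, n1]."""
    counts = Counter()
    for n in range(n0, n1 + 1):
        counts.update(k_n_match(db, q, k, n))
    ranked = sorted(counts, key=lambda pid: (-counts[pid], pid))
    return ranked[:k]


# Object ID -> (d1, d2, d3), taken from the AD algorithm example.
DB = {
    1: (0.4, 1.0, 1.0),
    2: (2.8, 5.5, 2.0),
    3: (6.5, 7.8, 5.0),
    4: (9.0, 9.0, 9.0),
    5: (3.5, 1.5, 8.0),
}
Q = (3.0, 7.0, 4.0)

if __name__ == "__main__":
    print(k_n_match(DB, Q, k=2, n=2))
    print(frequent_k_n_match(DB, Q, k=2, n0=1, n1=3))
```

This version scans every point for every n, so it is only a reference for checking results; the AD algorithm avoids touching most attributes.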
Cost Model
• The multiple system information retrieval model
–Objects are stored in different systems and scored by each system
–Each system can sort the objects according to their scores
–A query retrieves the scores of objects from different systems and then combines them using some
aggregation function
Q : color=“red” & shape=“round” & texture=“cloud”
System 1: Color      System 2: Shape      System 3: Texture
Object ID  Score     Object ID  Score     Object ID  Score
1          0.4       1          1.0       1          1.0
2          2.8       5          1.5       2          2.0
5          3.5       2          5.5       3          5.0
3          6.5       3          7.8       5          8.0
4          9.0       4          9.0       4          9.0
• The cost
–Retrieval of scores – proportional to the number of scores retrieved
• The goal
–To minimize the scores retrieved
The AD Algorithm
• The AD algorithm for the k-n-match query
–Locate the query’s attribute value in every dimension
–Retrieve the objects’ attribute values outward from the query’s attribute value, in both directions
–The objects’ attributes are retrieved in Ascending order of their Differences to the query’s attributes. An n-match is found
when an object appears n times.
Example: 2-2-match of Q : ( 3.0 , 7.0 , 4.0 )
d1                   d2                   d3
Object ID  Attr      Object ID  Attr      Object ID  Attr
1          0.4       1          1.0       1          1.0
2          2.8       5          1.5       2          2.0
5          3.5       2          5.5       3          5.0
3          6.5       3          7.8       5          8.0
4          9.0       4          9.0       4          9.0
Auxiliary structures
–Next attribute to retrieve g[2d], as (object ID, difference) for each dimension and direction;
initially d1: (2, 0.2) and (5, 0.5); d2: (2, 1.5) and (3, 0.8); d3: (2, 2.0) and (3, 1.0)
–Number of appearances appear[c] for objects 1 to 5; at the end: 0, 2, 2, 0, 1
–Answer set S = { 3 , 2 }
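The retrieval order described above can be sketched with one sorted list per dimension and a heap over the 2d cursors. This is a minimal in-memory sketch under the assumption that every dimension fits in memory and is pre-sorted; the heap stands in for the g[2d] array, and the function name is hypothetical.

```python
import bisect
import heapq


def ad_k_n_match(db, q, k, n):
    """Return (answer IDs in the order found, number of attributes retrieved)."""
    d = len(q)
    # One sorted (value, id) list per dimension, like the sorted "systems".
    columns = [sorted((vec[i], pid) for pid, vec in db.items()) for i in range(d)]
    heap = []  # entries: (difference, dimension, direction, position)

    def push(dim, direction, pos):
        col = columns[dim]
        if 0 <= pos < len(col):
            diff = abs(col[pos][0] - q[dim])
            heapq.heappush(heap, (diff, dim, direction, pos))

    # Locate the query value in every dimension and open two cursors there.
    for dim in range(d):
        start = bisect.bisect_left(columns[dim], (q[dim],))
        push(dim, -1, start - 1)  # cursor moving towards smaller values
        push(dim, +1, start)      # cursor moving towards larger values

    appear = {pid: 0 for pid in db}
    answer, retrieved = [], 0
    while heap and len(answer) < k:
        _, dim, direction, pos = heapq.heappop(heap)  # smallest difference first
        retrieved += 1
        pid = columns[dim][pos][1]
        appear[pid] += 1
        if appear[pid] == n:      # pid matches the query in n dimensions
            answer.append(pid)
        push(dim, direction, pos + direction)
    return answer, retrieved


DB = {
    1: (0.4, 1.0, 1.0),
    2: (2.8, 5.5, 2.0),
    3: (6.5, 7.8, 5.0),
    4: (9.0, 9.0, 9.0),
    5: (3.5, 1.5, 8.0),
}

if __name__ == "__main__":
    print(ad_k_n_match(DB, (3.0, 7.0, 4.0), k=2, n=2))
```

On the slide's example the sketch retrieves only 5 of the 15 attribute values before both answers are known, which is the quantity the cost model asks to minimize.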
The AD Algorithm : Extensions
• The AD algorithm for the frequent k-n-match query
–The frequent k-n-match query
• Given an integer range [n0, n1], find the k-n0-match, k-(n0+1)-match, ... , k-n1-match
of the query: S0, S1, ... , Si.
• Find k objects that appear most frequently in S0, S1, ... , Si.
–Retrieve the same number of attributes as processing a k-n1-match query.
• Disk based solutions for the (frequent) k-n-match query
–Disk based AD algorithm
• Sort each dimension and store them sequentially on the disk
• When reaching the end of a disk page, read the next page from disk
–Existing indexing techniques
• Tree-like structures: R-trees, k-d-trees
• Mapping based indexing: space-filling curves, iDistance
• Sequential scan
• Compression based approach (VA-file)
Experiments : Effectiveness
• Searching by k-n-match
–COIL-100 database
–54 features extracted, such as color histograms, area moments
k-n-match query, k=4
n    Images returned
5    36, 42, 78, 94
10   27, 35, 42, 78
15   3, 38, 42, 78
20   27, 38, 42, 78
25   35, 40, 42, 94
30   10, 35, 42, 94
35   35, 42, 94, 96
40   35, 42, 94, 96
45   35, 42, 94, 96
50   35, 42, 94, 96
kNN query
k    Images returned
10   13, 35, 36, 40, 42, 64, 85, 88, 94, 96
• Searching by frequent k-n-match
–UCI Machine learning repository
–Competitors: IGrid, Human-Computer Interactive NN search (HCINN)
Data sets (d)      IGrid   HCINN   Freq. k-n-match
Ionosphere (34)    80.1%   86%     87.5%
Segmentation (19)  79.9%   83%     87.3%
Wdbc (30)          87.1%   N.A.    92.5%
Glass (9)          58.6%   N.A.    67.8%
Iris (4)           88.9%   N.A.    89.6%
Experiments : Efficiency
• Disk based algorithms for the Frequent k-n-match query
–Texture dataset (68,040 records); uniform dataset (100,000 records)
–Competitors:
• The AD algorithm
• VA-file
• Sequential scan
Experiments : Efficiency (continued)
• Comparison with other similarity search techniques
–Texture dataset ; synthetic dataset
–Competitors:
• Frequent k-n-match query using the AD algorithm
• IGrid
• scan
Future Work(I)
• We now have a natural way to handle similarity search for
data with both categorical and numerical attributes. Investigating
k-n-match performance on such mixed-type data is currently
under way
• Likewise, applying k-n-match on data with missing or
uncertain attributes will be interesting
• Query={1,1,1,1,1,1,1,M,No,R}
ID d1 d2 d3 d4 d5 d6 d7 d8 d9 d10
P1 1.1 1 1.2 1.6 1.1 1.6 1.2 M Yes R
P2 1.4 1.4 1.4 1.5 1.4 1 1.2 F No B
P3 1 1 1 1 1 1 2 M No B
P4 20 20 21 20 22 20 20 M Yes G
P5 19 21 20 20 20 21 18 F Yes R
P6 21 21 18 19 20 19 21 F Yes Y
Future Work(I)
• We now have a natural way to handle similarity search for
data with both categorical and numerical attributes. Investigating
k-n-match performance on such mixed-type data is currently
under way
• Likewise, applying k-n-match on data with missing or
uncertain attributes will be interesting
• Query={1,1,1,1,1,1,1,M,No,R}
ID d1 d2 d3 d4 d5 d6 d7 d8 d9 d10
P1 1 1.2 1.6 1.1 1.6 1.2 M R
P2 1.4 1.4 1.5 1 1.2 F No B
P3 1 1 1 1 1 2 M No B
P4 20 20 20 22 20 20 M G
P5 19 21 20 20 20 18 Yes R
P6 21 18 20 21 F Yes Y
Future Work(II)
• In general, three things affect the result of a similarity search:
noise, scaling and axes orientation. K-n-match reduces the effect of
noise. The ultimate aim is a similarity function that is robust
to noise, scaling and axes orientation
• Eventually we will look at creating mining algorithms using k-n-match
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
–Similarity Function on k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric
Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Motivation
[Figure: a query is issued against large data sets, which return results]
• Ever-increasing data collection rates of modern
enterprises and the need for effective, guaranteed-quality
approximate answers to queries
• Concern: compress as much as possible.
Conventional Compression Method
• Try to find the optimal encoding of arbitrary strings for
the input data:
–Huffman Coding
–Lempel-Ziv Coding (gzip)
• View the whole table as a large byte string
• Statistical or dictionary based
• Operate at the byte level
Why not just “syntactic”?
• Do not exploit the complex dependency patterns in the table
• Individual retrieval of tuple is difficult
• Do not utilize lossy compression
Semantic compression methods
• Derive a descriptive model M
• Identify the data values which can be derived from M (within
some error tolerance), which are essential for deriving, and
which are the outliers
• Derived values need not be stored; only the outliers need to be
Advantages
• More Complex Analysis
–Example: detect correlation among columns
• Fast Retrieval
–Tuple-wise access
• Query Enhancement
–Possible to answer queries directly from the discovered semantics
–Compress in a way that enhances the answering of some complex
queries, e.g. “Go Green: Recycle and Reuse Frequent Patterns”, C.
Gao, B. C. Ooi, K. L. Tan and A. K. H. Tung. ICDE’2004.
Choose a combination of compression methods
based on semantic and syntactic information
Fascicles
• Key observation
–Often, numerous subsets of records in T have similar values for
many attributes
Protocol  Duration  Bytes  Packets
http      12        20K    3
http      16        24K    5
http      15        20K    8
http      19        40K    11
http      26        58K    18
ftp       27        100K   24
ftp       32        300K   35
ftp       18        80K    15
• Compress data by storing representative values (e.g., “centroid”) only once for each
attribute cluster
• Lossy compression: information loss is controlled by the notion of “similar values” for
attributes (user-defined)
ItCompress: Compression Format
Original Table
age  salary  credit  sex
20   30k     poor    M
25   76k     good    F
30   90k     good    F
40   100k    poor    M
50   110k    good    F
60   50k     good    M
70   35k     poor    F
75   15k     poor    M
Error Tolerance:
age  salary  credit  sex
5    25k     0       0
Representative Rows (Patterns)
RRid  age  salary  credit  sex
1     30   90k     good    F
2     70   35k     poor    M
Compressed Table
RRid  bitmap  Outlying value
2     0111    20
1     1111
1     1111
1     0100    40, poor, M
1     0111    50
1     0010    60, 50k, M
2     1110    F
2     1111
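To make the compression format concrete, the following sketch encodes a row against a given representative row and decodes it again. It is an illustration only: salaries are written in thousands, categorical attributes are matched exactly (a simplification of the probabilistic tolerance), and the function names are invented for this sketch.

```python
TOLERANCE = (5, 25, 0, 0)  # age, salary (in k), credit, sex

REPRESENTATIVES = {
    1: (30, 90, "good", "F"),
    2: (70, 35, "poor", "M"),
}


def matches(value, rep_value, tol):
    """Numeric values match within tol; categorical values must be equal."""
    if isinstance(value, (int, float)):
        return abs(value - rep_value) <= tol
    return value == rep_value


def encode(row, rr_id, representatives=REPRESENTATIVES, tolerance=TOLERANCE):
    """Return (RRid, bitmap, outlying values) for row against representative rr_id."""
    rep = representatives[rr_id]
    bits = [matches(v, r, t) for v, r, t in zip(row, rep, tolerance)]
    bitmap = "".join("1" if b else "0" for b in bits)
    outliers = [v for v, b in zip(row, bits) if not b]
    return rr_id, bitmap, outliers


def decode(rr_id, bitmap, outliers, representatives=REPRESENTATIVES):
    """Rebuild a row: representative value where the bit is 1, outlier otherwise."""
    rep = representatives[rr_id]
    remaining = iter(outliers)
    return tuple(r if bit == "1" else next(remaining) for r, bit in zip(rep, bitmap))


if __name__ == "__main__":
    code = encode((40, 100, "poor", "M"), 1)
    print(code)            # compressed form of the fourth row
    print(decode(*code))   # lossy reconstruction: salary becomes 90
```

The decoded row differs from the original only in attributes whose bit is 1, and there only within the error tolerance.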
Some definitions
• Error tolerance
–Numeric attributes
• The upper bound that x’ can be different from x
• x ∈ [ x’-ei, x’+ei ]
–Categorical attributes
• The upper bound on the probability that the compressed
value differs from actual value
• Given an actual value x and its error tolerance ei, the
compressed value x’ should satisfy: Prob( x=x’ ) ≥ 1 - ei
Some definitions
• Coverage
–Let R be a row in the table T, and Pi be a pattern
–The coverage of Pi on R :
cov( Pi , R ) = number of attributes Xj for which
R[ Xj ] is matched by Pi[ Xj ]
• Total coverage
–Let P be a set of patterns P1,…,Pk; and the table T
contains n rows R1,…,Rn
–$totalcov(P, T) = \sum_{i=1}^{n} cov(P_{max}(R_i), R_i)$,
where $P_{max}(R_i)$ is the pattern in P with the largest coverage on $R_i$
ItCompress: basic algorithm
• First randomly choose k rows as initial patterns
• Phase 1: scan the table T
–For each row R, compute the coverage of each pattern on it,
then try to find Pmax(R)
–Allocate R to its most covering pattern
• Phase 2: after each scan, re-compute all patterns’
attributes, always using the most frequent values
• Iterate until the total coverage does not increase
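The two phases can be sketched as follows. This is a simplified reading of the algorithm rather than the authors' code: initial patterns are passed in explicitly instead of being chosen at random, categorical attributes are matched exactly, and in Phase 2 the "most frequent value" of a numeric attribute is approximated by the candidate value (the current pattern value or a value of an allocated row) that matches the most allocated rows within the tolerance.

```python
def matches(value, pat_value, tol):
    """Numeric values match within tol; categorical values must be equal."""
    if isinstance(value, (int, float)):
        return abs(value - pat_value) <= tol
    return value == pat_value


def cov(pattern, row, tol):
    """Number of attributes of row matched by pattern."""
    return sum(matches(v, p, t) for v, p, t in zip(row, pattern, tol))


def itcompress(rows, patterns, tol, max_iter=20):
    """Return (final patterns, total coverage after Phase 1 of each iteration)."""
    patterns = [tuple(p) for p in patterns]
    history = []
    for _ in range(max_iter):
        # Phase 1: allocate each row to the pattern that covers it best.
        assign = [max(range(len(patterns)), key=lambda i: cov(patterns[i], r, tol))
                  for r in rows]
        total = sum(cov(patterns[a], r, tol) for a, r in zip(assign, rows))
        if history and total <= history[-1]:
            break  # total coverage stopped increasing
        history.append(total)
        # Phase 2: per pattern and attribute, keep the value matching most rows.
        new_patterns = []
        for i, pat in enumerate(patterns):
            group = [r for a, r in zip(assign, rows) if a == i]
            new_pat = []
            for j, current in enumerate(pat):
                candidates = [current] + [r[j] for r in group]
                best = max(candidates,
                           key=lambda c: sum(matches(r[j], c, tol[j]) for r in group))
                new_pat.append(best)
            new_patterns.append(tuple(new_pat))
        patterns = new_patterns
    return patterns, history


ROWS = [
    (20, 30, "poor", "M"), (25, 76, "good", "F"), (30, 90, "good", "F"),
    (40, 100, "poor", "M"), (50, 110, "good", "F"), (60, 50, "good", "M"),
    (70, 35, "poor", "F"), (75, 15, "poor", "M"),
]
TOL = (5, 25, 0, 0)  # age, salary (in k), credit, sex

if __name__ == "__main__":
    print(itcompress(ROWS, [ROWS[0], ROWS[1]], TOL, max_iter=1))
    print(itcompress(ROWS, [ROWS[0], ROWS[1]], TOL))
```

Keeping the current pattern value among the Phase 2 candidates is a design choice of this sketch: it guarantees that re-computation never lowers the coverage of the rows already allocated to a pattern.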
Example: the 1st iteration begins
Original Table
age  salary  credit  sex
20   30k     poor    M
25   76k     good    F
30   90k     good    F
40   100k    poor    M
50   110k    good    F
60   50k     good    M
70   35k     poor    F
75   15k     poor    M
Patterns
RRid  age  salary  credit  sex
1     20   30k     poor    M
2     25   76k     good    F
Error Tolerance:
age  salary  credit  sex
5    25k     0       0
Example: Phase 1
Patterns
RRid  age  salary  credit  sex
1     20   30k     poor    M
2     25   76k     good    F
Rows allocated to pattern 1
age  salary  credit  sex
20   30k     poor    M
40   100k    poor    M
60   50k     good    M
70   35k     poor    F
75   15k     poor    M
Rows allocated to pattern 2
age  salary  credit  sex
25   76k     good    F
30   90k     good    F
50   110k    good    F
Error Tolerance:
age  salary  credit  sex
5    25k     0       0
Example: Phase 2
Patterns after re-computation (old value → new value)
RRid  age       salary      credit  sex
1     20 → 70   30k         poor    M
2     25        76k → 90k   good    F
Rows allocated to pattern 1
age  salary  credit  sex
20   30k     poor    M
40   100k    poor    M
60   50k     good    M
70   35k     poor    F
75   15k     poor    M
Rows allocated to pattern 2
age  salary  credit  sex
25   76k     good    F
30   90k     good    F
50   110k    good    F
Error Tolerance:
age  salary  credit  sex
5    25k     0       0
Convergence(I)
• Phase 1:
–When we assign the rows to their most covering patterns:
• For each row, the coverage increases or stays the same
• So the total coverage also increases or stays the same
• Phase 2:
–When we re-compute the attribute values for the patterns:
• For each pattern, the coverage increases or stays the same
• So the total coverage also increases or stays the same
Convergence(II)
• In both Phase 1 and Phase 2, the total coverage either increases
or stays the same, and it has an obvious upper bound (covering
the whole table)
• The algorithm will therefore converge eventually
Complexity
• Phase 1:
–In each of the l iterations, we need to go through the n rows in the table and
match each row against the k patterns (2m comparisons per match)
–The running time complexity is O(kmnl), where m is the
number of attributes
• Phase 2:
–Computing each new pattern Pi requires going through all
the domain values/intervals of each attribute
–Assuming the total number of domain values/intervals is d, the
running time complexity is O(kdl)
• The total time complexity is O(kmnl+kdl)
Advantages of ItCompress
• Simplicity and Directness
–Fascicles and SPARTAN use a two-phase process
• Find rules/patterns
• Compress database using discovered rules/patterns
–ItCompress optimizes the compression directly without finding
rules/patterns that may not be useful (a.k.a. microeconomic approach)
• Fewer constraints
–Does not need patterns to be matched completely or rules that apply
globally
• Easily tuned parameters
Performance Comparison
• Algorithms
–ItCompress, ItCompress+gzip
–Fascicles, Fascicles+gzip
–SPARTAN+gzip
• Platform
–ItCompress, Fascicles: AMD Duron 700 MHz, 256MB Memory
–SPARTAN: Four 700 MHz Pentium CPUs, 1GB Memory
• Datasets
–Corel: 32 numeric attributes, 35000 rows, 10.5MB
–Census: 7 numeric, 7 categorical, 676000 rows, 28.6MB
–Forest-cover: 10 numeric, 44 categorical, 581000 rows, 75.2MB
Effectiveness (Corel)
Effectiveness (Census)
Effectiveness (Forest Cover)
Efficiency
Varying k
Varying Sample Ratio
Adding Noises (Census)
Effect of Corruption
[Figure: 20% corruption, shown for attributes A1 to A12]
Findings
• ItCompress is
–More efficient than SPARTAN
–More effective than Fascicles
–Insensitive to parameter setting
–Robust to noise
Future work
• Can we perform mining on the compressed datasets using
only the patterns and the bitmap ?
–Example: Building Bayesian Belief Network
• Is ItCompress a good “bootstrap” semantic compression
algorithm ?
[Figure: database → ItCompress → compressed database → other semantic compression algorithms]
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
–Similarity Function on k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric
Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Metric vs. Non-Metric
• Euclidean distance dominates DB queries
• Similarity in human perception
• Metric distance is not enough!
2010-7-31 Mining and Searching Complex Structures 52
Bregman Divergence
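[Figure: a convex function f(x) with the points (p, f(p)) and (q, f(q)); the Bregman divergence Df(p,q) is the gap h at p
between f and the tangent drawn at q, shown alongside the Euclidean distance between q and p]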
Bregman Divergence
• Mathematical Interpretation
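–The distance between p and q is defined as the difference
between f(p) and the first order Taylor expansion of f at q, evaluated at p:
$D_f(p,q) = f(p) - \left[ f(q) + \langle \nabla f(q),\, p - q \rangle \right]$
(first term: f(x) at p; bracketed term: first order Taylor expansion at q)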
Bregman Divergence
• General Properties
–Non-Negativity
• Df(p,q)≥0 for any p, q
–Identity of Indiscernibles
• Df(p,p)=0 for any p
–Symmetry and Triangle Inequality
• Do NOT hold any more
Examples
Distance                f(x)              Df(p,q)                         Usage
KL-Divergence           x log x           p log (p/q)                     distribution, color histogram
Itakura-Saito Distance  -log x            p/q - log (p/q) - 1             signal, speech
Squared Euclidean       x^2               (p-q)^2                         Euclidean space
Von Neumann Entropy     tr(X log X - X)   tr(X log X - X log Y - X + Y)   symmetric matrix
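The rows of the table can be checked numerically with a small generic routine. The sketch below assumes vectors are handled coordinate-wise (f is summed over the coordinates) and that the KL row is applied to probability vectors that sum to 1, in which case the extra "- p + q" terms cancel; all names are chosen for this illustration.

```python
import math


def bregman(f, grad_f, p, q):
    """D_f(p, q) = f(p) - f(q) - <grad f(q), p - q> for vectors p and q."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_f(q), p, q))
    return f(p) - f(q) - inner


# Each entry is (f, gradient of f), with f summed over the coordinates.
SQUARED_EUCLIDEAN = (lambda x: sum(v * v for v in x),
                     lambda x: [2 * v for v in x])
KL = (lambda x: sum(v * math.log(v) for v in x),
      lambda x: [math.log(v) + 1 for v in x])
ITAKURA_SAITO = (lambda x: -sum(math.log(v) for v in x),
                 lambda x: [-1 / v for v in x])

if __name__ == "__main__":
    p, q = [0.2, 0.8], [0.5, 0.5]
    print("squared Euclidean:", bregman(*SQUARED_EUCLIDEAN, p, q))
    print("KL(p, q):", bregman(*KL, p, q))
    print("KL(q, p):", bregman(*KL, q, p))  # differs: no symmetry
    print("Itakura-Saito:", bregman(*ITAKURA_SAITO, p, q))
```

Swapping p and q in the KL case gives a different value, which illustrates why symmetry cannot be assumed when indexing.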
Why in DB system?
• Database application
–Retrieval of similar images, speech signals, or time series
–Optimization on matrices in machine learning
–Efficiency is important!
• Query Types
–Nearest Neighbor Query
–Range Query
Euclidean Space
• How to answer the queries
–R-Tree
Euclidean Space
• How to answer the queries
–VA File
Our goal
• Re-use the infrastructure of existing DB system to support
Bregman divergence
–Storage management
–Indexing structures
–Query processing algorithms
Basic Solution
• Extended Space
–Convex function f(x) = x^2; the extra dimension is D3 = f(D1) + f(D2)
Original points      Extended points
point  D1   D2       point  D1   D2   D3
p      0    1        p+     0    1    1
q      0.5  0.5      q+     0.5  0.5  0.5
r      1    0.8      r+     1    0.8  1.64
t      1.5  0.3      t+     1.5  0.3  2.34
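One way to see why the extra coordinate is useful: once the sum of f over a point's coordinates is stored with the point, the divergence to a fixed query q becomes a linear function of the extended point, so bounds over a rectangle of extended points are easy to evaluate. The sketch below illustrates that identity for f(x) = x^2; the function names are hypothetical and this is not claimed to be the exact procedure used in the original work.

```python
def extend(x, f):
    """Append sum_i f(x_i) as an extra coordinate."""
    return list(x) + [sum(f(v) for v in x)]


def divergence_via_extension(p_plus, q, f, grad):
    """D_f(p, q) written as a linear function of the extended point p+."""
    *p, f_p = p_plus
    g = [grad(v) for v in q]
    # Constant part depends on the query only: <grad f(q), q> - f(q).
    const = sum(gi * qi for gi, qi in zip(g, q)) - sum(f(v) for v in q)
    return f_p - sum(gi * pi for gi, pi in zip(g, p)) + const


def square(v):
    return v * v


def square_grad(v):
    return 2 * v


if __name__ == "__main__":
    q = (0.5, 0.5)
    for name, point in [("p", (0, 1)), ("r", (1, 0.8))]:
        plus = extend(point, square)
        print(name, plus, divergence_via_extension(plus, q, square, square_grad))
```

For f(x) = x^2 the result equals the squared Euclidean distance between the original point and q, which gives a quick sanity check.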
Basic Solution
• After the extension
–Index extended points with R-Tree or VA File
–Re-use existing algorithms with lower and upper bounds on
the rectangles
How to improve?
• Reformulation of Bregman divergence
• Tighter bounds are derived
• No change on index construction or query processing
algorithm
A New Formulation
[Figure: the new formulation, with labels h, h’, query vector vq, Df(p,q)+Δ and D*f(p,q), and q and p on the horizontal axis]
Math. Interpretation
• Reformulation of similarity search queries
–k-NN query: query q, data set P, divergence Df
• Find the k points p ∈ P minimizing Df(p,q)
–Range query: query q, threshold θ, data set P
• Return every point p ∈ P such that Df(p,q) ≤ θ
Naïve Bounds
• Check the corners of the bounding rectangles
Tighter Bounds
• Take the curve f(x) into consideration
Query distribution
• Distortion of rectangles
–The difference between maximum and minimum distances
from inside the rectangle to the query
Can we improve it more?
• When Building R-Tree in Euclidean space
–Minimize the volume/edge length of MBRs
–Does it remain valid?
Query distribution
• Distortion of bounding rectangles
–Invariant in Euclidean space (triangle inequality)
–Query-dependent for Bregman Divergence
Utilize Query Distribution
• Summarize the query distribution with O(d) real numbers
• Estimate the expected distortion of any bounding
rectangle in O(d) time
• Allows a better index to be constructed for both R-Tree and
VA File
Experiments
• Data Sets
–KDD’99 data
• Network data: the proportion of packets in 72 different
TCP/IP connection types
–DBLP data
• Use the co-authorship graph to generate the probabilities that
authors are related to 8 different areas
Experiment
• Data Sets
–Uniform Synthetic data
• Generate synthetic data with uniform distribution
–Clustered Synthetic data
• Generate synthetic data with Gaussian Mixture Model
Experiments
• Methods to compare
              Basic   Improved Bounds   Query Distribution
R-Tree        R       R-B               R-BQ
VA File       V       V-B               V-BQ
Linear Scan   LS
BB-Tree       BBT
Existing Solution
• BB-Tree (L. Cayton, ICML 2008)
–Memory-based indexing tree
–Construct with k-means clustering
–Hard to update
–Ineffective in high-dimensional space
Experiments
• Index Construction Time
Experiments
• Varying dimensionality
Experiments
• Varying dimensionality (cont.)
Experiments
• Varying data cardinality
Conclusion
• A general technique for similarity search with Bregman divergence
• All techniques are based on the existing infrastructure of
commercial databases
• Extensive experiments compare performance on R-Tree
and VA File with different optimizations
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
–Similarity Function on k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric
Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Motivation
• Probabilistic data is ubiquitous
–To represent the data uncertainty (WSN, RFID, moving
object monitoring)
–To compress data (image processing)
• Histogram is a good way to represent the prob. data
–Easy to capture
–Is very useful in image representation
• Colors
• Textures
• Gradient
• Depth
Motivation
• Similarity search is important for managing prob. data
–Given a threshold θ, can answer which sensors’ readings
are similar to those of sensor A (range query)
–Can answer which k pictures are most similar (top-k query)
• Similarity function for prob. data should be carefully
chosen
–Bin by bin methods
• L1 and L2 norms
• χ2 distance
–Cross-bin methods
• Earth Mover’s Distance (EMD)
• Quadratic form
Outline
• Motivation
• Introduction to Earth Mover’s Distance (EMD)
• Related works
• Indexing the probabilistic data based on EMD
• Experimental results
• Conclusion and future work
Introduction to Earth Mover’s Dist
• Bin by bin vs. cross bin
–Bin-by-bin: not good!
–Cross bin: good! Can handle distribution shift
Introduction to Earth Mover’s Dist
• What is EMD?
–Earth: soil
–Mover: to move, to transport
–Distance: cost
–Can be understood as the cost of moving soil
• See an example…
Moving Earth
[Figure: animation in which earth is moved until one histogram becomes equal to the other]
The Difference?
Difference = (amount moved) * (distance moved)
Linear programming
• P has m bins (clusters); Q has n bins (clusters)
• Work = sum over all movements of (distance moved) * (amount moved)
Constraints
1. Move “earth” only from P to Q
2. Cannot send more “earth” than there is
3. Q cannot receive more “earth” than it can hold
4. As much “earth” as possible must be moved
[Figure: P with m clusters and Q with n clusters, with subsets P’ and Q’]
The Formal Definition of EMD
• Earth Mover’s Distance (EMD)
–The minimum amount of work needed to change one
histogram into another
• Challenge of EMD
–Time complexity: O(N^3 log N)
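For intuition, the general linear program is not needed in one special case: when both histograms lie on the same one-dimensional line of bins, have equal total mass, and the ground distance between bins i and j is |i - j|, the EMD equals the sum of absolute differences between the two cumulative histograms. The sketch below covers only that special case (an assumption of this example, not the general setting of the slides) and contrasts it with a bin-by-bin L1 distance.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two equal-mass 1-D histograms with ground distance |i - j|."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have equal total mass")
    # Mass that has to cross each bin boundary = difference of the prefix sums.
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def l1(p, q):
    """Bin-by-bin L1 distance, for comparison."""
    return sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    base = [1, 0, 0]
    for shifted in ([0, 1, 0], [0, 0, 1]):
        print(shifted, "L1 =", l1(base, shifted), "EMD =", emd_1d(base, shifted))
```

The bin-by-bin distance cannot tell a shift by one bin from a shift by two bins, while the cross-bin EMD grows with the size of the shift.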
Related Works
• Filter-and-refine framework
–[1] Approximation Techniques for Indexing the Earth Mover's Distance in
Multimedia Databases. ICDE 2006
• Cannot handle high dimensional histograms
–[2] Efficient EMD-based Similarity Search in Multimedia Databases via
Flexible Dimensionality Reduction. SIGMOD 2008
• Based on a scan framework, which limits scalability
• Use scanning scheme to process queries
–Merit: can obtain a good access order when executing k-NN queries and thus
can minimize the number of candidates
–Demerit: needs to scan the whole dataset to obtain the order, hence low
scalability
Related Works
• Related works
–Based on the filter-and-refine framework
–Based on scanning, hence low scalability
• Our work
–Also based on the filter-and-refine framework
–But avoids scanning the whole data set
• Uses B+ trees
• And thus can obtain high scalability
• Our contributions
–To the best of our knowledge, the first work to index high
dimensional prob. data based on the EMD
–Proposed algorithms for processing similarity queries based on a B+ tree
filter
–Improved the efficiency and scalability of EMD-based similarity search
Indexing the probabilistic data
based on EMD
• Our intuition:
–primal-dual theory in linear programming
• Primal problem (EMD)
• Dual problem
Indexing the probabilistic data based on
EMD
• Good properties of dual space
–Constraints of the dual space are independent of the prob. data points (i.e., p and
q in this example)
• Thus, given any feasible solution (π, Ф) in the dual space, we can derive a
lower bound for EMD(p, q)
• The lower bound can help to filter out histograms that cannot be hits.
–given any feasible solution (π, Ф) in dual space, a histogram p can be
mapped as a value, using the operation of
• Can index histograms using B+ tree
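The lower-bound idea can be sketched for the same one-dimensional special case as before (equal-mass histograms, ground distance |i - j|). By weak duality, any vectors pi and phi with pi[i] + phi[j] <= cost(i, j) for all i, j give key(p) + counter_key(q) <= EMD(p, q), so a histogram whose key exceeds θ minus the query's counter key cannot be an answer. The concrete choice pi[i] = i, phi[j] = -j, the function names, and the one-sided filter are assumptions of this sketch; the exact key definitions and bounds used in the original work are not reproduced here.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Exact EMD for equal-mass 1-D histograms with ground distance |i - j|."""
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def key(p, pi):
    """Map a data histogram to a single value using the dual vector pi."""
    return sum(w * x for w, x in zip(pi, p))


def counter_key(q, phi):
    """Map a query histogram to a single value using the dual vector phi."""
    return sum(w * x for w, x in zip(phi, q))


def is_feasible(pi, phi, cost):
    """Dual feasibility: pi[i] + phi[j] <= cost(i, j) for every pair of bins."""
    return all(pi[i] + phi[j] <= cost(i, j)
               for i in range(len(pi)) for j in range(len(phi)))


def range_filter(db, q, theta, pi, phi):
    """Keep only histograms whose lower bound key(p) + counter_key(q) is <= theta."""
    ck = counter_key(q, phi)
    return [pid for pid, p in db.items() if key(p, pi) + ck <= theta]


BINS = 3
PI = list(range(BINS))             # pi[i] = i
PHI = [-j for j in range(BINS)]    # phi[j] = -j, so pi[i] + phi[j] = i - j <= |i - j|
DB = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]}

if __name__ == "__main__":
    q = [1, 0, 0]
    print(is_feasible(PI, PHI, lambda i, j: abs(i - j)))
    print(range_filter(DB, q, theta=1, pi=PI, phi=PHI))
```

Since each histogram is reduced to one number, the keys can be stored in a one-dimensional index such as a B+ tree and the filter becomes a range scan; surviving candidates still need the exact EMD in the refine step.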
Indexing the probabilistic data based on EMD
• 1. Mapping Construction
–Key and counter key
Key Counter key
–Assuming p is a histogram in DB, given a feasible solution
(π, Ф), we calculate the Key for each record in DB
–We can index those keys using B+ tree
–For each feasible solution (π, Ф), a B+ tree can be
constructed
Answering Range Query
• Range query based on B+ index
–Given any feasible solution (π, Ф) , we construct a B+ tree
using keys of histograms
–Given a query histogram, we calculate its counter key using
the operation of
–Given a similarity search threshold θ, we have proved that
all candidate histogram’s key can be bounded by
–To further filter the candidates, we use L B+ trees and
intersect their candidate results
Answering KNN Query
• K-NN query based on B+ index
–Given a query q, we issue search on
each B+ tree Tl with key(q, Фl)
–We create two cursors for each tree and
let them fetch records in opposite
directions (one left and one right)
–Whenever a record r has been
accessed in all B+ trees, it can be output
as a candidate for the k-NN query
Experimental Setup
• 3 real data sets
–RETINA1
• an image data set consisting of 3932 feline retina scans labeled
with various antibodies.
–IRMA
• contains 10000 radiography images from the Image Retrieval
in Medical Application (IRMA) project
–DBLP
• With parameter setting
Experimental Results on
Query CPU Time
Experimental Results on
Scalability
[Figure: scalability plot with two series labeled “sigmod” and “our”]
Conclusions
• We present a new indexing scheme for the general
purposes of similarity search on Earth Mover's Distance
• Our index method relies on the primal-dual theory to
construct mapping functions from the original
probabilistic space to one-dimensional domain
• Our B+ tree-based index framework has
–High scalability
–High efficiency
–Ability to handle high dimensional data
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed Typed Data
–Similarity Function on k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric
Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
A Microarray Dataset
Table layout: 100-500 rows (samples), 1000 - 100,000 columns (genes)
            Class     Gene1  Gene2  Gene3  Gene4  Gene5  Gene6  ...
Sample1     Cancer
Sample2     Cancer
.
.
.
SampleN-1   ~Cancer
SampleN     ~Cancer
• Find closed patterns which occur frequently among genes.
• Find rules which associate certain combinations of
columns with the class of the rows
–Gene1,Gene10,Gene1001 -> Cancer
Challenge I
• Large number of patterns/rules
–number of possible column combinations is extremely high
• Solution: Concept of a closed pattern
–Patterns found in exactly the same set of rows are grouped together
and represented by their upper bound
• Example: the following patterns are found in rows 2, 3 and 4
upper bound (closed pattern):  aeh
                               ae   ah   eh
lower bounds:                  e    h
("a", however, is not part of the group)

i   ri                 Class
1   a,b,c,l,o,s        C
2   a,d,e,h,p,l,r      C
3   a,c,e,h,o,q,t      C
4   a,e,f,h,p,r        ~C
5   b,d,f,g,l,q,s,t    ~C
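The grouping in the example can be checked with a short sketch. It assumes the five-row example table from the slide; the helper names `supporting_rows` and `closure` are illustrative only. A pattern's closure is the set of items shared by all rows that contain the pattern, which is the upper bound of its group.

```python
# Example table from the slide: row id -> set of items.
TABLE = {
    1: set("abclos"),
    2: set("adehplr"),
    3: set("acehoqt"),
    4: set("aefhpr"),
    5: set("bdfglqst"),
}

def supporting_rows(pattern, table=TABLE):
    """Rows that contain every item of the pattern."""
    return frozenset(r for r, items in table.items() if set(pattern) <= items)

def closure(pattern, table=TABLE):
    """Upper bound (closed pattern): items common to all supporting rows."""
    rows = supporting_rows(pattern, table)
    if not rows:
        return frozenset(pattern)
    return frozenset(set.intersection(*(table[r] for r in rows)))

for p in ("e", "h", "ae", "ah", "eh", "aeh", "a"):
    print(p, sorted(supporting_rows(p)), "".join(sorted(closure(p))))
```

The patterns e, h, ae, ah, eh and aeh all map to rows {2, 3, 4} and closure aeh, while a is supported by rows {1, 2, 3, 4} and therefore falls outside the group.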
Challenge II
• Most existing frequent pattern discovery algorithms perform
searches in the column/item enumeration space, i.e. systematically
testing various combinations of columns/items
• For datasets with 1000-100,000 columns, this search space is
enormous
• Instead we adopt a novel row/sample enumeration algorithm for
this purpose. CARPENTER (SIGKDD’03) is the FIRST
algorithm which adopts this approach
Column/Item Enumeration Lattice
• Each node in the lattice represents
a combination of columns/items
• An edge exists from node A to B if
A is a subset of B and A differs from
B by only 1 column/item
• Search can be done
breadth first

Lattice over the items a, b, c, e (top to bottom):
a,b,c,e
a,b,c   a,b,e   a,c,e   b,c,e
a,b   a,c   a,e   b,c   b,e   c,e
a   b   c   e
start {}

i   ri                 Class
1   a,b,c,l,o,s        C
2   a,d,e,h,p,l,r      C
3   a,c,e,h,o,q,t      C
4   a,e,f,h,p,r        ~C
5   b,d,f,g,l,q,s,t    ~C
Column/Item Enumeration Lattice
• Each node in the lattice represents
a combination of columns/items
• An edge exists from node A to B if
A is a subset of B and A differs from
B by only 1 column/item
• Search can be done depth first
• Keep edges from parent to child
only if child is the prefix of parent
[Figure: the lattice over the items a, b, c, e and the five-row example table, traversed depth first from start {}]
General Framework for Column/Item Enumeration
                             | Read-based            | Write-based                                | Point-based
Association Mining           | Apriori[AgSr94], DIC  | Eclat, MaxClique[Zaki01], FPGrowth[HaPe00] | Hmine
Sequential Pattern Discovery | GSP[AgSr96]           | SPADE[Zaki98,Zaki01], PrefixSpan[PHPC01]   |
Iceberg Cube                 | Apriori[AgSr94]       | BUC[BeRa99], H-Cubing[HPDW01]              |
A Multidimensional View
[Figure: three-axis diagram]
• Types of data or knowledge: associative pattern, sequential pattern, iceberg cube, others
• Lattice traversal/main operations: read, write, point
• Other dimensions: other interest measure, constraints, pruning method, compression method, closed/max pattern
Sample/Row Enumeration Algorithms
• To avoid searching the large column/item enumeration space, our
mining algorithms search for patterns/rules in the sample/row
enumeration space
• Our algorithms do not fit into the column/item enumeration
framework
• They are not YAARMA (Yet Another Association Rules Mining
Algorithm)
• Column/item enumeration algorithms simply do not scale for
microarray datasets
Existing Row/Sample Enumeration Algorithms
• CARPENTER(SIGKDD'03)
–Find closed patterns using row enumeration
• FARMER(SIGMOD’04)
–Find interesting rule groups and building classifiers based on them
• COBBLER(SSDBM'04)
–Combined row and column enumeration for tables with large
number of rows and columns
• Topk-IRG(SIGMOD’05)
–Find top-k covering rules for each sample and build classifier
directly
• Efficiently Finding Lower Bound Rules(TKDE’2010)
–Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao.
What is Unequal among the Equals? Ranking Equivalent Rules from
Gene Expression Data. Accepted in TKDE
Concepts of CARPENTER
Example Table
i   ri                 Class
1   a,b,c,l,o,s        C
2   a,d,e,h,p,l,r      C
3   a,c,e,h,o,q,t      C
4   a,e,f,h,p,r        ~C
5   b,d,f,g,l,q,s,t    ~C

Transposed Table, TT
ij   R(ij): C   R(ij): ~C
a    1,2,3      4
b    1          5
c    1,3        -
d    2          5
e    2,3        4
f    -          4,5
g    -          5
h    2,3        4
l    1,2        5
o    1,3        -
p    2          4
q    3          5
r    2          4
s    1          5
t    3          5

TT|{2,3}
ij   C       ~C
a    1,2,3   4
e    2,3     4
h    2,3     4
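As an illustration of how the transposed table and a conditional table such as TT|{2,3} relate to the example table, here is a small sketch. The function names `transpose` and `conditional` and the dictionary layout are assumptions made for this example, not part of CARPENTER itself.

```python
TABLE = {
    1: set("abclos"),
    2: set("adehplr"),
    3: set("acehoqt"),
    4: set("aefhpr"),
    5: set("bdfglqst"),
}
CLASSES = {1: "C", 2: "C", 3: "C", 4: "~C", 5: "~C"}

def transpose(table, classes):
    """Build TT: item -> rows containing it, split by class."""
    tt = {}
    for row, items in sorted(table.items()):
        for item in items:
            tt.setdefault(item, {"C": [], "~C": []})[classes[row]].append(row)
    return tt

def conditional(tt, rows):
    """TT|X: keep only the items (tuples of TT) that occur in every row of X."""
    rows = set(rows)
    return {item: cols for item, cols in tt.items()
            if rows <= set(cols["C"]) | set(cols["~C"])}

tt = transpose(TABLE, CLASSES)
print(tt["a"])                          # rows containing item a, by class
print(sorted(conditional(tt, {2, 3})))  # items of TT|{2,3}
```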
Row Enumeration
Row enumeration tree (each node: set of rows {items common to those rows}):
1 {abclos}
  12 {al}
    123 {a}
      1234 {a}
        12345 {}
      1235 {}
    124 {a}
      1245 {}
    125 {l}
  13 {aco}
    134 {a}
      1345 {}
    135 {}
  14 {a}
    145 {}
  15 {bls}
2 {adehplr}
  23 {aeh}
    234 {aeh}
      2345 {}
    235 {}
  24 {aehpr}
    245 {}
  25 {dl}
3 {acehoqt}
  34 {aeh}
    345 {}
  35 {qt}
4 {aefhpr}
  45 {f}
5 {bdfglqst}

Conditional transposed tables along the path 1 -> 12 -> 123:
TT|{1}
ij   C       ~C
a    1,2,3   4
b    1       5
c    1,3     -
l    1,2     5
o    1,3     -
s    1       5

TT|{12}
ij   C       ~C
a    1,2,3   4
l    1,2     5

TT|{123}
ij   C       ~C
a    1,2,3   4
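A naive version of row enumeration, without any of the pruning methods, can be sketched as below. It is only meant to show that intersecting rows along the enumeration tree yields the closed patterns; the function name `row_enumeration` is hypothetical and the real CARPENTER algorithm works on conditional transposed tables with pruning.

```python
TABLE = {
    1: set("abclos"),
    2: set("adehplr"),
    3: set("acehoqt"),
    4: set("aefhpr"),
    5: set("bdfglqst"),
}

def row_enumeration(table):
    """Depth-first enumeration of row sets; returns closed pattern -> supporting rows."""
    rows = sorted(table)
    closed = {}

    def visit(items, start):
        for idx in range(start, len(rows)):
            r = rows[idx]
            # Items common to the current row set extended by row r.
            common = table[r] if items is None else items & table[r]
            if not common:
                continue  # empty pattern: nothing to find below this node
            # Full support of the pattern (may include rows outside the node).
            support = frozenset(x for x in rows if common <= table[x])
            closed[frozenset(common)] = support
            visit(common, idx + 1)

    visit(None, 0)
    return closed

closed = row_enumeration(TABLE)
for pattern, support in sorted(closed.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(pattern)), sorted(support))
```

On the example table this finds 15 distinct non-empty closed patterns, including aeh with rows {2, 3, 4}.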
Pruning Method 1
• Removing rows that appear in all tuples
of the transposed table will not affect the results
TT|{2,3}
ij   C       ~C
a    1,2,3   4
e    2,3     4
h    2,3     4
r2 r3 {aeh}  ->  r2 r3 r4 {aeh}
r4 has 100% support in the conditional table of
“r2r3”, therefore branch “r2 r3r4” will be
pruned.
Pruning method 2
• If a rule has been discovered before, we can prune the
enumeration below this node
–Because all rules below this node have been
discovered before
–For example, at node 34, if we find that {aeh} has
already been found, we can prune off all branches below it
TT|{3,4}
ij   C       ~C
a    1,2,3   4
e    2,3     4
h    2,3     4
[Figure: row enumeration tree with the branches below node 34 pruned]
Pruning Method 3: Minimum Support
• Example: From TT|{1}, we can see
that the support of all possible
patterns below node {1} will be at
most 5 rows.
TT|{1}
ij   C       ~C
a    1,2,3   4
b    1       5
c    1,3     -
l    1,2     5
o    1,3     -
s    1       5
From CARPENTER to FARMER
• What if classes exist? What more can we
do?
• Pruning with Interestingness Measure
–Minimum confidence
–Minimum chi-square
• Generate lower bounds for classification/
prediction
Interesting Rule Groups
• Concept of a rule group/equivalence class
–rules supported by exactly the same set of rows are grouped together
• Example: the following rules are derived from rows 2, 3 and 4 with
66% confidence
upper bound:   aeh-->C (66%)
               ae-->C (66%)   ah-->C (66%)   eh-->C (66%)
lower bounds:  e-->C (66%)    h-->C (66%)
(a-->C, however, is not in the group)

i   ri                 Class
1   a,b,c,l,o,s        C
2   a,d,e,h,p,l,r      C
3   a,c,e,h,o,q,t      C
4   a,e,f,h,p,r        ~C
5   b,d,f,g,l,q,s,t    ~C
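The 66% figure can be reproduced with a few lines. This sketch assumes the example table and class labels from the slide; `rule_stats` is an illustrative helper, and it assumes the antecedent is supported by at least one row.

```python
TABLE = {
    1: set("abclos"),
    2: set("adehplr"),
    3: set("acehoqt"),
    4: set("aefhpr"),
    5: set("bdfglqst"),
}
CLASSES = {1: "C", 2: "C", 3: "C", 4: "~C", 5: "~C"}

def rule_stats(antecedent, target="C"):
    """Supporting rows and confidence of the rule antecedent --> target."""
    rows = sorted(r for r, items in TABLE.items() if set(antecedent) <= items)
    hits = sum(1 for r in rows if CLASSES[r] == target)
    return rows, hits / len(rows)

for lhs in ("aeh", "ae", "ah", "eh", "e", "h", "a"):
    rows, conf = rule_stats(lhs)
    print(f"{lhs} --> C  rows={rows}  confidence={conf:.1%}")
```

All six rules of the group share rows 2, 3, 4 and a confidence of 2/3, while a --> C is supported by rows 1 to 4 and so belongs to a different group.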
Pruning by Interestingness Measure
• In addition, find only interesting rule groups (IRGs) based
on some measures:
–minconf: the rules in the rule group can predict the class on
the RHS with high confidence
–minchi: there is high correlation between LHS and RHS of
the rules based on chi-square test
• Other measures like lift, entropy gain, conviction etc. can
be handled similarly
Ordering of Rows: All Class C before ~C
[Figure: the row enumeration tree with the transposed table TT and the conditional tables TT|{1}, TT|{12} and TT|{123}; rows 1, 2, 3 (class C) are ordered before rows 4, 5 (class ~C)]
Pruning Method: Minimum Confidence
• Example: In TT|{2,3} below,
the maximum confidence of all rules
below node {2,3} is at most 4/5
TT|{2,3}
ij   C         ~C
a    1,2,3,6   4,5
e    2,3,7     4,9
h    2,3       4
Pruning method: Minimum chi-square
• Same as in computing maximum confidence
TT|{2,3}
ij   C         ~C
a    1,2,3,6   4,5
e    2,3,7     4,9
h    2,3       4

Contingency table
        C          ~C         Total
A       max=5      min=1      Computed
~A      Computed   Computed   Computed
Total   Constant   Constant   Constant
Finding Lower Bound, MineLB
–Example: An upper bound
rule with antecedent A=abcde
and two rows (r1 : abcf ) and
(r2 : cdeg)
–Initialize lower bounds {a, b, c, d, e}
–Add “abcf”: new lower bounds {d, e}
   Candidate lower bounds: ad, ae, bd, be, cd, ce
   Removed since d, e are still lower bounds
–Add “cdeg”: new lower bounds {ad, bd, ae, be}
   Candidate lower bounds: ad, ae, bd, be
   Kept since no lower bound overrides them
[Figure: lattice from the single items a, b, c, d, e up to a,b,c,d,e, with ad, ae, bd, be, abc and cde marked]
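The incremental update in the example can be sketched as follows. This is one plausible reading of the slide, not the published MineLB code: `mine_lb` is a hypothetical name, a lower bound is treated as "broken" when it is contained in the newly added row, and broken bounds are extended by one item of A that the row lacks.

```python
def mine_lb(antecedent, rows):
    """Incrementally maintain the lower bounds of an upper bound rule.

    antecedent: items of the upper bound (A)
    rows:       rows outside the rule group, added one at a time (assumed input)
    """
    A = frozenset(antecedent)
    bounds = [frozenset([x]) for x in sorted(A)]
    for row in rows:
        r = frozenset(row) & A
        kept = [lb for lb in bounds if not lb <= r]      # still valid
        broken = [lb for lb in bounds if lb <= r]        # contained in the new row
        outside = A - r
        # Extend each broken bound by one item that the new row does not have.
        candidates = {lb | {x} for lb in broken for x in outside}
        # Drop candidates that contain a surviving lower bound.
        new = [c for c in candidates if not any(k <= c for k in kept)]
        # Keep only minimal candidates.
        new = [c for c in new if not any(o < c for o in new)]
        bounds = kept + new
    return {"".join(sorted(lb)) for lb in bounds}

print(sorted(mine_lb("abcde", ["abcf"])))
print(sorted(mine_lb("abcde", ["abcf", "cdeg"])))
```

After adding abcf the bounds shrink to d and e; after adding cdeg they become ad, ae, bd and be, matching the two steps on the slide.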
Implementation
• In general, CARPENTER and FARMER can be
implemented in many ways:
–FP-tree
–Vertical format
• In our case, we assume the dataset fits
into main memory and use a
pointer-based algorithm similar to BUC
[Figure: the transposed table TT of the example table]
Experimental studies
• Efficiency of FARMER
–On five real-life datasets
• lung cancer (LC), breast cancer (BC), prostate cancer (PC), ALL-AML
leukemia (ALL), Colon Tumor (CT)
–Varying minsup, minconf, minchi
–Benchmark against
• CHARM [ZaHs02] ICDM'02
• Bayardo’s algorithm (ColumnE) [BaAg99] SIGKDD'99
• Usefulness of IRGs
–Classification
Example results--Prostate
[Plot: FARMER, ColumnE and CHARM versus minimum support (3 to 9); logarithmic y-axis from 1 to 100000]
Example results--Prostate
[Plot: FARMER:minsup=1:minchi=10 and FARMER:minsup=1 versus minimum confidence (%) (0, 50, 70, 80, 85, 90, 99); y-axis from 0 to 1200]
Top k Covering Rule Groups
• Rank rule groups (upper bound) according to
– Confidence
– Support
• Top k Covering Rule Groups for row r
– the k highest-ranking rule groups that have row r as support and whose support
> minimum support
• Top k Covering Rule Groups =
TopKRGS for each row
Usefulness of Rule Groups
• Rules for every row
• Top-1 covering rule groups sufficient to build CBA classifier
• No min confidence threshold, only min support
• #TopKRGS = k x #rows
Top-k covering rule groups
• For each row, we find the most
significant k rule groups:
–based on confidence first
–then support

row   class   Items
1     C1      a,b,c
2     C1      a,b,c,d
3     C1      c,d,e
4     C2      c,d,e

• Given minsup=1, Top-1
–row 1: abc -> C1 (sup = 2, conf = 100%)
–row 2: abc -> C1
• abcd -> C1 (sup = 1, conf = 100%)
–row 3: cd -> C1 (sup = 2, conf = 66.7%)
• If minconf = 80%, ?
–row 4: cde -> C2 (sup = 1, conf = 50%)
Main advantages of Top-k coverage rule group
• The number of rule groups is bounded by the product of k and the number
of samples
• Treats each sample equally and provides a complete description
for each row (small)
• The minimum confidence parameter is replaced by k
• Sufficient to build classifiers while avoiding excessive
computation
Top-k pruning
• At node X, the maximal set of rows covered by the rules still to
be discovered below X consists of the rows containing X and the rows
ordered after X.
– minconf: the minimum confidence of the discovered TopkRGs over all rows in the above
set
– minsup: the corresponding minimum support
• Pruning
–If the estimated upper bound of confidence down X
For i > 0, j > 0:
$$
V(i,j) = \max
\begin{cases}
V(i-1,j-1) + \delta(S[i],T[j]) & \text{match/mismatch} \\
V(i-1,j) + \delta(S[i],\_) & \text{delete} \\
V(i,j-1) + \delta(\_,T[j]) & \text{insert}
\end{cases}
$$
In the alignment, the last pair must be either
match/mismatch, delete, insert.
xxx…xx xxx…xx xxx…x_
| | |
xxx…yy yyy…y_ yyy…yy
match/mismatch delete insert
Example (I)
_ A G C A T G C
_ 0 -1 -2 -3 -4 -5 -6 -7
A -1
C -2
A -3
A -4
T -5
C -6
C -7
Example (II)
_ A G C A T G C
_ 0 -1 -2 -3 -4 -5 -6 -7
A -1 2 1 0 -1 -2 -3 -4
C -2 1 1 ?
3 2
A -3
A -4
T -5
C -6
C -7
Example (III)
_ A G C A T G C
_ 0 -1 -2 -3 -4 -5 -6 -7
A -1 2 1 0 -1 -2 -3 -4
C -2 1 1 3 2 1 0 -1
A -3 0 0 2 5 4 3 2
A -4 -1 -1 1 4 4 3 2
T -5 -2 -2 0 3 6 5 4
C -6 -3 -3 0 2 5 5 7
C -7 -4 -4 -1 1 4 4 7
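The table in Example (III) can be reproduced with a direct implementation of the recurrence. The scoring values used here (match = 2, mismatch = -1, insert/delete = -1) are not stated on the slide; they are an assumption inferred from the numbers in the table, and `global_alignment` is an illustrative name.

```python
def global_alignment(s, t, match=2, mismatch=-1, gap=-1):
    """Fill the table V for global alignment of s (rows) and t (columns)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: aligning a prefix with the empty string costs one gap per symbol.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + gap
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + delta,  # match/mismatch
                          V[i - 1][j] + gap,        # delete
                          V[i][j - 1] + gap)        # insert
    return V

V = global_alignment("ACAATCC", "AGCATGC")
for row in V:
    print(" ".join(f"{v:3d}" for v in row))
```

Under these assumed scores the bottom-right cell, which is the optimal alignment score, agrees with the value shown in the example table.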
“q-grams” of strings
universal
2-grams
q-gram inverted lists
Strings
id   string
0    rich
1    stick
2    stich
3    stuck
4    static

2-gram inverted lists
at: 4
ch: 0 2
ck: 1 3
ic: 0 1 2 4
ri: 0
st: 1 2 3 4
ta: 4
ti: 1 2 4
tu: 3
uc: 3
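The inverted lists above can be built with a short sketch. The helper names `qgrams` and `build_inverted_lists` are illustrative; each list stores the ids of the strings that contain the gram, as on the slide.

```python
def qgrams(s, q):
    """All overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_inverted_lists(strings, q):
    """Map each q-gram to the sorted list of string ids that contain it."""
    index = {}
    for sid, s in enumerate(strings):
        for gram in sorted(set(qgrams(s, q))):
            index.setdefault(gram, []).append(sid)
    return index

STRINGS = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_lists(STRINGS, 2)
for gram in sorted(index):
    print(gram, index[gram])
```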
Searching using inverted lists
Query: “shtick”, ED(shtick, ?)≤1
Query 2-grams: sh ht ti ic ck
# of common grams >= 3

Strings
id   string
0    rich
1    stick
2    stich
3    stuck
4    static

2-gram inverted lists
at: 4
ch: 0 2
ck: 1 3
ic: 0 1 2 4
ri: 0
st: 1 2 3 4
ta: 4
ti: 1 2 4
tu: 3
uc: 3
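The count filter used in this query can be sketched as follows. The lower bound on common grams is taken as (|s| - q + 1) - k * q, which gives 3 for 2-grams and 1 for 3-grams with k = 1; the function name `count_filter` is hypothetical, and the surviving candidates would still need an exact edit distance check.

```python
from collections import Counter

def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def count_filter(query, strings, q, k):
    """Ids of strings sharing enough q-grams with the query to be within edit distance k."""
    # Build the inverted lists (gram -> ids of strings containing it).
    index = {}
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index.setdefault(gram, []).append(sid)
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        return list(range(len(strings)))  # the filter cannot prune anything
    counts = Counter()
    for gram in qgrams(query, q):
        for sid in index.get(gram, []):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= threshold)

STRINGS = ["rich", "stick", "stich", "stuck", "static"]
print(count_filter("shtick", STRINGS, q=2, k=1))  # candidates with 2-grams
print(count_filter("shtick", STRINGS, q=3, k=1))  # candidates with 3-grams
```

With 2-grams only "stick" survives, while with 3-grams the weaker bound of 1 lets "stick", "stich" and "static" through, which illustrates the trade-off in choosing q.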
2-grams -> 3-grams?
Query: “shtick”, ED(shtick, ?)≤1
Query 3-grams: sht hti tic ick
# of common grams >= 1

Strings
id   string
0    rich
1    stick
2    stich
3    stuck
4    static

3-gram inverted lists
ati: 4
ich: 0 2
ick: 1
ric: 0
sta: 4
sti: 1 2
stu: 3
tat: 4
tic: 1 2 4
tuc: 3
uck: 3
Observation 1: dilemma of choosing “q”
Increasing “q” causes:
Longer grams -> shorter lists
Smaller # of common grams of similar strings
Observation 2: skew distributions of gram
frequencies
DBLP: 276,699 article titles
Popular 5-grams: ation (>114K times), tions, ystem, catio
VGRAM: Main idea
Grams with variable lengths (between qmin and qmax)
zebra
ze(123)
corrasion
co(5213), cor(859), corr(171)
Advantages
Reduce index size ☺
Reducing running time ☺
Adoptable by many algorithms ☺
Challenges
Generating variable-length grams?
Constructing a high-quality gram dictionary?
Relationship between string similarity and their
gram-set similarity?
Adopting VGRAM in existing algorithms?
Challenge 1: String -> Variable-length grams?
Fixed-length 2-grams
universal
Variable-length grams
[2,4]-gram dictionary
universal ni
ivr
sal
uni
vers
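One plausible way to turn a string into variable-length grams with this dictionary is sketched below. It is an assumption about the procedure, not taken from the slide: at each position the longest dictionary gram is preferred, a q_min-gram is used when no dictionary gram matches, and a gram is dropped when it lies entirely inside a gram already produced. The name `vgram_decompose` is hypothetical.

```python
def vgram_decompose(s, dictionary, qmin, qmax):
    """Greedy decomposition of s into variable-length grams (sketch)."""
    grams = []
    covered_end = 0  # end position (exclusive) of the furthest gram emitted so far
    for p in range(len(s) - qmin + 1):
        gram = None
        # Prefer the longest dictionary gram starting at position p.
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        if gram is None:
            gram = s[p:p + qmin]  # fall back to a plain q_min-gram
        end = p + len(gram)
        # Skip grams that are positionally contained in an earlier gram.
        if end > covered_end:
            grams.append(gram)
            covered_end = end
    return grams

DICTIONARY = {"ni", "ivr", "sal", "uni", "vers"}
print(vgram_decompose("universal", DICTIONARY, qmin=2, qmax=4))
```

Compared with the eight fixed-length 2-grams of "universal", this decomposition produces far fewer grams, which is the source of the smaller index.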
Representing gram dictionary as
a trie
ni
ivr
sal
uni
vers
Challenge 2: Constructing gram
dictionary
Step 1: Collecting frequencies of grams with length in [qmin,
qmax]
st 0, 1, 3
sti 0, 1
stu 3
stic 0, 1
stuc 3
Gram trie with frequencies
Step 2: selecting grams
Pruning trie using a frequency threshold T (e.g., 2)
Step 2: selecting grams (cont)
Threshold T = 2
Final gram dictionary
[2,4]-grams
Challenge 3: Edit operation’s effect on grams
Fixed length: q
universal
k operations could affect k * q grams
Deletion affects variable-length grams
Not affected Affected Not affected
i-qmax+1 i i+qmax- 1
Deletion
Grams affected by a deletion
Affected?
i-qmax+1 i i+qmax- 1
Deletion
[2,4]-grams
Deletion ni
ivr
universal
sal
uni
Affected? vers
Grams affected by a deletion (cont)
Affected?
i-qmax+1 i i+qmax- 1
Deletion
Trie of grams
Trie of reversed grams
# of grams affected by each operation
Deletion/substitution Insertion
0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0
_u_n_i_v_e_r_s_a_l_
Max # of grams affected by k operations
Vector of s =
With 2 edit operations, at most 4 grams can be affected
Called NAG vector (# of affected grams)
Precomputed and stored
Summary of VGRAM index
Challenge 4: adopting VGRAM
Easily adoptable by many algorithms
Basic interfaces:
String s -> grams
String s1, s2 such that ed(s1,s2) ≤ k: # of common grams ≥
Fixed length (q): (|s1| − q + 1) − k * q
Variable lengths: # of grams of s1 − NAG(s1,k)
Example: algorithm using inverted lists
Query: “shtick”, ED(shtick, ?)≤1
Query grams with the [2,4]-gram dictionary: sh ht tick

2-grams (lower bound = 3)
…
ck: 1 3
ic: 0 1 2 4
…
ti: 1 2 4
…

2-4 grams (lower bound = 1)
…
ck: 1 3
ic: 1 4
ich: 0 2
…
tic: 2 4
tick: 1
…

Strings
id   string
0    rich
1    stick
2    stich
3    stuck
4    static
PartEnum + VGRAM
PartEnum, fixed q-grams:
ed(s1,s2) b b->λ λ->b
a a
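$s_i(e_{i1}, e_{i2}, \dots, e_{ik}): T_1 \to T_2$; $\mathrm{cost}(s_i) = \sum_j \gamma(e_{ij})$
$\mathrm{EDist}(T_1,T_2) = \min_i \mathrm{cost}(s_i)$; with unit cost: $\mathrm{EDist}(T_1,T_2) = \min(k)$
Computational Complexity:
$O(|T_1| \times |T_2| \times \min(\mathrm{depth}(T_1), \mathrm{leaves}(T_1)) \times \min(\mathrm{depth}(T_2), \mathrm{leaves}(T_2)))$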
Edit Operation Mapping
Edit operations mapping
One-to-one
Preserve sibling order
Preserve ancestor order
[Figure: mapping M(T1,T2) between the nodes of trees T1 and T2]
Observation
Edit operations do not change many sibling
relationships
[Figure: deleting node c (c->λ) from a tree rooted at a]
Sibling relation:
(b,c)->(b,f)
(c,d)->(i,d)
Node: varying number of children vs. at most 2 siblings
Binary Tree Representation
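• Left-child, right-sibling
• Normalized binary tree (ε denotes an empty child)
[Figure: tree T1 = a(b(c,d), b(c,d), e) and its normalized binary tree with nodes a(1,8), b(2,3), c(3,1), d(4,2), b(5,6), c(6,4), d(7,5), e(8,7), padded with ε nodes]

Binary branch vectors (columns are binary branches; … marks omitted dimensions)
node          a    b    b    b    b    c    d    d    d    e
left child    b  … c  … c  … c  … e  … ε  … ε  … ε  … ε  … ε
right child   ε    b    c    e    ε    d    b    e    ε    ε
T1            1  … 1  … 0  … 1  … 0  … 2  … 0  … 0  … 2  … 1
T2            1  … 0  … 1  … 0  … 1  … 2  … 0  … 1  … 0  … 1

$$\mathrm{BBDist}(T_1, T_2) = \sum_{i=1}^{|\Gamma|} |b_{1i} - b_{2i}| = 8$$
Property: triangular inequality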
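The binary branch vector and BBDist can be sketched as below. Trees are written as (label, children) tuples, which is an encoding chosen for this example; T1 follows the slide, while the second tree is a made-up variant of T1 with the leaf e relabeled to f, not the slide's T2.

```python
from collections import Counter

EPS = "ε"  # empty child in the normalized binary tree

def binary_branches(tree):
    """Count binary branches (label, first child, next sibling) of an ordered tree."""
    counts = Counter()

    def visit(node, next_sibling):
        label, children = node
        left = children[0][0] if children else EPS               # left child = first child
        right = next_sibling[0] if next_sibling is not None else EPS  # right child = next sibling
        counts[(label, left, right)] += 1
        for i, child in enumerate(children):
            sibling = children[i + 1] if i + 1 < len(children) else None
            visit(child, sibling)

    visit(tree, None)
    return counts

def bbdist(t1, t2):
    """L1 distance between the binary branch vectors of two trees."""
    c1, c2 = binary_branches(t1), binary_branches(t2)
    return sum(abs(c1[b] - c2[b]) for b in set(c1) | set(c2))

leaf = lambda x: (x, [])
T1 = ("a", [("b", [leaf("c"), leaf("d")]), ("b", [leaf("c"), leaf("d")]), leaf("e")])
T1_RELABELED = ("a", [("b", [leaf("c"), leaf("d")]), ("b", [leaf("c"), leaf("d")]), leaf("f")])

print(binary_branches(T1))
print(bbdist(T1, T1_RELABELED))
```

Relabeling one leaf changes two binary branches (the node itself and its left sibling's branch), so the distance is 4 here.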
One Edit Operation Effect
[Figure: inserting or deleting node v under v’ with children w1 … wl+m+1, shown on the original trees and on their binary tree representations]
Each node appears in at most two binary branches
Theorem
1 insertion/deletion incurs a difference of at most 5 on BBDist
1 relabeling incurs a difference of at most 4 on BBDist
T, T’, EDIST(T, T’) = k = ki + kd + kr ,
BDist(T,T’) d, Gi can be safely filtered;
•if τ(Q, Gi) ≤ d, Gi can be reported as a result directly;
•if ρ(Q, Gi) ≤ d, Gi can be reported as a result directly;
•otherwise, λ(Q, Gi) must be computed.
Subgraph exact Search
• Lemma
•Given two graphs G1 and G2 , if no vertex relabelling is
allowed in the edit operations, μ’(G1, G2) ≤ 4 · λ’(G1, G2),
where μ’ and λ’ are computed without vertex relabelling.
•(This lemma can be used in subgraph search, because if a
graph is subgraph-isomorphic to another graph, no vertex
relabelling happens.)
• AppSUB algorithm:
•Filtering based on the lower bound .
Experimental Results
• Compare with the exact algorithm
1,000 graphs were generated, D = 1k,T = 10,V = 4.
Randomly select 10 seed graphs to form D; a seed has 10 vertices.
6 query groups. Each group has 10 graphs. Graphs in the same
group have the same number of vertices.
Experimental Results
• Compare with the BLP method
Experimental Results
• Scalability over real datasets
Experimental Results
• Scalability over synthetic datasets
Experimental Results
• Performance of AppFULL
Experimental Results
• Performance of AppSUB
Outline
• Introduction
• Foundation
• State of the Art on Graph Matching
•Exact Graph Matching
•Error-Tolerant Graph Matching
• Search Graph Databases
•Graph Indexing Methods
• Our Works
•Star Decomposition
•Sorted Index For Graph Similarity Search
SEGOS: SEarch similar
Graphs On Star index
Xiaoli Wang
Xiaofeng Ding
Anthony K.H. Tung
Shanshan Ying
Hai Jin
Our Solutions
• Work 1: Scalability issue
•A full database scan
•An index mechanism is needed
• Existing indexing methods: Filtering power
•Rough bounds with poor filtering power
• Work 2: Sorted index for graph similarity search
•Propose a novel indexing and query processing framework
•Deploy a filtering strategy based on TA and CA methods
•All existing lower and upper GED bounds can be directly
integrated into our filtering framework
TA Method on the Top-k Query
• The database model used in TA
Object table (N objects, M attributes)
Object ID   A1     A2
a           0.9    0.85
b           0.8    0.7
c           0.72   0.2
d           0.6    0.9
.           .      .

Sorted L1    Sorted L2
(a, 0.9)     (d, 0.9)
(b, 0.8)     (a, 0.85)
(c, 0.72)    (b, 0.7)
.            .
(d, 0.6)     (c, 0.2)
TA method on the top-k query
• A simple query
•Find the top-2 objects on the ‘query’ of ‘A1&A2 ’
•This query results in the TA method combining the scores of
A1 and A2 by an aggregation function like
sum(A1,A2)
Aggregation function:
function that gives objects an overall score based on attribute
scores
examples: sum, min functions
Monotonicity!
Monotony on TA (Halting Condition)
• Main idea
•How do we know that the scores of seen objects are higher than
the scores of unseen objects?
•Predict the maximum possible score of unseen objects:
                  L1         L2
Seen              a: 0.9     d: 0.9
                  b: 0.8     a: 0.85
                  c: 0.72    b: 0.7     ω = sum(0.72, 0.7) = 1.42 (threshold value)
Possibly unseen   .          .
                  f: 0.65    f: 0.6
                  .          .
                  d: 0.6     c: 0.2
A Top-2 Query Example
• Given 2 sorted lists for attributes A1 and A2,
L1           L2
(a, 0.9)     (d, 0.9)
(b, 0.8)     (a, 0.85)
(c, 0.72)    (b, 0.7)
.            .
(d, 0.6)     (c, 0.2)

Buffer (initially empty)
ID   A1   A2   sum(A1,A2)
A Top-2 Query Example
• Step 1
•Parallel sorted access to the next entry of every sorted list: (a, 0.9) from L1 and (d, 0.9) from L2
•For each object seen:
• get all scores by random access
• determine sum(A1,A2)
• amongst 2 highest seen? keep in buffer

Buffer after step 1
ID   A1    A2     sum(A1,A2)
a    0.9   0.85   1.75
d    0.6   0.9    1.5
A Top-2 Query Example
• Step 2
•Determine threshold value based on objects currently seen
under sorted access. ω = sum(L1, L2)
•2 objects with overall score ≥ threshold value ω? Stop
•else go to next entry position in sorted list and go to step 1

Buffer
ID   A1    A2     sum(A1,A2)
a    0.9   0.85   1.75
d    0.6   0.9    1.5
ω = sum(0.9, 0.9) = 1.8
No buffered object reaches ω = 1.8, so the scan continues.
A Top-2 Query Example
• Step 1 (Again)
•Sorted access to the next entry of every sorted list: (b, 0.8) from L1 and (a, 0.85) from L2
•For each object seen:
• get all scores by random access
• determine sum(A1,A2)
• amongst 2 highest seen? keep in buffer

Newly seen object
ID   A1    A2    sum(A1,A2)
b    0.8   0.7   1.5

Buffer (2 highest seen)
ID   A1    A2     sum(A1,A2)
a    0.9   0.85   1.75
d    0.6   0.9    1.5
A Top-2 Query Example
• Step 2 (Again)
•Determine threshold value based on objects currently seen
under sorted access. ω = sum(L1, L2)
•2 objects with overall score ≥ threshold value ω? Stop
•else go to next entry position in sorted list and go to step 1

Buffer
ID   A1    A2     sum(A1,A2)
a    0.9   0.85   1.75
d    0.6   0.9    1.5
ω = sum(0.8, 0.85) = 1.65
Only a reaches ω = 1.65, so the scan continues.
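The two alternating steps of the example can be sketched in a few lines. This is an illustrative version of the threshold algorithm under simplifying assumptions: only the four objects a, b, c, d from the object table are used, all lists have the same length, and the names `threshold_algorithm`, `lists` and `scores` are hypothetical.

```python
import heapq

def threshold_algorithm(lists, scores, k, agg=sum):
    """Top-k by sorted access to each list plus random access to full scores.

    lists:  one list of (object, value) per attribute, sorted by value descending
    scores: object -> tuple of all attribute values (used for random access)
    """
    seen = {}  # object -> aggregated score
    top, omega = [], None
    for depth in range(len(lists[0])):
        last = []
        for lst in lists:
            obj, value = lst[depth]          # step 1: sorted access
            last.append(value)
            if obj not in seen:
                seen[obj] = agg(scores[obj])  # random access + aggregation
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        omega = agg(last)                    # step 2: threshold from last seen values
        if len(top) == k and all(score >= omega for _, score in top):
            return top, depth + 1, omega     # halting condition met
    return top, len(lists[0]), omega

scores = {"a": (0.9, 0.85), "b": (0.8, 0.7), "c": (0.72, 0.2), "d": (0.6, 0.9)}
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]

top, depth, omega = threshold_algorithm([L1, L2], scores, k=2)
print(top, depth, omega)
```

On this reduced data the scan halts at depth 3, when the threshold has dropped to sum(0.72, 0.7) and both buffered scores are at least as large; b and d tie at 1.5, so either may occupy the second slot.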
A Top-2 Query Example
Situation at stopping:
ω = sum(0.72, 0.7) = 1.42 d (= 4)
. .
. :
g6. 5
Possibly unseen g6. 5
: .
. . Threshold value
g4: 6 g3: 9
TA-based Filtering Strategy for Graph
Search Problem
• A graph database with a query example
Sorted list L1 Sorted list L2 Sorted list L3
Requirement
• An index structure
•Convenient for score-sorted lists construction
• Efficient star search algorithm
•Quickly return similar stars to a query star
• Sorted properties for the halting condition of TA
•The mapping distance of any unseen graph gi satisfies
λ(q, gi) ≥ ω = sum(λ1,…, λm) > d (=τ*δ’)
q is the query graph, τ is the distance threshold, and
where D’ is the set of all unseen graphs.
• Recall that the mapping distance in our previous work satisfies:
•μ(q, gi) ≤ max{4, [min{δ(q), δ(gi)} + 1]} · λ(q, gi)
•We denote δ(q, gi) = max{4, [min{δ(q), δ(gi)} + 1]},
then δ(q, gi) ≤ δ’.
•If μ(q, gi) > τ*δ’, then λ(q, gi) > τ*δ’/δ > τ,
and this graph can be safely filtered out.
• Halting condition of TA
•The mapping distance of any unseen graph gi satisfies
• μ(q, gi) ≥ ω = sum(λ1,…, λm) > d (=τ*δ’)
•q is the query graph, τ is the distance threshold, and
•δ’ = max{4, [min{δ(q), δ(D’)} + 1]}
•where D’ is the set of all unseen graphs.
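The halting test can be written as a small check. This is a sketch under assumptions: δ(q) and δ(D’) are passed in as plain integers because the slide does not define how they are computed, and the names `delta_prime` and `can_halt` are hypothetical.

```python
def delta_prime(delta_q, delta_unseen):
    """δ' = max{4, min{δ(q), δ(D')} + 1}, with both δ values supplied by the caller."""
    return max(4, min(delta_q, delta_unseen) + 1)

def can_halt(last_seen, tau, delta_q, delta_unseen):
    """True when ω = sum of last-seen distances exceeds d = τ * δ'.

    In that case every unseen graph has mapping distance above τ * δ',
    hence edit distance above τ, and the scan may stop.
    """
    omega = sum(last_seen)
    return omega > tau * delta_prime(delta_q, delta_unseen)

print(can_halt([3, 4, 6], tau=3, delta_q=2, delta_unseen=5))  # ω = 13 vs d = 12
print(can_halt([3, 4, 5], tau=3, delta_q=2, delta_unseen=5))  # ω = 12 vs d = 12
```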
Build Inverted Index Structures based
on the Star Decomposition
• The upper-level index
•Build an inverted index between stars and graphs
•Used to quickly return graph lists
• The lower-level index
•Build an inverted index between labels and stars
•Used to construct the sorted lists
for top-k star search based on the TA
filtering strategy
Build Inverted Index Structures based
on the Star Decomposition
Top-k Star Search Algorithm
• Construct sorted lists
Graph Score-sorted Lists
• Construct lists based on the top-k results
TA-based Graph Range Query
• Definition
•Given a graph database D and a query q, find all gi ∈ D that
are similar to q with λ(q, gi) ≤ τ. τ is the distance
threshold.
• Steps: given m sorted lists for a query graph q
•Step 1: perform sorted retrieval in a round-robin schedule on each
sorted list. For a retrieved graph gi, if Lm(q, gi) > τ, filter out
the graph; if Um(q, gi) ≤ τ, report the graph to the answer
set.
•Step 2: for each sorted list SLj, let χj be the corresponding distance
last seen under sorted access. If ω = sum(χ1,…, χm) >
τ∗δ’, then halt. Otherwise, go to step 1.
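The round-robin scan and its halting test can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `ta_candidates` and the list layout (each sorted list is a Python list of `(graph_id, distance)` pairs in ascending distance order) are assumptions, and the per-graph checks against Lm and Um are left out so that only the halting condition ω > τ∗δ’ is shown.

```python
def ta_candidates(sorted_lists, tau, delta_prime):
    """Scan m sorted lists round-robin; return the graph ids seen before halting.

    sorted_lists: list of lists of (graph_id, distance), ascending by distance
    (assumed layout). Any graph never seen has mapping distance >= omega.
    """
    last_seen = [0.0] * len(sorted_lists)   # chi_j for each list
    seen = set()
    max_depth = max((len(sl) for sl in sorted_lists), default=0)
    for depth in range(max_depth):
        for j, sl in enumerate(sorted_lists):
            if depth < len(sl):
                graph_id, dist = sl[depth]
                last_seen[j] = dist
                seen.add(graph_id)
        omega = sum(last_seen)
        if omega > tau * delta_prime:       # halting condition of the slide
            break
    return seen
```

Graphs returned by the sketch would still have to pass the bound checks of step 1; graphs never returned can be skipped because their mapping distance is at least ω.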
CA-based Filtering Strategy
• The difference between TA and CA
•TA computes the mapping distance between two graphs
when retrieving a new graph through sorted accesses
•CA computes only at every h-th depth of the sorted scan: for seen but
unprocessed graphs, it first filters graphs using estimated mapping
distance bounds; then it uses the incremental Hungarian algorithm
to compute the partial mapping distances for filtering
CA-based Filtering Strategy
• Suppose l(g) = {l1,…,ly} ⊆ {1,2,…,m} is a set of known
lists of g seen below q. Let χ(g) be the multiset of
distances of the distinct stars of g last seen in known lists.
•Lower bound denoted by Lμ(q, g) is obtained by substituting
the missing lists j ∈ {1,2,…,m}\l(g) with χj (the distance
last seen under the jth list) in ζ(q, g)
•Upper bound denoted by Uμ(q, g) is computed as
Uμ(q, g) = t′(χ(g)) + χ ∗ (|g| − |χ(g)|)
• Theorem: Let g1 and g2 be two graphs; the bounds
obtained as above satisfy
ζ(g1, g2) ≤ Lμ(g1, g2) ≤ μ(g1, g2) ≤ Uμ(g1, g2)
CA-based Filtering Strategy
• Dynamic Hungarian for partial mapping distance
•Given m sorted lists for q, suppose S′(g) ⊆ S(g) is a
multiset of stars in g seen below lists. Then we have μ(S(q),
S′(g)) ≤ μ(q, g)
CA-based Graph Range Query
• Steps: given m sorted lists for a query graph q
•Perform sorted retrieval in a round-robin schedule to each
sorted list. At each depth h of lists:
• Maintain the lowest values χ1, . . . , χm encountered in the
lists. Maintain a distance accumulator ζ(q, gi) and a multiset
of retrieved stars S′(gi) ⊆ S(gi) for each gi seen under lists.
• For each gi that is retrieved but unprocessed: if ζ(q, gi) > τ∗δgi,
filter it out; if Lμ(q, gi) > τ∗δgi, filter it out; if Uμ(q, gi) ≤ τ∗δgi,
add the graph to the candidate set. Otherwise, if μ(S(q), S′(gi))
> τ∗δgi, filter out the graph. Finally, run the Dynamic
Hungarian to obtain Lm(q, gi) and Um(q, gi) for filtering.
•When a new distance is updated, compute a new ω. If ω =
t′(χ) > τ∗δ′, then halt. Otherwise, go to step 1.
Experimental Results: Sensitivity test
Experimental Results: Index construction
Experimental Results: compare with other
works varying distance thresholds
Experimental Results: compare with other
works varying dataset sizes
References
• D. Conte, Pasquale Foggia, Carlo Sansone, and Mario Vento.
Thirty Years of Graph Matching in Pattern Recognition.
• P. Foggia, C. Sansone and M. Vento. A performance
comparison of five algorithms for graph isomorphism. In 3rd
IAPR-TC15 workshop on graph-based representations in
pattern recognition, 2001.
• K. Riesen, M. Neuhaus, and H. Bunke. Bipartite graph
matching for computing the edit distance of graphs. In GBRPR,
2007.
• P. Hart, N. Nilsson, and B. Raphael. A formal basis for the
heuristic determination of minimum cost paths. IEEE Trans.
SSC, 1966.
• D. Justice. A binary linear programming formulation of the
graph edit distance. IEEE TPAMI, 2006.
• R. Giugno and D. Shasha. Graphgrep: A fast and universal
method for querying graphs. In ICPR, 2002.
• R. D. Natale, A. Ferro, R. Giugno, M. Mongiovì, A. Pulvirenti,
and D. Shasha. SING: subgraph search in non-homogeneous
graphs. BMC Bioinformatics, 2010.
• X. Yan, P.S. Yu, and J. Han. Graph indexing: a frequent
structure-based approach. In SIGMOD, 2005.
• J. Cheng, Y. Ke, W. Ng, and A. Lu. Fg-index: towards
verification-free query processing on graph databases. In
SIGMOD, 2007.
• D.W. Williams, J. Huan, and W. Wang. Graph database
indexing using structured graph decomposition. In ICDE, 2007.
• S. Zhang, M. Hu, and J. Yang. Treepi: a novel graph indexing
method. In ICDE, 2007.
• P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: tree + delta
>= graph. In VLDB, 2007.
• G. Wang, B. Wang, X. Yang, and G. Yu. Efficiently indexing
large sparse graphs for similarity search. IEEE TKDE, 2010.
Mining and Searching Complex Structures Chapter 6 Massive Graph Mining
Searching and Mining Complex
Structures
Massive Graph Mining
Anthony K. H. Tung(鄧锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html
Social Network Link: http://www.renren.com/profile.do?id=313870900
Graph applications: everywhere
And often, they are huge and messy.
social network
Bio Pathway
Co-authorship
network
Knowledge: NOWHERE
Unless we manage to find where they hide.
Too many clues is like no clue.
Roadmap
Part I (1.5 hrs)
•Graph Mining Primer
•Recent advances in Massive Graph Mining
Part II (1.5 hrs)
•CSV: cohesive subgraph mining
•DNgraph mining: a triangle based approach
Roadmap
• Graph Mining Primer
• Data mining vs. Graph mining
• Massive graph mining domain
• Types of graph patterns
• Properties of large graph structure
• Recent advances in Massive Graph Mining
• CSV: cohesive sub graph Mining
• DNgraph mining: a triangle based approach
From Data Mining to Graph Mining
Data Mining
• Classification
• Clustering
• Association rule learning
Graph Mining
• Captures more complicated entity relationships.
• Output: patterns, which are smaller subgraphs with
interpretable meanings.
Massive graph mining domains
• Financial data analysis
• Bioinformatics network
• User profiling for customized search
• Identify financial crime
Financial data analysis
In the stock market, correlations among stocks help in profit making.
Mining stock correlation graphs helps in predicting stocks' price
changes, for estimating future returns, allocating portfolios,
controlling risks, etc.
[Figures: stock correlations in tabular form; stock correlation patterns, with highly correlated stock sets highlighted]
Bioinformatics network
•Protein-protein interaction
• A fundamental activity in numerous living cells.
• A dense graph pattern
indicates these proteins
have similar functionalities.
one representation of an assembled
NEDD9 network
User profiling for customized search
The Internet Movie Database (IMDB)
Registered users can comment on movies of their interest.
Mining the comment-sharing network provides insight into
users’ interests and thus further facilitates customized search.
[Figure: movie-centric view of the IMDB review network]
Identify financial crime
Large classes of financial crimes, such as money laundering,
follow certain transactional patterns.
[Figures: geospatial information of suspects; a money laundering pattern]
Dense Graph Patterns
Clique/Quasi-Clique
A clique represents the highest level of internal interactions.
A quasi-clique is an ``almost'' clique with a few missing edges.
High Degree Patterns
Concern the average vertex degree, where the degree of a vertex is the
number of edges incident to it.
Dense graph patterns (cont.)
Dense Bipartite Patterns
[Figure: bipartite graph of pathways and genes for the AML/ALL dataset]
Heavy Patterns
[Figure: weighted, directed graph of an online citation network, by Rosvall & Bergstrom]
Properties of large graph structure
Static
•Power law degree distributions.
•Small world phenomenon.
•Communities and clusters.
Dynamic
•Shrinking diameters as graphs grow
•Densification over time
Power law
Large graph: properties and laws (cont.)
Dynamic
•Shrinking diameters as graphs grow
•Densification over time
Roadmap
• Graph Mining Primer
• Large graph: properties and laws
• Approaches in Graph mining
• Pattern based Mining algorithms
• Practical techniques in Massive Graph Mining
• Graph summarization with randomized sampling
•Connectivity based traversal
•MapReduce based
• CSV: cohesive subgraph Mining
• Dngraph mining: a triangle based approach
Pattern based Mining algorithms
Greedy methods
SUBDUE (PWKDD94), GBI (JAI94)
Apriori-based approaches (detail in next few slides)
AGM , FSG, gSpan
Inductive logic programming (ILP) oriented solutions
WARMR, FARMAR
Kernel based solutions
Kernels for graph classification
Apriori Paradigm Recall
Search in a breadth-first manner.
Use a lattice structure to count candidate subgraph sets
efficiently.
[Figure: a search lattice for item set mining]
Apriori-based Graph Mining
Performance bottleneck: candidate subgraph generation.
Solution:
1. Build a lexicographic order among graphs.
2. Search using depth-first strategy.
Very effective in mining large collections of small to medium
size graphs.
Graph summarization with randomized
sampling
• Efficient Aggregation for Graph Summarization –
SIGMOD 2008
• Graph Summarization with Bounded Error-SIGMOD
2008
• Mining graph patterns efficiently via randomized
summaries - VLDB 2009
Efficient Aggregation for Graph
Summarization
As graph size increases, graph summarization becomes
crucial for visualizing the whole graph.
Criteria for an efficient summarization solution
Able to produce meaningful summaries for real
applications.
Scalable to large graphs.
The choice: graph aggregation
Graph Aggregation
1. Summarization based on user-selected node attributes and
relationships.
2. Produce summaries with controllable resolutions.
“drill-down” and “roll-up” abilities to navigate
Two aggregation operations are proposed
SNAP – addresses 1
k-SNAP – addresses 2
Operation SNAP
Group nodes by user-selected node attributes & relationships
Nodes in each group are homogeneous (in terms of attributes
and relationships).
Goal: minimum # of groups
How does SNAP work?
Top down approach
Initial Step: Use user-selected attributes to group nodes.
Iterative Step:
If a group is not homogeneous w.r.t. relationships, split the
group based on its relationships with other groups.
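The top-down procedure above can be sketched as a partition refinement. This is a simplified illustration under stated assumptions, not the SNAP implementation from the paper: the function name `snap_groups` is hypothetical, the graph is undirected with a single relationship type, and all groups are re-split in each round rather than one group at a time.

```python
def snap_groups(attrs, edges):
    """attrs: {node: attribute value}; edges: iterable of undirected (u, v) pairs."""
    neighbors = {v: set() for v in attrs}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Initial step: group nodes by the user-selected attribute.
    group_of = {v: attrs[v] for v in attrs}
    while True:
        # A node's signature is its own group plus the set of groups it relates to.
        signature = {
            v: (group_of[v], frozenset(group_of[w] for w in neighbors[v]))
            for v in attrs
        }
        # Signatures only ever split groups, so an equal count means no split.
        if len(set(signature.values())) == len(set(group_of.values())):
            break
        group_of = signature
    groups = {}
    for v, g in group_of.items():
        groups.setdefault(g, set()).add(v)
    return sorted(sorted(members) for members in groups.values())
```

On termination every node in a group relates to the same set of other groups, which is the homogeneity requirement stated on the slide.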
SNAP limitation
Homogeneity requirement for relationships
Noise and uncertainty
Users have no control over the resolutions of summaries
SNAP operation can result in a large number of small groups
Operation k-SNAP
The entities inside a group are not necessarily
homogeneous in terms of relationships with other
groups.
Users can control resolution by specifying k (#
groups).
Varying of k provides “drill-down” and “roll-up”
abilities.
Assess quality of summarization
Determined by the sum of noisy relations.
When the relationship between two groups is strong
(>50%), count missing participants.
When the relationship between two groups is weak
(Mining->Verification
[Figure: Raw DB, Summarized DB, Raw DB]
Reduce false positive
• Technique 1: merge vertices that are far away from each
other.
•The length of the shortest path
•The probability of random walk
• Technique 2: merge vertices whose neighborhood overlap.
•Cosine, Chi^2, Lift, Coherence
• Technique 3: Go back to raw database to do verification
It is guaranteed that there are no false positives.
Summarization may cause false positives
[Figure: false embeddings in the summarized graph lead to false positives]
Summarization: Reduce false negative
[Figure: missed embeddings in the summarized graph lead to false negatives]
Technique 1: For raw database with frequency threshold min_sup,
we adopt a lower frequency threshold pseudo min_sup for
summarized database.
Technique 2: Iterate the mining steps for T times and combine the
results generated in each time.
It is NOT guaranteed that there are no false negatives, but the
possibility is bounded by
Connectivity based traversal
CSV: Cohesive Subgraph Mining –SIGMOD 2008
(Discussed in detail in Part II)
Progressive Clustering of Networks using
Structure-Connected Order of Traversal –ICDE 2010
Progressive clustering of networks using structure-
connected order of traversal
SCAN Algorithm
•Similar to DBSCAN: connectivity-based
•Average O(n) time
•Uses structural similarity measure, minimum cluster size mu, and
minimum similarity epsilon
•Finds outliers and hubs
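The structural similarity used by SCAN can be sketched as below. This follows the usual definition from the SCAN paper, where the neighborhood Γ(v) includes v itself and σ(u, v) = |Γ(u) ∩ Γ(v)| / sqrt(|Γ(u)| · |Γ(v)|); the function names and the adjacency layout (`adj` maps a vertex to the set of its neighbors) are assumptions made for the example.

```python
import math

def structural_similarity(adj, u, v):
    """Normalized overlap of the closed neighborhoods of u and v."""
    gamma_u = adj[u] | {u}
    gamma_v = adj[v] | {v}
    return len(gamma_u & gamma_v) / math.sqrt(len(gamma_u) * len(gamma_v))

def is_core(adj, u, eps, mu):
    """u is a core if at least mu vertices of its closed neighborhood
    have structural similarity >= eps with u."""
    gamma_u = adj[u] | {u}
    eps_neighborhood = [w for w in gamma_u
                        if structural_similarity(adj, u, w) >= eps]
    return len(eps_neighborhood) >= mu
```

Clusters are grown from cores, which is why a single global epsilon decides every cluster boundary at once.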
Problems
•No automated way to find good epsilon
•Must rerun algorithm for each possible epsilon
•Epsilon is global threshold
• No hierarchical clusters
• No variation in cluster subtlety
Solution
• Structure-Connected Order of Traversal (SCOT)
•Contains all possible epsilon-clusterings
• Efficient method to find global epsilon
• New Contiguous Subinterval Heap structure
(ContigHeap)
• New Progressive Mean Heap Clustering (ProClust)
•Epsilon-free
•Hierarchical
• Refinement by Gap Constraint (GapMerge)
[Figures: original network; SCOT plot]
Optimal Global Epsilon
SCAN paper only contains supervised
sampling method.
Sample points, find k-NN similarities, sort,
plot, find knee visually
O(nd log n) time
In addition to clustering time
Our solution:
Knee hypothesis implies an approximately concave
plot
Optimal epsilon minimizes the obtuse angle
between segments
Modified histogram and binary search: O(n)
time
Reuses the already computed SCOT result
ContigHeap
BuildContigHeap produces heap containing
all contiguous subintervals from SCOT
output in O(n) time, and integrates with
SCOT
Example:
GapMerge: Gap Constraint Refinement
Merges chained clusters, heap branches with single children
Does not merge across pruned heap nodes (local maxima boundary)
Gap constraint prevents clusters whose left or right boundaries differ by more
than mu from being merged
Such clusters are not redundant relative to the minimum interesting cluster size
Steps
1. Identify chains that meet the gap constraint.
2. When a node has more than one child or violates the gap constraint, begin a new chain.
3. Within each chain, calculate the significance of each cluster in both up and down
directions.
4. Begin with the most redundant node; merge nodes in the direction of least significance.
5. After each merge, recalculate significances.
6. Continue until the chain contains one node, or no merging is possible under the gap constraint.
MapReduce based approach
PEGASUS: A Peta-Scale Graph Mining System –ICDM
2009
Pregel: a system for large-scale graph processing SIGMOD
2010
PEGASUS: A Peta-Scale
Graph Mining System
Deals with real graphs such as the Yahoo! Web graph, with up to 6.7
billion edges.
A Hadoop-based graph mining package.
Targets primitive matrix operations such as matrix-vector
multiplication (GIM-V).
Motivation
Many Graph mining tasks require matrix
multiplication
PageRank,
Random Walk with Restart(RWR),
Diameter estimation, and
Connected components …
MapReduce provides a simplified programming
concept for large data processing
Details of data distribution, replication, and load balancing are
taken care of.
Provides a programming structure similar to functional
programming
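GIM-V: Generalized Iterative Matrix-Vector multiplication
Intuition: Matrix Multiplication
M × v = v', where v'_i = Σ_{j=1}^{n} m_{i,j} × v_j
•combine2: multiply m_{i,j} and v_j
•combineAll: sum the n multiplication results for node i
•assign: overwrite the previous value of v_i with the new result to make v'_i
The operator ×G is matrix multiplication expressed by the above 3
steps.
×G is carried out iteratively until convergence.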
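The three steps can be sketched in a few lines of single-machine code. This is only an in-memory illustration of the operator, not the Hadoop implementation in PEGASUS; the function name `gimv` and the sparse layout (a list of `(i, j, m_ij)` triples) are assumptions made for the example.

```python
def gimv(edges, v, combine2, combine_all, assign):
    """One application of the generalized operator to vector v.

    edges: list of (i, j, m_ij) triples for the non-zero entries of M.
    """
    partial = [[] for _ in v]
    for i, j, m_ij in edges:
        partial[i].append(combine2(m_ij, v[j]))          # combine2
    return [assign(v[i], combine_all(i, partial[i]))     # combineAll, assign
            for i in range(len(v))]

# Plain matrix-vector multiplication as an instance of the operator.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
result = gimv(M, [10.0, 20.0],
              combine2=lambda m, x: m * x,
              combine_all=lambda i, xs: sum(xs),
              assign=lambda old, new: new)
```

With these three functions the call computes the ordinary product of the matrix [[1, 2], [0, 3]] with the vector (10, 20).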
×G and SQL
The matrix multiplication operation ×G can be expressed by
an SQL query.
If we view the graph as two tables,
an edge table E(sid, did, val) and
a vector table V(id, val),
then ×G becomes
SELECT E.sid, combineAll_{E.sid}(combine2(E.val, V.val))
FROM E, V
WHERE E.did = V.id
GROUP BY E.sid
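For ordinary matrix-vector multiplication, combine2 is a product and combineAll is SUM, so the query can be run as-is on any SQL engine. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the helper name `matvec_sql` and the `ORDER BY` clause are additions made for the example.

```python
import sqlite3

def matvec_sql(edge_rows, vector_rows):
    """edge_rows: (sid, did, val) triples; vector_rows: (id, val) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE E (sid INTEGER, did INTEGER, val REAL)")
    conn.execute("CREATE TABLE V (id INTEGER, val REAL)")
    conn.executemany("INSERT INTO E VALUES (?, ?, ?)", edge_rows)
    conn.executemany("INSERT INTO V VALUES (?, ?)", vector_rows)
    rows = conn.execute(
        "SELECT E.sid, SUM(E.val * V.val) "   # combineAll = SUM, combine2 = product
        "FROM E, V WHERE E.did = V.id "
        "GROUP BY E.sid ORDER BY E.sid"
    ).fetchall()
    conn.close()
    return rows
```

Each output row holds one element of the result vector, keyed by the row index sid.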
Generalize ×G
Vary the definition of the three steps to generalize ×G
PageRank
p = (cE^T + (1 - c)U)p
•E: row-normalized adjacency matrix
•U: matrix with all elements = 1/n
•c: damping factor = 0.85
combine2 = c × m_{i,j} × v_j
combineAll = (1 - c)/n + Σ_{j=1}^{n} x_j
Generalize (cont.)
By altering three functions, GIM-V adapts to
• Random Walk with Restart
• Diameter Estimation
• Connected Components
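Two of these instantiations can be sketched by swapping the three functions. The PageRank functions follow the definitions on the PageRank slide; the connected-components functions (combine2 = m × v, combineAll = MIN, assign = MIN of old and new) are stated here from the PEGASUS paper rather than from the slides. The in-memory `gimv` helper and the edge layout are assumptions made for the example.

```python
def gimv(edges, v, combine2, combine_all, assign):
    """One application of the generalized operator; edges are (i, j, m_ij)."""
    partial = [[] for _ in v]
    for i, j, m_ij in edges:
        partial[i].append(combine2(m_ij, v[j]))
    return [assign(v[i], combine_all(i, partial[i])) for i in range(len(v))]

def pagerank(edges, n, c=0.85, iterations=50):
    """edges: (i, j, m_ij) entries of E^T, i.e. j links to i with weight 1/outdeg(j)."""
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = gimv(edges, p,
                 combine2=lambda m, x: c * m * x,
                 combine_all=lambda i, xs: (1 - c) / n + sum(xs),
                 assign=lambda old, new: new)
    return p

def connected_components(undirected_edges, n):
    """Label every vertex with the smallest vertex id in its component."""
    edges = [(i, j, 1) for u, w in undirected_edges for i, j in ((u, w), (w, u))]
    labels = list(range(n))
    while True:
        new_labels = gimv(edges, labels,
                          combine2=lambda m, x: m * x,
                          combine_all=lambda i, xs: min(xs, default=i),
                          assign=lambda old, new: min(old, new))
        if new_labels == labels:          # converged
            return labels
        labels = new_labels
```

Only the three functions differ between the two algorithms; the iteration skeleton is shared.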
GIM-V: How to
Stage 1
Combine2
V: Key = id, v: vval, E: Key = idsrc
Stage 2
CombineAll & assign
Bottleneck: shuffling and disk I/O
GIM-V BL: Block Multiplication
Advantage
Save on sorting
Data compression
Clustered edges
Block advantage (cont.)
Clustered edges:
GIM-V DI: Diagonal Block Iteration
Intuition
Increase the number of multiplications
inside an iteration to
reduce the # of iterations.
How
Reach local convergence
within a block first before
iterating
[Figure: comparison of GIM-V BL and DI]
Main Results
Scalability
GIM-V BL DI is ~5 times faster than GIM-V Base
Main Results (cont.)
Evolution of LinkedIn
The distribution of connected
components is stable after a
‘gelling’ point in 2003.
Main Results (cont.)
Bimodal structure of Radius
Pregel: A System for Large-Scale
Graph Processing
A scalable and fault-tolerant platform with an API that is
sufficiently flexible to express arbitrary graph algorithms.
Model of computing:
Vertex centric, synchronized iterative model
Graph Algorithms Implementation in
Pregel
Graph data stay on their respective machines; only messages are
passed, NO graph state passing.
Pregel C++ API
• Compute() - executed at each active vertex in every
superstep.
•Query information about the current vertex and its edges.
•Send messages to other vertices.
•Inspect or modify the value associated with its vertex or
out-edges.
•State updates are visible immediately. There are no data races on
concurrent value access from different vertices.
• Limiting the graph state managed by the framework to a
single value per vertex or edge simplifies the main
computation cycle, graph distribution, and failure
recovery.
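The vertex-centric, synchronized model can be imitated on one machine to see how supersteps, messages, and halting interact. The sketch below is a Python simulation and not the Pregel C++ API: `run_supersteps`, `max_value_compute`, and their signatures are hypothetical names invented for the example. It propagates the maximum vertex value through the graph, with messages sent in one superstep delivered in the next.

```python
def run_supersteps(graph, values, compute):
    """graph: {vertex: [out-neighbors]}; values: {vertex: value}."""
    inbox = {v: [] for v in graph}
    active = set(graph)                      # every vertex starts active
    superstep = 0
    while active:
        outbox = {v: [] for v in graph}

        def send(target, message):
            outbox[target].append(message)

        halted = set()
        for v in sorted(active):
            values[v], vote_to_halt = compute(
                superstep, values[v], graph[v], inbox[v], send)
            if vote_to_halt:
                halted.add(v)
        inbox = outbox                       # delivered in the next superstep
        # A halted vertex is reactivated when it receives a message.
        active = (active - halted) | {v for v in graph if inbox[v]}
        superstep += 1
    return values

def max_value_compute(superstep, value, out_neighbors, messages, send):
    new_value = max([value] + messages)
    if superstep == 0 or new_value > value:  # tell neighbors about a change
        for target in out_neighbors:
            send(target, new_value)
    return new_value, True                   # always vote to halt
```

The computation ends when every vertex has voted to halt and no messages are in transit, which matches the termination rule of the model.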
Pregel C++ API (cont.)
• Message Passing
•No guaranteed order, but each message is delivered exactly once
(no duplication).
• Combiners
•Combine several messages to reduce overhead
• Aggregators
•Mechanism for global communication, monitoring, and data.
•A number of predefined aggregators, such as min, max, or
sum operations
• Topology mutation
•Change graph topology; resolve conflicts when individual
vertices send conflicting messages.
Pregel C++ API (cont.)
• Input and output
• Readers and writers
Pregel implementation
• Designed for the Google cluster architecture
•Each cluster consists of thousands of commodity PCs
• Persistent data
•Stored in files on distributed storage systems such as GFS or
BigTable
• Temporary data
•Stored as buffered messages on local disk.
Pregel: Assign load
• Divide graph vertices into partitions and assign them to
different machines
•Controllable by users; default method: hash
• In absence of fault:
•One master, many workers on a cluster of machines.
•The master assigns load and I/O to workers and instructs them on supersteps
• Fault tolerance:
•Uses checkpoints; the master pings workers
•Confined recovery (under development): workers log outgoing
messages
Graph Application
PageRank
Shortest Path
Bipartite Matching
Semi Cluster
Pregel: Main Result
Reference (partial)
Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations by J.
Leskovec, J. Kleinberg, C. Faloutsos. (KDD), 2005.
Substructure Discovery in the SUBDUE System. L. B. Holder, D. J. Cook and S. Djoko. In
(PWKDD), 1994.
Efficient Aggregation for Graph Summarization – Yuanyuan Tian, Richard A. Hankins, Jignesh M.
Patel SIGMOD 2008
Graph Summarization with Bounded Error-Saket Navlakha, Rajeev Rastogi, Nisheeth Shrivastava
SIGMOD 2008
Mining graph patterns efficiently via randomized summaries Chen Chen, Cindy X. Lin, Matt
Fredrikson, Mihai Christodorescu, Xifeng Yan, Jiawei Han - VLDB 2009
Progressive Clustering of Networks using
Structure-Connected Order of Traversal Dustin Bortner, Jiawei Han –ICDE 2010
PEGASUS: A Peta-Scale Graph Mining System U. Kang, Charalampos E. Tsourakakis,
Christos Faloutsos, ICDM 2009
Graph based induction as a unified learning framework, K. Yoshida, H. Motoda, and N. Indurkhya.
Applied Intelligence volume 4, 1994.
Complete mining of frequent patterns from graphs: Mining graph data. A. Inokuchi, T. Washio, and
H. Motoda. Mach. Learn., 50(3):321–354, 2003.
Reference (cont.)
Frequent subgraph discovery, M. Kuramochi and G. Karypis. In ICDM, pages 313–320, 2001.
gSpan: Graph-based substructure pattern mining, X. Yan and J. Han. ICDM 2002.
WARMR Discovery of frequent datalog patterns. L. Dehaspe and H. Toivonen. Data Mining and
Knowledge Discovery, 3(7-36), 1999.
FARMAR Fast association rules for multiple relations. S. Nijssen and J. Kok. Data Mining and
Knowledge Discovery, 3(7–36), 1999.
Roadmap
Part I (1.5 hrs)
Graph Mining Primer
Recent advances in Massive Graph Mining
Part 2(1.5 hrs)
CSV: cohesive subgraph Mining
Dngraph mining: a triangle based approach
CSV
1. Cohesive sub-graph mining, with visualization
2. Existing approaches
3. CSV provides effective visual solution
– Algorithm principle
– Connectivity Estimation
4. Experimental Study
Existing solutions
1. Current state-of-the-art to abstract information from huge
graphs: information Yes, structure No.
1. Graph partition algorithms.
Spectral clustering [Ng01]: high computational cost
METIS [Karypis96]: favors balanced patterns
2. Graph Pattern Mining algorithms
CODENSE [Hu05], CLAN [Zeng06]: exponential running time
2. Graph Layout Tools: information No, structure Yes.
Osprey [Breitkreutz03], Visant [Mellor04]: do not have mining
capability
We want structured information
CSV: General Approach
• Separate vertices in the graph into VISITED, UNVISITED
• Start: Pick a vertex and add into VISITED
• Repeat until UNVISITED=empty
–Among all vertices that are in UNVISITED, pick one vertex V
most highly connected to VISITED
–Plot V’s connectivity
–Add V into VISITED
But how do we measure connectivity?
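Leaving the connectivity measure as a pluggable function, the traversal above can be sketched as follows. The function name `csv_order` is hypothetical, and the default measure (the number of already visited neighbors) is only a stand-in assumed for this example; CSV itself uses a clique-based connectivity measure.

```python
def csv_order(adj, start, connectivity=None):
    """adj: {vertex: set of neighbors}. Returns [(vertex, plotted value), ...]."""
    if connectivity is None:
        # Stand-in measure (assumption): count of neighbors already visited.
        connectivity = lambda v, visited: len(adj[v] & visited)
    visited = {start}
    unvisited = set(adj) - visited
    plot = [(start, 0)]
    while unvisited:
        # Pick the unvisited vertex most highly connected to VISITED
        # (ties broken by vertex name).
        best = max(sorted(unvisited), key=lambda v: connectivity(v, visited))
        plot.append((best, connectivity(best, visited)))
        visited.add(best)
        unvisited.remove(best)
    return plot
```

Runs of high values in the returned sequence correspond to the peaks of the plot.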
Connectivity measurement
Connectivity measurement is closely related to clique (fully connected sub-
graph) size.
The connectivity between two vertices in a graph (ηmax) is defined to be the
size of the biggest clique in the graph such that both are members of the clique.
The “connectivity” of a vertex (ζmax) is similarly defined as the size of the
biggest clique it can participate in.
[Figures: two example graphs on vertices a, b, c, d, e]
ηmax(a, d) = 0
ηmax(a, c) = 4
ζmax(a) = 5
CSV: Step by Step
[Figure: synthetic graph on vertices A to J (legend: unvisited, neighbors, visiting, visited), the heap of candidate vertices, and the connectivity plot being built]
Start from A, explore A’s neighbor B.
Calculate ζmax(A)=2 and output it
CSV algorithm on a synthetic graph
Mark A visited; from B, explore B’s immediate neighbors C, F, H.
Calculate ηmax(A, B)=2 and output it
CSV algorithm on a synthetic graph
Mark B visited, choose the closely connected C as the next visiting vertex.
From C, explore C’s immediate neighbors D, F, G, H, and update ηmax when necessary.
Calculate ηmax(B, C)=4 and output it
CSV algorithm on a synthetic graph
Visit every vertex accordingly to produce a plot (vertex order in the plot: A B C H F G D E I J).
Peaks represent cohesive sub-graphs.
Important Theorem
Connectivity computation is a hard
problem
However, if graphs are very large, exact computation of
connectivity is prohibitive.
Direct computation is costly
Connectivity computation is prohibitive
•Exact algorithm relies on clique detection (NP-hard).
•Even approximation is hard.
•Solution Part 1: Spatial Mapping
•Pick k pivots
•Map the graph into k-dimensional space based on the vertices’
shortest distances to the pivots
•A clique will map into the same grid.
[Figure: the synthetic graph on vertices A to J mapped onto a 2-D grid whose axes are the shortest distances to pivots P0 = A and P1 = I]
Connectivity computation
•Solution Part 2: Approximate Upper Bound for ζmax(v) and
ηmax(v, v’)
•Each vertex in a clique of size k must have
•degree ≥ k-1
•k-1 neighbors with degree ≥ k-1
•For each vertex v, find its immediate neighbors in the same grid cell and
construct a sub-graph and degree array
•Iteratively readjust the estimation for clique size
Let us estimate ηmax(a, f)
Locate the immediate neighborhood of a and f, {a, b, c, d, e, f, g}. After sorting the
degree array in descending order, we have
6(a), 6(f), 5(d), 4(b), 4(c), 4(e), 3(g).
Clique size = 5? = 6? = 7?
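The degree-based part of this bound can be sketched as below. The sketch uses only the first condition (a clique of size k needs k vertices of degree at least k-1) and ignores the condition on neighbors' degrees, so it is looser than the bound described above; the function name `clique_size_upper_bound` is hypothetical.

```python
def clique_size_upper_bound(degrees):
    """Largest k such that at least k vertices have degree >= k - 1."""
    ordered = sorted(degrees, reverse=True)
    for k in range(len(ordered), 0, -1):
        # ordered[k - 1] is the k-th largest degree in the sub-graph.
        if ordered[k - 1] >= k - 1:
            return k
    return 0

# Degree array from the example: 6(a), 6(f), 5(d), 4(b), 4(c), 4(e), 3(g).
bound = clique_size_upper_bound([6, 6, 5, 4, 4, 4, 3])
```

For the example array, sizes 7 and 6 are ruled out (only two vertices have degree 6, and only three have degree at least 5), while size 5 is possible because six vertices have degree at least 4.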
Experimental study on real datasets
DBLP: co-authorship graphs.
DBLP: 2,819 vertices, 54,990 edges
Peaks in the DBLP CSV plot represent different research groups
[Figure: DBLP CSV plot; two of the peaks are annotated as two groups of German researchers]
SMD: Stock Market Data
Peaks in the SMD CSV plot represent highly cohesive stocks
[Figure: SMD CSV plot, annotated with a bridging vertex and two partial cliques]
DIP: Database of interacting proteins
[Figure: DIP CSV plot, with peaks annotated by protein names, including a group with CFT1, CFT2, YSH1, PTA1, MPE1, PFS2, FIP1 and YTH1, and a group of LSM proteins (LSM2 to LSM8)]
Structure of a nucleotide-bound Clp1-Pcf11 polyadenylation factor.
Christian G. Noble, Barbara Beuth, and Ian A. Taylor. Nucleic Acids Res. 2007 January;
35(1): 87–99.
“CPF is also required in both the cleavage and polyadenylation reactions. It contains a
core of eight subunits Cft1, Cft2, Ysh1, Pta1, Mpe1, Pfs2, Fip1 and Yth1”
Experimental Study
CSV as a pre-selection step
How?
•Apply CSV to identify potential
cohesive sub-graphs first.
•Use exact algorithm CLAN to run on
these candidates.
Result
•Obtains the same exact cohesive sub-graphs as
running CLAN alone.
•Saves 28-84% of the time compared
to running CLAN alone.
[Figure: CSV as a pre-selection method]
DNgraph mining: A triangle based approach
• Mining dense patterns out of an extremely large graph
•When the graph is extremely large, even mining dense
patterns is difficult.
• An iterative improvement mining approach is more desirable
•Users are able to obtain the most updated results on demand.
• Dense patterns have a strong connection with triangles inside a
graph.
• This has already been observed, and is explained by the preferential
attachment property of large-scale graphs.
DNgraph mining: A triangle based approach
• What makes a pattern dense? Intuitively
•A collection of vertices with high relevance.
•They share a large number of common neighbors.
• With that we propose the definition of
DNgraph
•A DNgraph is the largest sub-graph sharing
the most neighbors.
•Require each connected vertex pair to share at
least λ neighbors.
[Figure: example graph on vertices A, B, C, D, E, F and A’, with λ(G) = 3, λ(GA’) = 0]
Compare DNgraph with other dense pattern
definitions
• Two interesting patterns
• A 4-clique and a Turan graph T(14, 4) [14 vertices, 4 groups, fully
connected between groups]
• If mining quasi-cliques, we may end up discovering only 1 pattern, as in
(d)
• If searching for closed cliques, we may only find (e)
DNgraph mining: challenge
• Finding common neighbors for every pair of connected vertices is
expensive
•Requires O(E) join operations.
•Needs random disk access.
•In fact, finding a DN-graph is an NP-problem.
• Solution
•Use the triangles that two vertices participate in to approximate
the common neighbor size.
•Iteratively refine the approximation following the locality of graph edges.
DNgraph mining: How
1. Initially: count # triangles each edge participates in.
•Sort vertices and their neighbors in descending order of their degrees
•Scan the graph to get # triangles for every vertex.
•The # triangles sets the initial value of λ.
2. Next, iteratively refine λ for every vertex
•Using streams of triangles.
•Iteratively refine λcur.
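Triangle Counting: how?
1. Sort vertices and their neighbors in descending order of
their degrees
[Figure: example graph on vertices a to h]
Adjacency lists before sorting:
a: b d e
b: a c d e
c: b d e
d: a c e g h
e: a b c d g f
f: e g
g: d e f
h: d
Adjacency lists after sorting:
e: d b a c g f
d: e a c g h
b: e d a c
a: e d b
c: e d b
g: e d f
f: e g
h: d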
Triangle counting (cont.)
1. Sort vertices and their neighbors in descending order of their degrees
2. Join neighborhoods to count triangles for every edge
• The two vertices exhibit locality, due to
reordering and the preferential attachment
property of large graphs
[Figure: example graph on vertices a to h with the sorted adjacency lists; joining the lists of e and d gives 3 triangles for edge (e, d)]
Triangle counting (cont.)
3. Use that as the initial λ value for every
edge/vertex
• A vertex’s λ value is the maximal λ value of the edges
it participates in
•λcur(e) = 3, λcur(d) = 3, …
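The three steps can be sketched in memory as follows. This is an illustration under stated assumptions rather than the disk-based procedure of the slides: the function names are hypothetical, the graph is an undirected adjacency dictionary of sets, and the neighborhood join is a plain set intersection.

```python
def edge_triangle_counts(adj):
    """adj: {vertex: set of neighbors}. Returns {(u, v): # triangles on edge}."""
    # Step 1: order vertices by descending degree (ties broken by name).
    order = sorted(adj, key=lambda v: (-len(adj[v]), v))
    rank = {v: r for r, v in enumerate(order)}
    counts = {}
    for u in order:
        for v in adj[u]:
            if rank[u] < rank[v]:             # visit each edge once
                # Step 2: join the two neighborhoods; every common
                # neighbor closes one triangle on edge (u, v).
                counts[(u, v)] = len(adj[u] & adj[v])
    return counts

def initial_lambda(adj):
    """Step 3: a vertex starts with the largest count among its edges."""
    lam = {v: 0 for v in adj}
    for (u, v), c in edge_triangle_counts(adj).items():
        lam[u] = max(lam[u], c)
        lam[v] = max(lam[v], c)
    return lam
```

The per-edge count equals the number of common neighbors of the two endpoints, which is why it serves as the starting estimate for λ.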
DNgraph mining: How (cont.)
• Initially: count # triangles each edge participates in.
• Next, iteratively refine λ for every vertex
•Using streams of triangles.
•Iteratively refine λcur.
Triangle stream
•Visit the graph in the same order as during triangle
counting
•Triangles are not materialized, which saves storage
[Figure: three panels, each showing an edge (a, b) labeled lambda=k
together with neighbor vertices n1, n2 and nx]
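A triangle stream of this kind can be sketched with a Python generator, which yields one triangle at a time and never stores the full set. This is a sketch under stated assumptions: `EDGES` is a hypothetical edge list (an assumed reading of the example graph), and the visiting order is descending degree with an alphabetical tie-break.

```python
from collections import defaultdict

# Hypothetical edge list: an assumed reading of the example graph.
EDGES = ["ab", "ad", "ae", "bc", "be", "cd", "ce", "de", "dg", "dh", "ef", "eg", "fg"]


def triangle_stream(edges):
    """Yield each triangle exactly once, visiting vertices in descending degree order."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    order = sorted(nbrs, key=lambda v: (-len(nbrs[v]), v))
    rank = {v: i for i, v in enumerate(order)}
    for u in order:
        # Only look at neighbors that come later in the order, so that
        # every triangle is reported once, at its earliest vertex.
        later = sorted((v for v in nbrs[u] if rank[v] > rank[u]), key=rank.get)
        for i, v in enumerate(later):
            for w in later[i + 1:]:
                if w in nbrs[v]:
                    yield (u, v, w)


for triangle in triangle_stream(EDGES):
    print(triangle)
```

A consumer can update its λ estimates as each triangle arrives and then discard it, which is what keeps the storage cost low.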
Iteratively refine λ
•Visit the graph in the same order as during triangle
counting
•Triangles are not materialized, which saves storage
•For every vertex v, as its triangles arrive, bound λcur(v)
using the λcur of the other two vertices
Iteratively refine λ (cont.)
• Initially: count the # of triangles each edge participates in.
• Next, iteratively refine λ for every vertex
•Using streams of triangles.
•Iteratively refine λcur.
• Until the λcur of all vertices has converged
[Figure: the example graph on vertices a-h, with two annotations of 3]

vertex  λcur
e       3
b       3
…       …
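The refine-until-convergence loop can be sketched as a fixed-point iteration. The slides do not spell out the bounding rule, so the sketch below substitutes a simple edge-based rule as an assumption: an edge keeps value k only if at least k of its triangles have both other edges at value k or more. `EDGES` is a hypothetical edge list (an assumed reading of the example graph), and `h_index` and `refine_lambda` are names made up for this illustration.

```python
from collections import defaultdict

# Hypothetical edge list: an assumed reading of the example graph.
EDGES = ["ab", "ad", "ae", "bc", "be", "cd", "ce", "de", "dg", "dh", "ef", "eg", "fg"]


def h_index(values):
    """Largest k such that at least k of the values are >= k."""
    k = 0
    for i, val in enumerate(sorted(values, reverse=True), start=1):
        if val < i:
            break
        k = i
    return k


def edge_key(u, v):
    """Canonical key for an undirected edge."""
    return (u, v) if u < v else (v, u)


def refine_lambda(edges, max_iters=100):
    """Refine per-edge upper bounds until no value changes (assumed bounding rule)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    # Initial bound: the number of triangles each edge participates in.
    lam = {edge_key(u, v): len(nbrs[u] & nbrs[v]) for u, v in edges}
    for iteration in range(1, max_iters + 1):
        new = {}
        for (u, v), cur in lam.items():
            # Each triangle (u, v, n) supports at most the smaller bound of its other two edges.
            support = [min(lam[edge_key(u, n)], lam[edge_key(v, n)]) for n in nbrs[u] & nbrs[v]]
            new[(u, v)] = min(cur, h_index(support))
        if new == lam:  # converged: a full pass changed nothing
            return lam, iteration
        lam = new
    return lam, max_iters


lam, passes = refine_lambda(EDGES)
print(lam[("d", "e")], passes)
```

The bounds only ever decrease and are non-negative integers, so the loop must stop. On this small assumed graph every edge that lies in a triangle settles at 1, since the graph contains triangles but no 4-clique.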
DNgraph: Experiment
•Large-scale graph
•Flickr dataset with 1,715,255 vertices and 22,613,982
edges.
•One iteration takes 1 hour on a workstation with a Quad-Core
AMD Opteron(tm) 8356 processor, 128GB RAM and a 700GB
hard disk.
•Converges in 66 iterations; almost stable after 35 iterations
Advantage
• Abstraction
• Abstraction within the triangulation algorithm ensures that
our approach extends to different input settings.
• Iteratively refined results
• The estimate of the common neighborhood improves with
every iteration, so users can obtain the most up-to-date
results on demand.
• Pre-collection of statistics to support effective buffer
management
• The process can easily be mapped to key->value pairs for
further distributed processing.
Reference (partial)
[Hu05] H. Hu, X. Yan, Y. Huang, J. Han, and X.J. Zhou. Mining coherent dense subgraphs across
massive biological networks for functional discovery. Bioinformatics, 21(1):213-221, 2005.
[Ng01] A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm.
Advances in Neural Information Processing Systems, volume 14, 2001.
[Karypis96] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular
graphs. Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on
Supercomputing (CDROM), page 35, Washington, DC, USA, 1996. IEEE Computer Society.
[Breitkreutz03] B.J. Breitkreutz, C. Stark, and M. Tyers. Osprey: a network visualization system.
Genome Biology, 4, 2003.
[Mellor04] Z. Hu, J. Mellor, J. Wu, and C. DeLisi. An online visualization and analysis tool for biological
interaction data. BMC Bioinformatics, 5:17-24, 2004.
[Zeng06] J. Wang, Z. Zeng, and L. Zhou. Clan: An algorithm for mining closed cliques from large
dense graph databases. Proceedings of the International Conference on Data Engineering,
page 73, 2006.
[Turan41] P. Turan. On an extremal problem in graph theory. Mat. Fiz. Lapok, 48:436-452, 1941.
[Ankerst99] M. Ankerst, M. Breunig, H.P. Kriegel, and J. Sander. OPTICS: Ordering points to
identify the clustering structure. Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data
(SIGMOD'99), pages 49-60, Philadelphia, PA, June 1999.
[DNgraph10] On Triangle based DNgraph Mining. NUS technical report TRB4/10.