# Lecture Notes, Part 1: Mining and Searching Complex Structures

VLDB Database School (China) 2010
August 3-7, 2010, Shenyang

Anthony K.H. Tung (邓锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung

Contents

Chapter 1: Introduction ------------------------------------------ 1

Chapter 2: High Dimensional Data ------------------------- 34

Chapter 3: Similarity Search on Sequences ------------ 110

Chapter 4: Similarity Search on Trees ------------------- 156

Chapter 5: Graph Similarity Search ---------------------- 175

Chapter 6: Massive Graph Mining ------------------------ 234
## Chapter 1: Introduction


What is data mining?

Really nothing different from what scientists have long been doing: collect data from the real world, generate a model, verify or correct that model against the data, and (if the model is useful enough) win a Nobel Prize.

What's new? Data mining systematically and efficiently tests many statistical models: feed in the data, and output the most likely model based on some statistical measure.

Components of data mining

- Structure of the model, e.g. "geneA=high and geneB=low ==> cancer", or "geneA, geneB and geneC exhibit strong correlation".
- Statistical score for the model, e.g. "the accuracy of rule 1 is 90%".
- Similarity function: is there a sufficiently similar group of records that supports a certain model or hypothesis?
- Search method for the correct model parameters: given 200 genes, there could be 2^200 rules. Which rule gives the best predictive power?
- Database access method: given 1 million records, how do we quickly find the relevant records to compute the accuracy of a rule?

The Apriori Algorithm

- Only reads are performed on the database; candidates are stored in memory to simulate the lattice search.
- The search walks the itemset lattice bottom-up, starting from {}: first the 1-itemsets a, b, c, e; then the 2-itemsets a,b ... c,e; then the 3-itemsets a,b,c ... b,c,e.
- At each level the steps are:
  - generate candidates
  - count and keep the actual frequent itemsets
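The generate-and-count loop above can be sketched as follows; the four-transaction toy database and the minimum support of 2 are assumptions added for illustration:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise lattice search: generate candidate itemsets in memory,
    then count them with a pass over the database (reads only)."""
    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent single items
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= min_support}
    all_frequent = set(freq)
    k = 2
    while freq:
        # Generate candidate k-itemsets by joining frequent (k-1)-itemsets
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq
                             for s in combinations(c, k - 1))}
        # Count the surviving candidates against the database
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) >= min_support}
        all_frequent |= freq
        k += 1
    return all_frequent

# Assumed toy database
txns = [{"a", "b", "c"}, {"a", "b", "e"}, {"a", "c", "e"}, {"b", "c", "e"}]
result = apriori(txns, 2)   # 10 itemsets: every singleton and every pair
```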


The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e. the mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when there are no more new assignments.
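The four steps can be sketched as follows; the slide leaves initialization open, so this sketch simply takes the first k points as the initial seeds (an assumption):

```python
def kmeans(points, k):
    """The 4 steps above: partition, compute centroids, reassign each
    object to the nearest seed point, repeat until nothing changes.
    Initialization is an assumption: the first k points are the seeds."""
    centroids = list(points[:k])
    while True:
        # Step 3: assign each object to the cluster with the nearest seed point
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Step 2: recompute each centroid as the mean point of its cluster
        updated = [tuple(sum(axis) / len(c) for axis in zip(*c)) if c
                   else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:           # Step 4: no more new assignments
            return centroids, clusters
        centroids = updated

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9)]
centroids, clusters = kmeans(pts, 2)       # separates the two dense groups
```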

The K-Means Clustering Method: Example

(Figure: four scatter plots on a 10 by 10 grid illustrating successive iterations: points are partitioned, centroids recomputed, and objects reassigned until assignments stabilize.)


Training Dataset (Decision Tree)

| Outlook  | Temp | Humid  | Wind   | PlayTennis |
|----------|------|--------|--------|------------|
| Sunny    | Hot  | High   | Weak   | No  |
| Sunny    | Hot  | High   | Strong | No  |
| Overcast | Hot  | High   | Weak   | Yes |
| Rain     | Mild | High   | Weak   | Yes |
| Rain     | Cool | Normal | Weak   | Yes |
| Rain     | Cool | Normal | Strong | No  |
| Overcast | Cool | Normal | Strong | Yes |
| Sunny    | Mild | High   | Weak   | No  |
| Sunny    | Cool | Normal | Weak   | Yes |
| Rain     | Mild | Normal | Weak   | Yes |
| Sunny    | Mild | Normal | Strong | Yes |
| Overcast | Mild | High   | Strong | Yes |
| Overcast | Hot  | Normal | Weak   | Yes |
| Rain     | Mild | High   | Strong | No  |
Selecting the Next Attribute

S = [9+, 5-], E = 0.940.

Splitting on Humidity: High gives [3+, 4-] with E = 0.985; Normal gives [6+, 1-] with E = 0.592.
Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151

Splitting on Wind: Weak gives [6+, 2-] with E = 0.811; Strong gives [3+, 3-] with E = 1.0.
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048
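The entropy and information-gain numbers above can be reproduced with a short sketch (the helper names are mine). Note that full precision gives 0.152 rather than 0.151 for Humidity, because the slide rounds the entropies before combining them:

```python
import math

def entropy(pos, neg):
    """E(S) for a set with pos positive and neg negative examples."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

def gain(parent, splits):
    """Gain(S, A) = E(S) - sum over values v of |Sv|/|S| * E(Sv)."""
    n = sum(p + q for p, q in splits)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q)
                                  for p, q in splits)

# S = [9+, 5-]; Humidity: High [3+, 4-], Normal [6+, 1-];
# Wind: Weak [6+, 2-], Strong [3+, 3-];
# Outlook: Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]
print(round(gain((9, 5), [(3, 4), (6, 1)]), 3))          # 0.152 (slide: 0.151)
print(round(gain((9, 5), [(6, 2), (3, 3)]), 3))          # 0.048
print(round(gain((9, 5), [(2, 3), (4, 0), (3, 2)]), 3))  # 0.247
```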


Selecting the Next Attribute

S = [9+, 5-], E = 0.940.

Splitting on Outlook: Sunny gives [2+, 3-] with E = 0.971; Overcast gives [4+, 0-] with E = 0.0; Rain gives [3+, 2-] with E = 0.971.
Gain(S, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971 = 0.247

ID3 Algorithm

Splitting the root [D1, D2, ..., D14] = [9+, 5-] on Outlook:
- Sunny: Ssunny = [D1, D2, D8, D9, D11] = [2+, 3-]: still impure, split further
- Overcast: [D3, D7, D12, D13] = [4+, 0-]: pure, label Yes
- Rain: [D4, D5, D6, D10, D14] = [3+, 2-]: still impure, split further

For the Sunny branch (E(Ssunny) = 0.970):
Gain(Ssunny, Humidity) = 0.970 - (3/5)·0.0 - (2/5)·0.0 = 0.970
Gain(Ssunny, Temp) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)·1.0 - (3/5)·0.918 = 0.019


Decision Tree for PlayTennis

Outlook
├─ Sunny → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
    ├─ Strong → No
    └─ Weak → Yes

Can we fit what we learn into the framework?

| | Apriori | K-means | ID3 |
|---|---|---|---|
| task | rule/pattern discovery | clustering | classification |
| structure of the model or pattern | association rules | clusters | decision tree |
| search space | lattice of all possible combinations of items (size = 2^m) | choice of any k points as centers (size = infinity) | all possible decision trees (size = potentially infinity) |
| score function | support, confidence | square error | accuracy, information gain |
| search method | pruning | | |
| data management technique | TBD | TBD | TBD |


Components of data mining (II)

The components stack as layers, from top to bottom:
- Model enumeration algorithm
- Statistical score function
- Similarity/search function
- Database access method
- Database
Background knowledge

We assume you have some basic knowledge about data mining; the following slides will be very useful for this purpose:
• Association Rule Mining: http://www.comp.nus.edu.sg/~atung/Renmin56.pdf
• Classification and Regression: http://www.comp.nus.edu.sg/~atung/Renmin67.pdf
• Clustering: http://www.comp.nus.edu.sg/~atung/Renmin78.pdf


IT Trend

Processors are cheap and will become cheaper (multi-core processors, graphics cards). Storage will be cheap but might not be fast. Bandwidth will keep growing.

What can we do with this?
- Play more realistic games! Not exactly a joke, since any technology that speeds up games can also speed up scientific simulation.
- Smarter (more intensive) computation.
- Store more personal semantics/ontologies.
- People can collaborate more over the Internet (Flickr, Wikipedia) to make things more intelligent.

The AI dream now has the support of much better hardware. Essentially, data mining can be made much simpler for the man on the street. Data mining should be human-centered, not machine-centered.

What is complex data?

What is "simple" data? A regular table with a small number of attributes (of the same type) and no missing values.

What is complex data?
- High dimensional data: lots of attributes with different data types and with missing values (e.g. a patient table mixing test results like "Pos", "Neg" and "N.A.", numeric readings like 2.0 and -0.3, and symptoms like "Fever" and "Unconscious").
- Sequences / time series
- Trees
- Graphs

Why complex data?

- Complex data come naturally in many applications, bringing research nearer to the real world.
- Lots of challenges, which means more fun!

Some fundamental challenges:
- How do you compare complex objects effectively and efficiently?
- How do you find special subsets of the data that are interesting?
- What new types of models and score functions must you use?
- How do you handle noise and errors?

(Figure: a high-dimensional patient record, and two labeled trees T1 and T2 whose nodes a-e are arranged differently.)

Personalized Semantics for Personal Data Management

- Everyone will own terabytes of data soon.
- Improve the query/search interface by mining and extracting personalized semantics (entities and their relationships, etc.), comparing personal data against high-quality tagged databases.

(Figure: queries by documents, audio/music, video, and photographs/images are answered through a semantic layer built from high-quality data sources such as Wikipedia: singers, authors, actors/actresses, songs, papers, places, movies. The layer sits above the personal data: documents, audio/music, video, photographs/images, webpages/blogs/bookmarks.)


Integrated Approach to Mining Software Engineering Data

- Software engineering data: code bases, change history, bug reports, runtime traces.
- Integrated into a data warehouse to support decision making and mining.
- Example: Which code module should I modify to create a new function? Which module needs maintenance?

(Figure: software engineering tasks such as programming, defect detection, testing, debugging and maintenance are helped by data mining techniques such as classification, association/patterns and clustering, over a data warehouse built from code bases, change history, program states, structural entities, and bug reports/natural language.)

WikiScience

A collaborative platform for scientists to build scientific models/hypotheses and share data and applications.

(Figure: one user constructs Model A of the solar system, with a supporting dataset tagged to it; based on some articles, another user modifies Model A to create Model B, with supporting articles tagged to Model B; the system then constructs a centralized hybrid Model C from both.)


Hey, why not Cloud Computing, Map/Reduce?

• These are platforms for scaling up services to a large number of users on a large amount of data
• But what exactly do you want to scale up? Services that provide useful and semantically correct information to the users
• We have too many scalable data mining algorithms that find nothing, or too many things
• Let's focus on finding useful things first (assuming we have lots of processing power) and then try to scale up

Schedule of the Course

| Date/Time | Content |
|---|---|
| Lesson 1 | Introduction |
| Lesson 2 | Mining and Searching High Dimensional Data I |
| Lesson 3 | Mining and Searching High Dimensional Data II |
| Lesson 4 | Mining and Searching High Dimensional Data III |
| Lesson 5 | Similarity Search for Sequences and Trees I |
| Lesson 6 | Similarity Search for Sequences and Trees II |
| Lesson 7 | Similarity Search for Graphs I |
| Lesson 8 | Similarity Search for Graphs II |
| Lesson 9 | Similarity Search for Graphs III |
| Lesson 10 | Mining Massive Graphs I |
| Lesson 11 | Mining Massive Graphs II |
| Lesson 12 | Mining Massive Graphs III |


Focus of the course

• Techniques that can handle high dimensional, complex structures:
–Providing semantics to similarity search
–Shotgun and assembly: column/feature-wise processing using inverted indexes
–Row-wise enumeration
–Using local properties to infer global properties
• Throughout the course, please try to think of how these techniques are applicable across different types of complex structures

Database Queries

To start off, we will consider something very basic called ranking queries, since ranking is needed in any similarity search (usually from most similar to most dissimilar).

In a relational database, SQL returns all results in one go.
- How many tuples can be fitted on one screen?
- How many tuples can you remember?

Options:
- Summarize the results
- Display representative tuples

But how do we select representative tuples?

Retrieve Relevant Information

Search for videos related to the Shanghai Expo. Too many results: as long as you click "next", there are 20 more new results. Are we interested in all of them? No, only the most relevant ones. Search engines have to rank the results; this is also where they make their money.

Question: How to Select a Small Result Set

Selecting the most representative or most interesting results is not trivial.
- Find an apartment with rent cheaper than 1000, the cheaper the better: the result tuples can be sorted in ascending order of rental price, and those in front are more favorable.
- Find an apartment with rent cheaper than 1000 near NEU, the cheaper the better and the nearer the better: an apartment with lower rent may not be near, and a nearer one may not be cheap. Order by price? Order by distance?

Top-k Queries

- Define a scoring function, which maps a tuple to a real number, its score. The higher the score, the more favorable the tuple.
- Define an integer k.
- Answer: the k objects with the highest scores.
- Different scoring functions may give different top-k results:

| | Price | Distance to NEU |
|---|---|---|
| Apartment A | $800 | 500 meters |
| Apartment B | $1200 | 200 meters |

Given k = 1 (and lower values being better), if the scoring function is the sum of price and distance, the first tuple is better; if it is the product, the second tuple is better.

Brute Force Top-k

- Compute the score of each result tuple.
- Sort the tuples in descending order of score.
- Select the first k tuples.

Problems:
- What if the number of tuples is unlimited? Search engines can give an unlimited number of results.
- Even if the number of tuples is limited, it is too slow to compute a score for every tuple.
- We have to do it efficiently.
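A sketch of the brute-force approach, applied to the two-apartment example from the Top-k Queries slide. Since lower price and distance are more favorable there, the scoring functions negate them to fit the "higher score is better" convention:

```python
import heapq

def topk_bruteforce(tuples, score, k):
    """Brute force: compute the score of every tuple and keep the k
    tuples with the highest scores."""
    return heapq.nlargest(k, tuples, key=score)

# The apartment example: (name, price, distance to NEU)
apartments = [("A", 800, 500), ("B", 1200, 200)]
by_sum = topk_bruteforce(apartments, lambda t: -(t[1] + t[2]), 1)      # A wins
by_product = topk_bruteforce(apartments, lambda t: -(t[1] * t[2]), 1)  # B wins
```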


Outline

Two well-known top-k algorithms:
- Fagin's Algorithm (FA)
- The Threshold Algorithm (TA)

Taking random access into consideration:
- No Random Access Algorithm (NRA)
- The Combined Algorithm (CA)

Monotonicity

A scoring function f is monotone if f(x1,x2,...,xm) ≤ f(y1,y2,...,ym) whenever xi ≤ yi for every i.

Example: select the top-3 students with the highest total score in mathematics, physics and computer science:

```sql
select name, math+phys+comp as score
from student
order by score desc limit 3
```

sum is monotone: sum(x.math, x.phys, x.comp) ≤ sum(y.math, y.phys, y.comp) if x.math ≤ y.math, x.phys ≤ y.phys and x.comp ≤ y.comp.


Sorted Lists

We shall think of a database consisting of m sorted lists L1, L2, ..., Lm (here m = 3, each sorted in descending order of score):

| Lmath | | Lphys | | Lcomp | |
|---|---|---|---|---|---|
| Ann | 98 | Hugh | 97 | Kurt | 96 |
| Ben | 96 | Ryan | 94 | Ann | 95 |
| Kurt | 93 | Ann | 92 | Jane | 95 |
| Hugh | 91 | Kurt | 91 | Ben | 93 |
| Carl | 90 | Jane | 89 | Hugh | 92 |
| ... | ... | ... | ... | ... | ... |



Fagin's Algorithm (I)

Do sequential accesses on all lists in parallel until at least k objects have been seen in every list. With k = 3 on the lists above, sequential accesses stop after five rows, when 3 students (Ann, Hugh and Kurt) have been seen in all three lists.

Fagin's Algorithm (II)

For each object that has been seen in any list, do random accesses on the other lists to compute its full score. Here, random accesses need to be done for Ben, Carl, Jane and Ryan.


Fagin's Algorithm (III)

Select the k objects with the highest scores as the top-k result.

Why is FA correct? (I)

There are at least k objects seen on all attributes when sequential access stops. By monotonicity, an object that has not been seen at all cannot have a higher score than these k objects.


Why is FA correct? (II)

For every object that has been seen, either all its attributes have been seen, or random accesses are performed to obtain them. The k objects with the highest scores are therefore the top-k result.
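The phases of FA can be sketched as follows. The student lists reproduce the slide's table, extended past the "..." rows with assumed lower scores so that every random access can be answered:

```python
def fagin(lists, k):
    """Fagin's Algorithm over m lists of (name, score) pairs, each sorted
    in descending score order, with a monotone sum as scoring function.

    Phase 1: sequential accesses in parallel until at least k objects
             have been seen in every list.
    Phase 2: random accesses to fetch the missing attributes of every
             seen object.
    Phase 3: return the k objects with the highest total scores."""
    m = len(lists)
    lookup = [dict(l) for l in lists]      # simulates random access
    seen = {}                              # name -> set of list indices
    depth = 0
    while sum(len(ix) == m for ix in seen.values()) < k:
        for i, l in enumerate(lists):      # one sequential access per list
            name = l[depth][0]
            seen.setdefault(name, set()).add(i)
        depth += 1
    scores = {name: sum(lookup[i][name] for i in range(m)) for name in seen}
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

# Rows past the slide's "..." are assumed values added for illustration
Lmath = [("Ann", 98), ("Ben", 96), ("Kurt", 93), ("Hugh", 91), ("Carl", 90),
         ("Jane", 85), ("Ryan", 80)]
Lphys = [("Hugh", 97), ("Ryan", 94), ("Ann", 92), ("Kurt", 91), ("Jane", 89),
         ("Carl", 85), ("Ben", 84)]
Lcomp = [("Kurt", 96), ("Ann", 95), ("Jane", 95), ("Ben", 93), ("Hugh", 92),
         ("Ryan", 88), ("Carl", 80)]

top3 = fagin([Lmath, Lphys, Lcomp], 3)
# [('Ann', 285), ('Hugh', 280), ('Kurt', 280)]
```

On these lists, sequential access stops after five rows, exactly when Ann, Hugh and Kurt have been seen everywhere, matching the slide.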



The Threshold Algorithm (I)

Do sequential accesses on all lists in parallel. Whenever a new object is seen, do random accesses on the other lists to fetch its remaining attributes and compute its score. Here, random accesses are done on Ann, Hugh and Kurt first (row 1), then on Ben and Ryan (row 2).

The Threshold Algorithm (II)

Remember the k objects with the highest scores, together with their scores. After the random accesses on the first row:

Score(Ann) = 285
Score(Hugh) = 280
Score(Kurt) = 280


The Threshold Algorithm (III)

• Let the threshold value τ be the scoring function applied to the last seen values in all sorted lists.
• As soon as there are at least k objects with score at least τ, halt.

After each of the first three rows: τ(1) = 98+97+96 = 291; τ(2) = 96+94+95 = 285; τ(3) = 93+92+95 = 280. With k = 3, the algorithm halts at row 3, since Ann (285), Hugh (280) and Kurt (280) all have scores at least τ(3) = 280.

Why is TA correct?

• By monotonicity, an unseen object cannot have a score higher than τ.
• For every object that has been seen, random accesses were performed to compute its full score; the k objects with the highest scores are therefore the top-k result.
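A sketch of TA under a sum scoring function; the student lists reproduce the slide's table, extended past the "..." rows with assumed lower scores:

```python
import heapq

def threshold_algorithm(lists, k):
    """The Threshold Algorithm with a monotone sum as scoring function.
    Sequential accesses proceed row by row over all lists; every newly
    seen object is immediately random-accessed to obtain its full score.
    Halt once the buffer holds k objects scoring at least the threshold
    tau, the sum of the last seen values.  The buffer is bounded: just
    k (score, name) pairs kept in a min-heap."""
    m = len(lists)
    lookup = [dict(l) for l in lists]      # simulates random access
    top = []                               # min-heap of (score, name)
    scored = set()
    for depth in range(len(lists[0])):
        last = []
        for i, l in enumerate(lists):
            name, value = l[depth]
            last.append(value)
            if name not in scored:
                scored.add(name)
                total = sum(lookup[j][name] for j in range(m))
                if len(top) < k:
                    heapq.heappush(top, (total, name))
                else:
                    heapq.heappushpop(top, (total, name))
        tau = sum(last)                    # threshold after this row
        if len(top) == k and top[0][0] >= tau:
            break
    return sorted(top, reverse=True)

# Rows past the slide's "..." are assumed values added for illustration
Lmath = [("Ann", 98), ("Ben", 96), ("Kurt", 93), ("Hugh", 91), ("Carl", 90),
         ("Jane", 85), ("Ryan", 80)]
Lphys = [("Hugh", 97), ("Ryan", 94), ("Ann", 92), ("Kurt", 91), ("Jane", 89),
         ("Carl", 85), ("Ben", 84)]
Lcomp = [("Kurt", 96), ("Ann", 95), ("Jane", 95), ("Ben", 93), ("Hugh", 92),
         ("Ryan", 88), ("Carl", 80)]

result = threshold_algorithm([Lmath, Lphys, Lcomp], 3)
```

On these lists the run matches the slide: τ(1) = 291, τ(2) = 285, τ(3) = 280, and the algorithm halts after the third row.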


Comparing TA with FA

• Number of sequential accesses: TA stops no later than FA, because by the time FA stops its sequential accesses, τ is guaranteed to be no higher than the scores of the k objects seen on all sorted lists.
• Number of random accesses: TA requires up to m-1 random accesses for each seen object, but FA is expected to perform random accesses on more objects.
• Size of buffers used: the buffer used by FA can be unbounded; TA only needs to remember k objects with their k scores, plus the threshold value τ.



Random Access

- Random accesses may be impossible: in text retrieval, the sorted lists are the results of search engines.
- Random accesses are expensive: sequential accesses on disk are orders of magnitude faster than random accesses.
- We therefore need to consider not using random accesses at all, or using as few of them as possible.

No Random Access

Without random access, all we know about unseen values are upper bounds. After five rows of each list, Carl's scores in physics and computer science cannot be higher than the last seen values there, 89 and 92 respectively.


Lower and Upper Bounds

If an object has not been seen on one attribute:
- its lower bound for that attribute is 0
- its upper bound is the last seen value in that list

The lower bound of Carl's score in physics is 0; the upper bound is 89.

Worst and Best Scores (I)

- W(R): the worst possible score of tuple R (unseen attributes count as 0)
- B(R): the best possible score of tuple R (unseen attributes count as the last seen value)

After five rows: W(Carl) = 90 and B(Carl) = 90 + 89 + 92 = 271.
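The W and B bookkeeping can be sketched for a single object (a sum scoring function is assumed; the helper name is mine):

```python
def worst_best(seen, last_seen):
    """NRA bounds for one object under a sum scoring function.
    seen[i] is the object's value in list i, or None if not yet seen;
    last_seen[i] is the value at the current sequential-access depth.
    W counts unseen attributes as 0, B counts them as last_seen[i]."""
    w = sum(v for v in seen if v is not None)
    b = sum(v if v is not None else last_seen[i]
            for i, v in enumerate(seen))
    return w, b

# Carl after five rows: seen only in Lmath (90); last seen values 90, 89, 92
bounds = worst_best([90, None, None], [90, 89, 92])  # (90, 271)
```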


Worst and Best Scores (II)

W(R) ≤ score of R ≤ B(R). W(R) and B(R) get updated as the object's values are sequentially accessed. After the first row of each list:

| | Ann | Hugh | Kurt |
|---|---|---|---|
| W | 98 | 97 | 96 |
| B | 291 | 291 | 291 |

Worst and Best Scores (II), continued

After the second row:

| | Ann | Hugh | Kurt | Ben | Ryan |
|---|---|---|---|---|---|
| W | 98→193 | 97 | 96 | 96 | 94 |
| B | 291→287 | 291→288 | 291→286 | 285 | 285 |


Worst and Best Scores (II), continued

After the third row:

| | Ann | Hugh | Kurt | Ben | Ryan | Jane |
|---|---|---|---|---|---|---|
| W | 193→285 | 97 | 96→189 | 96 | 94 | 95 |
| B | 287→285 | 288→285 | 286→281 | 285→283 | 285→282 | 280 |

Outline

Two well-known top-k algorithms
Fagin's Algorithm (FA)
The Threshold Algorithm (TA)
Take random access into consideration
No Random Access Algorithm (NRA)
The Combined Algorithm (CA)


No Random Access Algorithm (I)
Maintain the last-seen values x1, x2, …, xm
For every seen object, maintain its worst possible score together with its
known attributes and their values

Math           Physics        Computing
Ann     98     Hugh    97     Kurt    96
Ben     96     Ryan    94     Ann     95
Kurt    93     Ann     92     Jane    95
Hugh    91     Kurt    91     Ben     93
Carl    90     Jane    89     Hugh    92
...     ...    ...     ...    ...     ...

After two rows: xMath = 96; xPhys = 94; xComp = 95
Ann : 193 : { <Math:98>; <Comp:95> }

No Random Access Algorithm (II)
Why not maintain the best possible score for each object?

Math           Physics        Computing
Ann     98     Hugh    97     Kurt    96
Ben     96     Ryan    94     Ann     95
Kurt    93     Ann     92     Jane    95
Hugh    91     Kurt    91     Ben     93
Carl    90     Jane    89     Hugh    92
...     ...    ...     ...    ...     ...

        Ann        Hugh      Kurt      Ben       Ryan      Jane
W       193→285    97        96→189    96        94        95
B       287→285    288→285   286→281   285→283   285→282   280

Too Frequently Updated!


No Random Access Algorithm (III)
Let M be the kth largest W value
An object R is viable if B(R) ≥ M

Math           Physics        Computing
Ann     98     Hugh    97     Kurt    96
Ben     96     Ryan    94     Ann     95
Kurt    93     Ann     92     Jane    95
Hugh    91     Kurt    91     Ben     93
Carl    90     Jane    89     Hugh    92
...     ...    ...     ...    ...     ...

After the fourth row of each list (k = 3):

        Ann    Hugh      Kurt      Ben       Ryan      Jane
W       285    97→188    189→280   96→189    94        95           M = 189
B       285    285→280   281→280   283→280   282→278   280→277

No Random Access Algorithm (III)
Let M be the kth largest W value
An object R is viable if B(R) ≥ M

Math           Physics        Computing
Ann     98     Hugh    97     Kurt    96
Ben     96     Ryan    94     Ann     95
Kurt    93     Ann     92     Jane    95
Hugh    91     Kurt    91     Ben     93
Carl    90     Jane    89     Hugh    92
...     ...    ...     ...    ...     ...

After the fifth row of each list:

        Ann    Hugh      Kurt    Ben       Ryan      Jane
W       285    188→280   280     189       94        95→184       M = 280
B       285    285→280   280     280→278   278→276   277→274


No Random Access Algorithm (IV)
Let set T contain the objects with W(R) ≥ M
Halt when
  there are at least k objects seen on all sorted lists, and
  no viable object is left outside set T

        Ann    Hugh      Kurt    Ben       Ryan      Jane
W       285    188→280   280     189       94        95→184       M = 280
B       285    285→280   280     280→278   278→276   277→274

T = {Ann, Hugh, Kurt}

Why is NRA correct?

W(R) ≤ Score of R ≤ B(R) always holds
If an object R is not viable, then Score of R ≤ B(R) < M, so there are at
least k objects with scores no lower than R's
Therefore, if there is no viable object outside T and T contains at least
k objects, T is the top-k result
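Putting the pieces together, the loop below is our illustrative reconstruction of NRA, not code from the lecture: the sorted lists are held in memory, the aggregation function is the sum, and tie-breaking among equal worst scores is ignored.

```python
from collections import defaultdict

def nra(lists, k):
    """lists: m lists of (object, value) pairs, each sorted by value
    descending. Returns the top-k set by sum of attribute values."""
    m = len(lists)
    seen = defaultdict(dict)                # object -> {list index: value}
    top = []
    for depth in range(max(len(l) for l in lists)):
        for i, l in enumerate(lists):       # one sorted access per list
            if depth < len(l):
                obj, val = l[depth]
                seen[obj][i] = val
        last = [l[min(depth, len(l) - 1)][1] for l in lists]
        W = {o: sum(a.values()) for o, a in seen.items()}
        B = {o: W[o] + sum(last[i] for i in range(m) if i not in a)
             for o, a in seen.items()}
        top = sorted(W, key=W.get, reverse=True)[:k]
        if len(top) == k:
            M = W[top[-1]]                  # kth largest worst score
            # Halt: no seen object outside T is viable, and no object we
            # have never seen can reach M either (its bound is sum(last)).
            if (sum(last) < M
                    and all(B[o] < M for o in seen if o not in top)):
                return set(top)
    return set(top)

math_ = [("Ann", 98), ("Ben", 96), ("Kurt", 93), ("Hugh", 91), ("Carl", 90)]
phys  = [("Hugh", 97), ("Ryan", 94), ("Ann", 92), ("Kurt", 91), ("Jane", 89)]
comp  = [("Kurt", 96), ("Ann", 95), ("Jane", 95), ("Ben", 93), ("Hugh", 92)]
print(nra([math_, phys, comp], 3))   # the slides' example: T = {Ann, Hugh, Kurt}
```

On this example the halting condition is only satisfied after the fifth (last) row of each list, which is exactly the behaviour the running tables above show.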


Comparing NRA with TA

• Number of sequential accesses
  NRA performs at least as many sequential accesses as the deepest
  position of a top-k result in any of the sorted lists
• Number of random accesses
  NRA performs none
• Size of buffers used
  TA remembers k objects with their k scores, plus the threshold value τ
  NRA remembers every viable object with its scores on all seen
  attributes, plus the last-seen value of every attribute

How deep can NRA go?

Math           Physics        Computing
Ann     98     Hugh    97     Kurt    96
Hugh    97     Kurt    96     Ann     95
Ben     60     Ryan    60     Jane    60
Ryan    60     Ben     60     Ben     60
Carl    60     Jane    60     Carl    60
...     ...    ...     ...    ...     ...
Jane    60     Carl    60     Ryan    60
Kurt    0      Ann     0      Hugh    0

The set T can be identified quickly, but its members' scores only become
certain at the very end of the lists
If we allow a relatively small number of random accesses, scanning the
entire lists can be avoided


Outline

Two well-known top-k algorithms
Fagin's Algorithm (FA)
The Threshold Algorithm (TA)
Take random access into consideration
No Random Access Algorithm (NRA)
The Combined Algorithm (CA)

The Combined Algorithm (I)

CA combines TA and NRA
cR : the cost of a random access
cS : the cost of a sequential access
h = ⌊cR / cS⌋
Run NRA, but every h sequential steps perform random accesses, as in TA
h = ∞ → never do a random access; CA is then simply NRA
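As a small illustration of the schedule (the floor expression follows the Fagin et al. paper cited at the end of this chapter; the function name is ours):

```python
def h_interval(c_random, c_sequential):
    """CA performs one random-access phase every h sequential rounds."""
    return c_random // c_sequential      # h = floor(cR / cS)

# If a random access costs 1000 and a sequential access costs 10, CA spends
# random accesses only once every 100 rounds of the NRA-style scan.
print(h_interval(1000, 10))              # -> 100
```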


The Combined Algorithm (II)

Math           Physics        Computing
Ann     98     Hugh    97     Kurt    96
Hugh    97     Kurt    96     Ann     95
Ben     60     Ryan    60     Jane    60
Ryan    60     Ben     60     Ben     60
Carl    60     Jane    60     Carl    60
...     ...    ...     ...    ...     ...
Jane    60     Carl    60     Ryan    60
Kurt    0      Ann     0      Hugh    0

Random accesses for Ann, Hugh and Kurt quickly pin down their exact
scores, so the lists need not be scanned to the bottom

The Combined Algorithm (III)

In CA, by doing random accesses, we wish to either
  confirm that an object is a top-k result, or
  prune a viable object
As the number of random accesses in CA is limited, various heuristics can
be used to optimize CA in terms of total cost


Reference
• Ronald Fagin, Amnon Lotem, Moni Naor: Optimal
aggregation algorithms for middleware. J. Comput. Syst.
Sci. 66(4): 614-656 (2003)

Mining and Searching Complex Structures                           Chapter 2 High Dimensional Data

Mining and Searching Complex
Structures
High Dimensional Data
Anthony K. H. Tung(鄧锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung

Outline

• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
–Similarity Function on k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric
Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data


Sources of High Dimensional Data

•   Microarray gene expression
•   Text documents
•   Images
•   Features of Sequences, Trees and Graphs
•   Audio, Video, Human Motion Database (spatio-
temporal as well!)


Challenges of High Dimensional Data
• Indistinguishability
–The distance between the two nearest points and the two furthest points
can be almost the same
• Sparsity
–As a result of the above, the data distribution is very sparse, giving
no obvious indication of where the interesting knowledge lies
• Large number of combinations
–Efficiency: how do we test such a huge number of combinations?
–Effectiveness: how do we understand and interpret so many
combinations?


Outline

• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
–Similarity Function on k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric
Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data


•   Objects represented by multidimensional vectors

2596             51           3            221                    232                    148
…

•   The traditional approach to similarity search: kNN query
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
ID         d1          d2       d3    d4       d5          d6       d7         d8         d9      d10   Dist

P1         1.1            1    1.2    1.6     1.1          1.6      1.2        1.2        1        1    0.93

P2         1.4         1.4     1.4    1.5     1.4          1        1.2        1.2        1        1    0.98
P3         1              1     1      1          1        1        2          1          2        2    1.73
P4         20          20       21    20       22          20       20         19         20      20    57.7

P5         19          21       20    20       20          21       18         20         22      20    60.5

P6         21          21       18    19       20          19       21         20         20      20    59.8


•     Deficiencies
–Distance is affected by a few dimensions with high dissimilarity
–Partial similarities cannot be discovered

•     The traditional approach to similarity search: kNN query
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 )
(one attribute of each of P1, P2 and P3 has been changed to 100)

ID     d1     d2     d3     d4     d5     d6     d7     d8     d9    d10    Dist
P1     1.1    100    1.2    1.6    1.1    1.6    1.2    1.2    1     1      99.0
P2     1.4    1.4    1.4    1.5    1.4    100    1.2    1.2    1     1      99.0
P3     1      1      1      1      1      1      2      100    2     2      99.0
P4     20     20     21     20     22     20     20     19     20    20     57.7
P5     19     21     20     20     20     21     18     20     22    20     60.5
P6     21     21     18     19     20     19     21     20     20    20     59.8


Thoughts

• Aggregating too many dimensional differences into a single value
results in too much information loss. Can we try to reduce that loss?
• While high dimensional data typically give us problems when it comes
to similarity search, can we turn what is against us to our advantage?
• Our approach: since we have so many dimensions, we can compute more
complex statistics over these dimensions to overcome some of the
"noise" introduced by the scaling of dimensions, outliers, etc.


The N-Match Query : Warm-Up

•    Description
–Matches between two objects in n dimensions (n ≤ d)
–The n dimensions are chosen dynamically to make the two objects match best

•    How to define a “match”
–Exact match
–Match with tolerance δ

•    The similarity search example
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 )        n = 6

ID     d1     d2     d3     d4     d5     d6     d7     d8     d9    d10    Dist
P1     1.1    100    1.2    1.6    1.1    1.6    1.2    1.2    1     1      0.2
P2     1.4    1.4    1.4    1.5    1.4    100    1.2    1.2    1     1      0.4
P3     1      1      1      1      1      1      2      100    2     2      0
P4     20     20     21     20     22     20     20     19     20    20     19
P5     19     21     20     20     20     21     18     20     22    20     19
P6     21     21     18     19     20     19     21     20     20    20     19

The N-Match Query : The Definition

•    The n-match difference
Given two d-dimensional points P(p1, p2, …, pd) and Q(q1, q2, …, qd), let
δi = |pi - qi|, i = 1, …, d. Sort the array {δ1, …, δd} in increasing
order and let the sorted array be {δ1’, …, δd’}. Then δn’ is the n-match
difference between P and Q.

•    The n-match query
Given a d-dimensional database DB, a query point Q and an integer n
(n ≤ d), find the point P ∈ DB that has the smallest n-match difference
to Q. P is called the n-match of Q.

[Figure: points A–E around a query Q in 2-D; the 1-match of Q is A and
the 2-match of Q is B]

•    The similarity search example
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 )        n = 6

ID     d1     d2     d3     d4     d5     d6     d7     d8     d9    d10    Dist
P1     1.1    100    1.2    1.6    1.1    1.6    1.2    1.2    1     1      0.2
P2     1.4    1.4    1.4    1.5    1.4    100    1.2    1.2    1     1      0.4
P3     1      1      1      1      1      1      2      100    2     2      0
P4     20     20     21     20     22     20     20     19     20    20     19
P5     19     21     20     20     20     21     18     20     22    20     19
P6     21     21     18     19     20     19     21     20     20    20     19

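The n-match difference definition translates directly into a few lines of code (a sketch for numeric attributes only; the names are ours):

```python
def n_match_difference(p, q, n):
    """The n-th smallest per-dimension absolute difference (delta_n')."""
    diffs = sorted(abs(pi - qi) for pi, qi in zip(p, q))
    return diffs[n - 1]

q  = (1,) * 10
p3 = (1, 1, 1, 1, 1, 1, 2, 100, 2, 2)      # P3 from the table above
# Six dimensions of P3 match Q exactly, so its 6-match difference is 0,
# even though its kNN distance to Q is dominated by the value 100.
print(n_match_difference(p3, q, 6))        # -> 0
```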

The N-Match Query : Extensions

•    The k-n-match query
Given a d-dimensional database DB, a query point Q, an integer k, and an
integer n, find a set S which consists of k points from DB so that for
any point P1 ∈ S and any point P2 ∈ DB-S, P1’s n-match difference is
smaller than P2’s n-match difference. S is called the k-n-match of Q.

•    The frequent k-n-match query
Given a d-dimensional database DB, a query point Q, an integer k, and an
integer range [n0, n1] within [1, d], let S0, …, Si be the answer sets of
the k-n0-match, …, k-n1-match, respectively. Find a set T of k points, so
that for any point P1 ∈ T and any point P2 ∈ DB-T, P1’s number of
appearances in S0, …, Si is larger than or equal to P2’s number of
appearances in S0, …, Si.

[Figure: the same 2-D example; the 2-1-match of Q is {A, D} and the
2-2-match of Q is {A, B}]

•    The similarity search example
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 )        n = 6

ID     d1     d2     d3     d4     d5     d6     d7     d8     d9    d10    Dist
P1     1.1    100    1.2    1.6    1.1    1.6    1.2    1.2    1     1      0.2
P2     1.4    1.4    1.4    1.5    1.4    100    1.2    1.2    1     1      0.4
P3     1      1      1      1      1      1      2      100    2     2      0
P4     20     20     21     20     22     20     20     19     20    20     19
P5     19     21     20     20     20     21     18     20     22    20     19
P6     21     21     18     19     20     19     21     20     20    20     19

Cost Model

•      The multiple system information retrieval model
–Objects are stored in different systems and scored by each system
–Each system can sort the objects according to their scores
–A query retrieves the scores of objects from different systems and then
combines them using some aggregation function

Q : color=“red” & shape=“round” & texture=“cloud”

System 1: Color          System 2: Shape          System 3: Texture
Object ID    Score       Object ID    Score       Object ID    Score
1            0.4         1            1.0         1            1.0
2            2.8         5            1.5         2            2.0
5            3.5         2            5.5         3            5.0
3            6.5         3            7.8         5            8.0
4            9.0         4            9.0         4            9.0

•      The cost
–Retrieval of scores – proportional to the number of scores retrieved

•      The goal
–To minimize the number of scores retrieved


•       The AD algorithm for the k-n-match query
–Locate the query’s attribute value in every dimension
–Retrieve the objects’ attribute values around the query’s attribute
value in both directions
–The objects’ attribute values are retrieved in Ascending order of their
Differences to the query’s attribute values. An n-match is found when an
object has appeared n times.

2-2-match of Q : Q = ( 3.0 , 7.0 , 4.0 )

System 1 (d1)            System 2 (d2)            System 3 (d3)
Object ID    Attr        Object ID    Attr        Object ID    Attr
1            0.4         1            1.0         1            1.0
2            2.8         5            1.5         2            2.0
5            3.5         2            5.5         3            5.0
3            6.5         3            7.8         5            8.0
4            9.0         4            9.0         4            9.0

Auxiliary structures:
–Next attribute to retrieve, g[2d]: per dimension, the next (object,
difference) pair in each direction, e.g. d1: (2, 0.2), (5, 0.5),
(1, 2.6), (3, 3.5), …; d2: (3, 0.8), (2, 1.5), …; d3: (3, 1.0),
(2, 2.0), …
–Number of appearances, appear[c], for each object c

Retrieving in globally ascending order of difference – 0.2 (object 2 on
d1), 0.5 (object 5 on d1), 0.8 (object 3 on d2), 1.0 (object 3 on d3),
1.5 (object 2 on d2) – objects 3 and 2 each reach 2 appearances, so the
2-2-match of Q is { 3, 2 }

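The retrieval order used by AD can be mimicked with a priority queue. This is a simplified sketch, not the disk-based algorithm: it materializes every (object, difference) pair up front, whereas AD pulls them incrementally through the g[2d] pointers.

```python
import heapq
from collections import Counter

def k_n_match(db, q, k, n):
    """db: {object id: tuple of attribute values}; q: query point."""
    heap = [(abs(attrs[d] - q[d]), obj, d)
            for obj, attrs in db.items() for d in range(len(q))]
    heapq.heapify(heap)                    # ascending order of difference
    appear, answers = Counter(), []
    while heap and len(answers) < k:
        diff, obj, d = heapq.heappop(heap)
        appear[obj] += 1
        if appear[obj] == n:               # n-th appearance -> an n-match
            answers.append(obj)
    return answers

db = {1: (0.4, 1.0, 1.0), 2: (2.8, 5.5, 2.0), 3: (6.5, 7.8, 5.0),
      4: (9.0, 9.0, 9.0), 5: (3.5, 1.5, 8.0)}
print(k_n_match(db, (3.0, 7.0, 4.0), 2, 2))    # -> [3, 2], as in the example
```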

•     The AD algorithm for the frequent k-n-match query
–The frequent k-n-match query
• Given an integer range [n0, n1], find k-n0-match, k-(n0+1)-match, ... , k-n1-
match of the query, S0, S1, ... , Si.
• Find k objects that appear most frequently in S0, S1, ... , Si.

–Retrieve the same number of attributes as processing a k-n1-match query.

•     Disk based solutions for the (frequent) k-n-match query

• Sort each dimension and store them sequentially on the disk
• When reaching the end of a disk page, read the next page from disk

–Existing indexing techniques
• Tree-like structures: R-trees, k-d-trees
• Mapping based indexing: space-filling curves, iDistance
• Sequential scan
• Compression based approach (VA-file)


Experiments : Effectiveness
•   Searching by k-n-match
–COIL-100 database
–54 features extracted, such as color histograms, area moments

k-n-match query, k=4              kNN query
n     Images returned             k      Images returned
5     36, 42, 78, 94              10     13, 35, 36, 40, 42,
10    27, 35, 42, 78                     64, 85, 88, 94, 96
15    3, 38, 42, 78
20    27, 38, 42, 78
25    35, 40, 42, 94
30    10, 35, 42, 94
35    35, 42, 94, 96
40    35, 42, 94, 96
45    35, 42, 94, 96
50    35, 42, 94, 96

•   Searching by frequent k-n-match
–UCI Machine Learning Repository
–Competitors: IGrid, Human-Computer Interactive NN search (HCINN)

Data sets (d)         IGrid    HCINN    Freq. k-n-match
Ionosphere (34)       80.1%    86%      87.5%
Segmentation (19)     79.9%    83%      87.3%
Wdbc (30)             87.1%    N.A.     92.5%
Glass (9)             58.6%    N.A.     67.8%
Iris (4)              88.9%    N.A.     89.6%

Experiments : Efficiency
•   Disk based algorithms for the frequent k-n-match query
–Texture dataset (68,040 records); uniform dataset (100,000 records)
–Competitors:
• VA-file
• Sequential scan

Experiments : Efficiency (continued)
•   Comparison with other similarity search techniques
–Texture dataset ; synthetic dataset
–Competitors:
• Frequent k-n-match query using the AD algorithm
• IGrid
• Sequential scan


Future Work(I)
• We now have a natural way to handle similarity search for data with
both categorical and numerical attributes. Investigating k-n-match
performance on such mixed-type data is currently under way
• Likewise, applying k-n-match to data with missing or uncertain
attributes will be interesting
• Query={1,1,1,1,1,1,1,M,No,R}
ID      d1      d2      d3      d4     d5        d6       d7   d8    d9   d10

P1      1.1     1      1.2     1.6     1.1       1.6     1.2   M    Yes   R

P2      1.4     1.4    1.4     1.5     1.4       1       1.2   F    No    B

P3      1       1       1       1       1        1        2    M    No    B

P4      20      20      21      20     22        20       20   M    Yes   G

P5      19      21      20      20     20        21       18   F    Yes   R

P6      21      21      18      19     20        19       21   F    Yes   Y


Future Work(I)

• We now have a natural way to handle similarity search for data with
both categorical and numerical attributes. Investigating k-n-match
performance on such mixed-type data is currently under way
• Likewise, applying k-n-match to data with missing or uncertain
attributes will be interesting
• Query={1,1,1,1,1,1,1,M,No,R}
ID   d1    d2    d3     d4     d5        d6       d7   d8    d9   d10

P1         1     1.2    1.6    1.1       1.6     1.2   M          R

P2   1.4   1.4          1.5              1       1.2   F    No    B

P3    1    1      1      1      1                 2    M    No    B

P4   20    20           20     22        20       20   M          G

P5   19    21    20     20     20                 18        Yes   R

P6   21          18            20                 21   F    Yes   Y


Future Work(II)
• In general, three things affect the result of a similarity search:
noise, scaling and axis orientation. K-n-match reduces the effect of
noise. The ultimate aim is a similarity function that is robust to
noise, scaling and axis orientation
• Eventually we will look at creating mining algorithms based on
k-n-match


Outline

• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
–Similarity Function on k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric
Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data


Motivation

[Figure: a query posed against large data sets returns results]

Ever-increasing data collection rates of modern enterprises drive the
need for effective, guaranteed-quality compression.
Concern: compress as much as possible.


Conventional Compression Method
• Try to find the optimal encoding of arbitrary strings for
the input data:
–Huffman Coding
–Lempel-Ziv Coding (gzip)
• View the whole table as a large byte string
• Statistical or dictionary based
• Operate at the byte level


Why not just “syntactic”?

• Does not exploit the complex dependency patterns in the table
• Individual retrieval of tuples is difficult
• Does not utilize lossy compression


Semantic compression methods

• Derive a descriptive model M
• Identify the data values which can be derived from M (within some
error tolerance), which are essential for deriving others, and which
are the outliers
• Derived values need not be stored; only the outliers do


• More Complex Analysis
–Example: detect correlation among columns
• Fast Retrieval
–Tuple-wise access
• Query Enhancement
–Possible to answer queries directly from the discovered semantics
–Compress in a way which enhances the answering of some complex
queries, e.g. “Go Green: Recycle and Reuse Frequent Patterns”, C. Gao,
B. C. Ooi, K. L. Tan and A. K. H. Tung. ICDE’2004.

Choose a combination of compression methods based on semantic and
syntactic information

Fascicles
• Key observation
–Often, numerous subsets of records in T have similar values for many
attributes

Protocol    Duration    Bytes    Packets
http        12          20K      3
http        16          24K      5
http        15          20K      8
http        19          40K      11
http        26          58K      18
ftp         27          100K     24
ftp         32          300K     35
ftp         18          80K      15

• Compress data by storing representative values (e.g., “centroid”)
only once for each attribute cluster
• Lossy compression: information loss is controlled by the notion of
“similar values” for attributes (user-defined)

ItCompress: Compression Format

Original Table:
age   salary   credit   sex
20    30k      poor     M
25    76k      good     F
30    90k      good     F
40    100k     poor     M
50    110k     good     F
60    50k      good     M
70    35k      poor     F
75    15k      poor     M

Representative Rows (Patterns):
RRid   age   salary   credit   sex
1      30    90k      good     F
2      70    35k      poor     M

Compressed Table:
RRid   bitmap   Outlying values
2      0111     20
1      1111
1      1111
1      0100     40, poor, M
1      0111     50
1      0010     60, 50k, M
2      1110     F
2      1111

Error Tolerance:
age   salary   credit   sex
5     25k      0        0


Some definitions

• Error tolerance
–Numeric attributes
• The upper bound on how much the compressed value x’ may differ from
the actual value x
• x ∈ [ x’-ei, x’+ei ]
–Categorical attributes
• The upper bound on the probability that the compressed value differs
from the actual value
• Given an actual value x and its error tolerance ei, the compressed
value x’ should satisfy: Prob( x=x’ ) ≥ 1 - ei

Some definitions

• Coverage
–Let R be a row in the table T, and Pi be a pattern
–The coverage of Pi on R:
cov( Pi , R ) = number of attributes Xi in which R[ Xi ] is matched by
Pi [ Xi ]
• Total coverage
–Let P be a set of patterns P1, …, Pk, and let the table T contain n
rows R1, …, Rn
–totalcov( P, T ) = Σ i=1..n cov( Pmax(Ri), Ri ), where Pmax(Ri) is the
pattern in P with the largest coverage of Ri

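Both definitions transcribe directly into code. In this sketch (names are ours, and salaries are written in thousands) numeric attributes use their tolerance while categorical attributes require an exact match, which matches the running example where their tolerance is 0.

```python
def matched(value, pattern_value, tol):
    if isinstance(value, (int, float)):
        return abs(value - pattern_value) <= tol   # numeric tolerance
    return value == pattern_value                  # categorical, e = 0

def cov(pattern, row, tols):
    """Number of attributes of `row` matched by `pattern`."""
    return sum(matched(v, p, t) for v, p, t in zip(row, pattern, tols))

def totalcov(patterns, table, tols):
    """Sum over all rows of the best coverage by any pattern (Pmax)."""
    return sum(max(cov(p, r, tols) for p in patterns) for r in table)

# Row (20, 30k, poor, M) against pattern 2 = (70, 35k, poor, M): only the
# age falls outside its tolerance, so the coverage is 3 (bitmap 0111).
print(cov((70, 35, "poor", "M"), (20, 30, "poor", "M"), (5, 25, 0, 0)))  # -> 3
```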


ItCompress: basic algorithm

• First randomly choose k rows as initial patterns
• Phase 1 – scan the table T:
–For each row R, compute the coverage of each pattern on it and find
Pmax(R)
–Allocate R to its best-covering pattern
• Phase 2 – after each scan, re-compute each pattern’s attributes,
always using the most frequent values among its allocated rows
• Iterate until the total coverage no longer increases
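The two-phase loop can be sketched as follows. This is a simplified reconstruction, not the paper's exact algorithm: matching reuses the per-attribute tolerances, and Phase 2 recomputes each pattern attribute as the plain mode of its group rather than the most frequent matching value.

```python
import random
from collections import Counter

def itcompress(table, tols, k, seed=0):
    patterns = random.Random(seed).sample(table, k)   # k random initial rows
    def cov(p, r):     # attributes of row r matched by pattern p
        return sum((abs(rv - pv) <= t) if isinstance(rv, (int, float))
                   else (rv == pv) for pv, rv, t in zip(p, r, tols))
    prev = -1
    while True:
        # Phase 1: allocate every row to its best-covering pattern
        groups = [[] for _ in range(k)]
        for r in table:
            groups[max(range(k), key=lambda i: cov(patterns[i], r))].append(r)
        total = sum(cov(p, r) for g, p in zip(groups, patterns) for r in g)
        if total <= prev:                  # total coverage stopped increasing
            return patterns
        prev = total
        # Phase 2: recompute each pattern from its group's attribute modes
        for i, g in enumerate(groups):
            if g:
                patterns[i] = tuple(Counter(col).most_common(1)[0][0]
                                    for col in zip(*g))

table = [(20, 30, "poor", "M"), (25, 76, "good", "F"), (30, 90, "good", "F"),
         (40, 100, "poor", "M"), (50, 110, "good", "F"), (60, 50, "good", "M"),
         (70, 35, "poor", "F"), (75, 15, "poor", "M")]  # salaries in thousands
patterns = itcompress(table, (5, 25, 0, 0), k=2)
```

Because the total coverage is bounded and the loop only continues while it strictly increases, the iteration always terminates, which is the convergence argument made on the next slides.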


Example: the 1st iteration begins

Original table:
age   salary   credit   sex
20    30k      poor     M
25    76k      good     F
30    90k      good     F
40    100k     poor     M
50    110k     good     F
60    50k      good     M
70    35k      poor     F
75    15k      poor     M

Initial patterns (randomly chosen rows):
RRid   age   salary   credit   sex
1      20    30k      poor     M
2      25    76k      good     F

Error Tolerance:
age   salary   credit   sex
5     25k      0        0



Example: Phase 1

Current patterns:
RRid   age   salary   credit   sex
1      20    30k      poor     M
2      25    76k      good     F

Rows allocated to pattern 1:
age   salary   credit   sex
20    30k      poor     M
40    100k     poor     M
60    50k      good     M
70    35k      poor     F
75    15k      poor     M

Rows allocated to pattern 2:
age   salary   credit   sex
25    76k      good     F
30    90k      good     F
50    110k     good     F

Error Tolerance:
age   salary   credit   sex
5     25k      0        0
Mining and Searching Complex Structures

Example: Phase 2

Updated patterns (per attribute, the most frequent value among the assigned rows):
  Rid  age  salary  credit  sex
  1    70   30k     poor    M
  2    25   90k     good    F

Rows assigned to pattern 1:
  age  salary  credit  sex
  20   30k     poor    M
  40   100k    poor    M
  60   50k     good    M
  70   35k     poor    F
  75   15k     poor    M

Rows assigned to pattern 2:
  age  salary  credit  sex
  25   76k     good    F
  30   90k     good    F
  50   110k    good    F

Error tolerance: age 5, salary 25k, credit 0, sex 0.
Convergence (I)
• Phase 1:
–When we assign each row to the pattern that covers it best, the coverage of each row either increases or stays the same,
so the total coverage also increases or stays the same
• Phase 2:
–When we re-compute the attribute values for the patterns, the coverage of each pattern either increases or stays the same,
so the total coverage also increases or stays the same
Convergence (II)
• In both Phase 1 and Phase 2 the total coverage either increases or stays the same, and it has an obvious upper bound (covering the whole table)

 The algorithm will therefore converge eventually
Complexity
• Phase 1:
–In each of l iterations, we go through the n rows of the table and match each row against the k patterns (2m comparisons per match),
so the running time is O(kmnl), where m is the number of attributes
• Phase 2:
–Computing each new pattern Pi requires going through all the domain values/intervals of each attribute
Assuming the total number of domain values/intervals is d, the running time is O(kdl)

 The total time complexity is O(kmnl + kdl)
• Simplicity and directness
–Fascicles and SPARTAN use a two-phase process:
• Find rules/patterns
• Compress the database using the discovered rules/patterns
–ItCompress optimizes the compression directly, without finding rules/patterns that may not be useful (a.k.a. the microeconomic approach)
• Fewer constraints
–Patterns need not be matched completely, and rules need not apply globally
• Easily tuned parameters
Performance Comparison
• Algorithms
–ItCompress, ItCompress+gzip
–Fascicles, Fascicles+gzip
–SPARTAN+gzip
• Platform
–ItCompress, Fascicles: AMD Duron 700MHz, 256MB memory
–SPARTAN: four 700MHz Pentium CPUs, 1GB memory
• Datasets
–Corel: 32 numeric attributes, 35,000 rows, 10.5MB
–Census: 7 numeric, 7 categorical, 676,000 rows, 28.6MB
–Forest-cover: 10 numeric, 44 categorical, 581,000 rows, 75.2MB
Effectiveness (Corel)
Effectiveness (Census)
Effectiveness (Forest Cover)
Efficiency
Varying k
Varying Sample Ratio
Effect of Corruption

[Figure: compression quality per attribute (A1–A12) when 20% of the data values are corrupted]
Findings
• ItCompress is
–More efficient than SPARTAN
–More effective than Fascicles
–Insensitive to parameter settings
–Robust to noise
Future work

• Can we perform mining on the compressed dataset using only the patterns and the bitmap?
–Example: building a Bayesian belief network
• Is ItCompress a good “bootstrap” semantic compression algorithm?

[Diagram: database → ItCompress → compressed database → other semantic compression algorithms]
Outline

• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
–Similarity function: k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distance
• Earth Mover’s Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Metric vs. Non-Metric
• Euclidean distance dominates DB queries
• Similarity in human perception, however, is often non-metric

• Metric distance is not enough!
Bregman Divergence

[Figure: a convex function f(x) with the points (q, f(q)) and (p, f(p)); the Bregman divergence Df(p,q) is the vertical gap at p between f and the tangent of f at q, contrasted with the Euclidean distance between p and q]
Bregman Divergence
• Mathematical interpretation
–The distance between p and q is defined as the difference between f(p) and the first-order Taylor expansion of f at q, evaluated at p:

Df(p,q) = f(p) − f(q) − ⟨∇f(q), p − q⟩
Bregman Divergence
• General Properties
–Non-Negativity
• Df(p,q)≥0 for any p, q
–Identity of indiscernibles
• Df(p,p) = 0 for any p
–Symmetry and triangle inequality
• Do NOT hold in general
Examples

  Distance                f(x)               Df(p,q)                          Usage
  KL-divergence           x log x            p log(p/q)                       distributions, color histograms
  Itakura-Saito distance  −log x             p/q − log(p/q) − 1               signals, speech
  Squared Euclidean       x²                 (p−q)²                           Euclidean space
  von Neumann entropy     tr(X log X − X)    tr(X log X − X log Y − X + Y)    symmetric matrices
Why in a DB system?
• Database application
–Retrieval of similar images, speech signals, or time series
–Optimization on matrices in machine learning
–Efficiency is important!
• Query Types
–Nearest Neighbor Query
–Range Query
Euclidean Space
• How to answer the queries
–R-Tree
Euclidean Space
• How to answer the queries
–VA File
Our goal
• Re-use the infrastructure of existing DB systems to support Bregman divergence
–Storage management
–Indexing structures
–Query processing algorithms
Basic Solution
• Extended space
–With the convex function f(x) = x², each point is extended by one dimension holding the sum of f over its coordinates:

  point  D1   D2        point  D1   D2   D3
  p      0    1         p+     0    1    1
  q      0.5  0.5       q+     0.5  0.5  0.5
  r      1    0.8       r+     1    0.8  1.64
  t      1.5  0.3       t+     1.5  0.3  2.34
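For f(x) = x², the extension just appends the sum of squared coordinates as an extra dimension; a minimal sketch:

```python
def extend(point):
    """Map a point to the extended space for f(x) = x^2:
    append the sum of squares of its coordinates."""
    return tuple(point) + (sum(x * x for x in point),)

assert extend((0, 1))[-1] == 1                    # point p
assert abs(extend((0.5, 0.5))[-1] - 0.5) < 1e-9   # point q
assert abs(extend((1, 0.8))[-1] - 1.64) < 1e-9    # point r
```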
Basic Solution
• After the extension
–Index the extended points with an R-Tree or VA File
–Re-use the existing algorithms, with lower and upper bounds computed on the rectangles
How to improve?
• Reformulate the Bregman divergence
• Derive tighter bounds
• No change to index construction or to the query processing algorithm
A New Formulation

[Figure: shifting the tangent line h to h’ with respect to the query vector vq gives the reformulated divergence D*f(p,q) = Df(p,q) + Δ]
Mathematical Interpretation
• Reformulation of the similarity search queries
–k-NN query: given a query q, a data set P and a divergence Df, find the point p ∈ P that minimizes the (reformulated) divergence from q
–Range query: given a query q and a threshold θ, return every point p ∈ P within divergence θ of q
Naïve Bounds
• Check the corners of the bounding rectangles
Tighter Bounds
• Take the curve f(x) into consideration
Query distribution
• Distortion of rectangles
–The difference between the maximum and minimum distances from points inside the rectangle to the query
Can we improve it more?
• When building an R-Tree in Euclidean space, we minimize the volume/edge length of the MBRs
–Does this remain valid for Bregman divergence?
Query distribution
• Distortion of bounding rectangles
–Invariant in Euclidean space (triangle inequality)
–Query-dependent for Bregman Divergence
Utilize Query Distribution
• Summarize the query distribution with O(d) real numbers
• Estimate the expected distortion of any bounding rectangle in O(d) time
• Allows better indexes to be constructed for both the R-Tree and the VA File
Experiments
• Data Sets
–KDD’99 data
• Network data: the proportion of packets in 72 different TCP/IP connection types
–DBLP data
• The co-authorship graph is used to generate, for each author, the probabilities of being related to 8 different areas
Experiments
• Data Sets
–Uniform Synthetic data
• Generate synthetic data with uniform distribution
–Clustered Synthetic data
• Generate synthetic data with Gaussian Mixture Model
Experiments
• Methods compared

  Method       Basic   Improved Bounds   Query Distribution
  R-Tree       R       R-B               R-BQ
  VA File      V       V-B               V-BQ
  Linear Scan  LS
  BB-Tree      BBT
Existing Solution
• BB-Tree (L. Cayton, ICML 2008)
–Memory-based indexing tree
–Constructed with k-means clustering
–Hard to update
–Ineffective in high-dimensional space
Experiments
• Index Construction Time
Experiments
• Varying dimensionality
Experiments
• Varying dimensionality (cont.)
Experiments
• Varying data cardinality
Conclusion
• A general technique for similarity search under Bregman divergence
• All techniques are built on the existing infrastructure of commercial databases
• Extensive experiments compare the performance of the R-Tree and the VA File under different optimizations
Outline

• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
–Similarity function: k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distance
• Earth Mover’s Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Motivation
• Probabilistic data is ubiquitous
–To represent data uncertainty (WSN, RFID, moving-object monitoring)
–To compress data (image processing)
• A histogram is a good way to represent probabilistic data
–Easy to capture
–Very useful in image representation
• Colors
• Textures
• Depth
Motivation
• Similarity search is important for managing probabilistic data
–Can answer which readings are similar to sensor A (range query)
–Can answer which k pictures are most similar (top-k query)
• The similarity function for probabilistic data should be carefully chosen
–Bin-by-bin methods
• L1 and L2 norms
• χ² distance
–Cross-bin methods
• Earth Mover’s Distance (EMD)
Outline
•   Motivation
•   Introduction to Earth Mover’s Distance (EMD)
•   Related works
•   Indexing the probabilistic data based on EMD
•   Experimental results
•   Conclusion and future work
Introduction to Earth Mover’s Dist
• Bin-by-bin vs. cross-bin
–Bin-by-bin: not good; it cannot handle a distribution shift
–Cross-bin: good; it can handle a distribution shift
Introduction to Earth Mover’s Dist
• What is EMD?
–Earth (soil)
–Mover (to transport)
–Distance (the cost)
–Can be understood as the cost of transporting earth
• See an example…
Moving Earth

[Figures: one arrangement of earth piles is transformed into another; two arrangements are considered equal once the earth has been moved appropriately]

The Difference?

The difference is measured as (amount moved) * (distance moved)
Linear programming

[Figure: histogram P with m bins is transformed into histogram Q with n bins; the EMD objective minimizes the sum of (distance moved) * (amount moved) over all movements]
Constraints

1. Move “earth” only from P to Q
2. P cannot send more “earth” than there is
3. Q cannot receive more “earth” than it can hold
4. As much “earth” as possible must be moved

[Figure: histograms P (m clusters) and Q (n clusters) with flows from P’ to Q’]
The Formal Definition of EMD
• Earth Mover’s Distance (EMD)
–The minimum amount of work needed to change one histogram into another

• Challenge of EMD
–Computing it takes O(N^3 log N) time
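The definition above is exactly a transportation linear program, and can be computed directly; a minimal sketch using `scipy.optimize.linprog` (an illustration of the definition, not the indexing method discussed next):

```python
import numpy as np
from scipy.optimize import linprog

def emd(p, q, dist):
    """Earth Mover's Distance between histograms p and q of equal total
    mass: minimize sum_ij f_ij * d_ij subject to row sums = p_i,
    column sums = q_j, and f_ij >= 0."""
    m, n = len(p), len(q)
    c = dist.reshape(m * n)                 # cost of each unit flow f_ij
    A_eq, b_eq = [], []
    for i in range(m):                      # all mass of bin i leaves P
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1
        A_eq.append(row); b_eq.append(p[i])
    for j in range(n):                      # bin j of Q receives exactly q_j
        col = np.zeros(m * n); col[j::n] = 1
        A_eq.append(col); b_eq.append(q[j])
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None))
    return res.fun

# 1-D histograms with bins at positions 0..3; ground distance = |i - j|
d = np.abs(np.subtract.outer(np.arange(4), np.arange(4))).astype(float)
p = np.array([1.0, 0.0, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.0, 1.0])
assert abs(emd(p, q, d) - 3.0) < 1e-6   # move all mass a distance of 3
```

General-purpose LP solvers are super-cubic in the number of bins, which is why filtering and indexing matter at query time.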
Related Works
• Filter-and-refine framework
–[1] Approximation Techniques for Indexing the Earth Mover's Distance in Multimedia Databases. ICDE 2006
• Cannot handle high-dimensional histograms
–[2] Efficient EMD-based Similarity Search in Multimedia Databases via Flexible Dimensionality Reduction. SIGMOD 2008
• Based on a scan framework, which limits scalability
• Both use a scanning scheme to process queries
–Merit: a good access order can be obtained when executing k-NN queries, which minimizes the number of candidates
–Demerit: the whole dataset must be scanned to obtain that order, so the algorithms have low scalability
Related Works
• Related works
–Based on the filter-and-refine framework
–Based on a scanning method, and therefore of low scalability
• Our work
–Also based on the filter-and-refine method
–But avoids scanning the whole data set
• Uses B+ trees
• And thus achieves high scalability
• Our contributions
–To the best of our knowledge, the first work to index high-dimensional probabilistic data based on the EMD
–Algorithms for processing similarity queries based on a B+ tree filter
–Improved efficiency and scalability of EMD-based similarity search
Indexing the probabilistic data based on EMD
• Our intuition:
–the primal-dual theory of linear programming

• Primal problem (EMD)

• Dual problem
Indexing the probabilistic data based on EMD

• Good properties of the dual space
–The constraints of the dual space are independent of the probabilistic data points (i.e., p and q in this example)
• Thus, given any feasible solution (π, Φ) in the dual space, we can derive a lower bound for EMD(p, q)
• The lower bound helps to filter out non-qualifying histograms
–Given any feasible solution (π, Φ) in the dual space, a histogram p can be mapped to a single value
• Histograms can therefore be indexed using a B+ tree
Indexing the probabilistic data based on EMD
• 1. Mapping construction
–Key and counter-key
–Assuming p is a histogram in the DB: given a feasible solution (π, Φ), we calculate the key for each record in the DB
–We index those keys using a B+ tree
–For each feasible solution (π, Φ), one B+ tree is constructed
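The slides omit the key formula, so the sketch below assumes a hypothetical key: the inner product of a histogram with the dual vector Φ. One sorted structure (standing in for a B+ tree) is then built per feasible solution.

```python
def key(hist, phi):
    """Hypothetical mapping key: inner product of the histogram with the
    dual feasible vector phi (assumed form; the slides omit the formula)."""
    return sum(h, f) if False else sum(h * f for h, f in zip(hist, phi))

# one "tree" per feasible solution: records sorted by their keys
records = {'p1': [0.2, 0.8], 'p2': [0.5, 0.5], 'p3': [0.9, 0.1]}
phi = [0.0, 1.0]
tree = sorted(records, key=lambda r: key(records[r], phi))
assert tree == ['p3', 'p2', 'p1']
```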
• Range query based on the B+ tree index
–Given any feasible solution (π, Φ), we construct a B+ tree using the keys of the histograms
–Given a query histogram, we calculate its counter-key
–Given a similarity search threshold θ, we have proved that every candidate histogram’s key can be bounded
–To further filter the candidates, we use L B+ trees and intersect their candidate results
• k-NN query based on the B+ tree index
–Given a query q, we issue a search on each B+ tree Tl with key(q, Φl)
–We create two cursors for each tree and let them fetch records in opposite directions (one leftward, one rightward)
–Whenever a record r has been accessed in every B+ tree, it is output as a candidate for the k-NN query
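The two-cursor scheme can be sketched with sorted lists standing in for B+ trees (a simplified illustration; function and variable names are hypothetical):

```python
import bisect

def knn_candidates(trees, query_keys, k):
    """Each 'tree' is a sorted list of (key, record_id) pairs. Two cursors
    per tree expand outward from the query key; a record becomes a
    candidate once it has been fetched from every tree."""
    L = len(trees)
    seen = [set() for _ in range(L)]
    cursors = []
    for keys, qk in zip(trees, query_keys):
        pos = bisect.bisect_left([kv[0] for kv in keys], qk)
        cursors.append([pos - 1, pos])          # [left cursor, right cursor]
    candidates = []
    while len(candidates) < k:
        progressed = False
        for t, keys in enumerate(trees):
            for side in (0, 1):
                i = cursors[t][side]
                if 0 <= i < len(keys):
                    rid = keys[i][1]
                    seen[t].add(rid)
                    cursors[t][side] += 1 if side else -1
                    progressed = True
                    if rid not in candidates and all(rid in s for s in seen):
                        candidates.append(rid)
        if not progressed:                      # every cursor exhausted
            break
    return candidates[:k]

t1 = [(1, 'a'), (2, 'b'), (3, 'c')]
t2 = [(1, 'b'), (2, 'a'), (3, 'c')]
assert knn_candidates([t1, t2], [2, 2], 1) == ['b']
```

Records fetched early by every cursor are close to the query under every mapping, so they are promising candidates for the refinement step.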
Experimental Setup
• 3 real data sets
–RETINA1
• An image data set consisting of 3932 feline retina scans labeled with various antibodies
–IRMA
• Contains 10000 radiography images from the Image Retrieval in Medical Applications (IRMA) project
–DBLP
• With parameter settings
Experimental Results on
Query CPU Time
Experimental Results on Scalability

[Figure: scalability of our method compared with the SIGMOD 2008 approach]
Conclusions
• We present a new indexing scheme for general-purpose similarity search under the Earth Mover's Distance
• Our index method relies on primal-dual theory to construct mapping functions from the original probabilistic space to a one-dimensional domain
• Our B+ tree-based index framework
–Has high scalability
–Has high efficiency
–Can handle high-dimensional data
Outline

• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
–Similarity function: k-n-match
–ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distance
• Earth Mover’s Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
A Microarray Dataset

A table with 1,000–100,000 columns (genes) and 100–500 rows (samples):

  Sample     Gene1  Gene2  Gene3  Gene4  Gene5  Gene6  …   Class
  Sample1                                                   Cancer
  Sample2                                                   Cancer
  …
  SampleN-1                                                 ~Cancer
  SampleN                                                   ~Cancer

• Find closed patterns which occur frequently among genes
• Find rules which associate certain combinations of the columns that affect the class of the rows
–Gene1, Gene10, Gene1001 -> Cancer
Challenge I
• Large number of patterns/rules
–The number of possible column combinations is extremely high
• Solution: the concept of a closed pattern
–Patterns that are found in exactly the same set of rows are grouped together and represented by their upper bound
• Example: the patterns e, h, ae, ah, eh and aeh are all found in exactly rows 2, 3 and 4, so they are represented by their upper bound “aeh” (the closed pattern); “a”, however, is not part of the group, since it also occurs outside rows 2, 3 and 4

  i   ri                Class
  1   a,b,c,l,o,s       C
  2   a,d,e,h,p,l,r     C
  3   a,c,e,h,o,q,t     C
  4   a,e,f,h,p,r       ~C
  5   b,d,f,g,l,q,s,t   ~C
Challenge II
• Most existing frequent-pattern discovery algorithms search the column/item enumeration space, i.e., they systematically test combinations of columns/items
• For datasets with 1,000–100,000 columns, this search space is enormous, and new algorithms are needed for this purpose. CARPENTER (SIGKDD’03) is the FIRST algorithm to search the row enumeration space instead
Column/Item Enumeration Lattice

• Each node in the lattice represents a combination of columns/items
• An edge exists from node A to B if A is a subset of B and A differs from B by only 1 column/item
• Search can be done breadth first, starting from {}

[Figure: the lattice over items a, b, c, e, from {} up to {a,b,c,e}]

  i   ri                Class
  1   a,b,c,l,o,s       C
  2   a,d,e,h,p,l,r     C
  3   a,c,e,h,o,q,t     C
  4   a,e,f,h,p,r       ~C
  5   b,d,f,g,l,q,s,t   ~C
Column/Item Enumeration Lattice

• Each node in the lattice represents a combination of columns/items
• An edge exists from node A to B if A is a subset of B and A differs from B by only 1 column/item
• Search can also be done depth first
• Keep edges from parent to child only if the child is the prefix of the parent

[Figure: the same lattice restricted to prefix edges, over the same example table]
General Framework for Column/Item Enumeration

  Task                          Algorithms
  Association mining            Apriori [AgSr94], DIC; Eclat, MaxClique [Zaki01], FPGrowth [HaPe00]; Hmine
  Sequential-pattern discovery  [Zaki98, Zaki01], PrefixSpan [PHPC01]
  Iceberg cube                  Apriori [AgSr94]; BUC [BeRa99], H-Cubing [HPDW01]
A Multidimensional View

[Figure: a multidimensional view of pattern mining. Types of data or knowledge: associative pattern, sequential pattern, iceberg cube, closed/max pattern, others. Constraints and other interestingness measures. Main operations: pruning method, compression method, lattice transversal.]

Sample/Row Enumeration Algorithms

• To avoid searching the large column/item enumeration space, our mining algorithms search for patterns/rules in the sample/row enumeration space
• Our algorithms do not fit into the column/item enumeration framework
• They are not YAARMA (Yet Another Association Rules Mining Algorithm)
• Column/item enumeration algorithms simply do not scale for microarray datasets
Existing Row/Sample Enumeration Algorithms

• CARPENTER (SIGKDD’03)
–Finds closed patterns using row enumeration
• FARMER (SIGMOD’04)
–Finds interesting rule groups and builds classifiers based on them
• COBBLER (SSDBM’04)
–Combines row and column enumeration for tables with large numbers of rows and columns
• TopK-IRG (SIGMOD’05)
–Finds the top-k covering rules for each sample and builds a classifier directly
• Efficiently Finding Lower Bound Rules (TKDE 2010)
–Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao. What is Unequal among the Equals? Ranking Equivalent Rules from Gene Expression Data. Accepted in TKDE
Concepts of CARPENTER

Example table:
  i   ri                Class
  1   a,b,c,l,o,s       C
  2   a,d,e,h,p,l,r     C
  3   a,c,e,h,o,q,t     C
  4   a,e,f,h,p,r       ~C
  5   b,d,f,g,l,q,s,t   ~C

Transposed table TT (for each item ij, the set of rows R(ij) containing it, split by class):
  ij   C      ~C
  a    1,2,3  4
  b    1      5
  c    1,3
  d    2      5
  e    2,3    4
  f           4,5
  g           5
  h    2,3    4
  l    1,2    5
  o    1,3
  p    2      4
  q    3      5
  r    2      4
  s    1      5
  t    3      5

Conditional transposed table TT|{2,3} (items occurring in both rows 2 and 3):
  ij   C      ~C
  a    1,2,3  4
  e    2,3    4
  h    2,3    4
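The transposed and conditional tables can be built directly; a minimal sketch over the example table (class labels kept alongside the item sets):

```python
from collections import defaultdict

# example table from the slides: row id -> (items, class)
table = {
    1: ({'a', 'b', 'c', 'l', 'o', 's'}, 'C'),
    2: ({'a', 'd', 'e', 'h', 'p', 'l', 'r'}, 'C'),
    3: ({'a', 'c', 'e', 'h', 'o', 'q', 't'}, 'C'),
    4: ({'a', 'e', 'f', 'h', 'p', 'r'}, '~C'),
    5: ({'b', 'd', 'f', 'g', 'l', 'q', 's', 't'}, '~C'),
}

def transpose(table):
    """Transposed table TT: item ij -> sorted list of rows R(ij)."""
    tt = defaultdict(list)
    for rid in sorted(table):
        for item in table[rid][0]:
            tt[item].append(rid)
    return dict(tt)

def conditional(tt, rows):
    """TT|rows: keep only the items contained in every row of `rows`."""
    return {ij: r for ij, r in tt.items() if set(rows) <= set(r)}

tt = transpose(table)
assert tt['a'] == [1, 2, 3, 4]
assert set(conditional(tt, [2, 3])) == {'a', 'e', 'h'}
```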
Row Enumeration

[Figure: the row enumeration tree over rows 1–5, grown from {}. Each node is a set of rows, labeled with the items shared by all of those rows, e.g. 1 → {abclos}, 12 → {al}, 123 → {a}, 13 → {aco}, 15 → {bls}, 2 → {aeh}, 23 → {aeh}, 24 → {aehpr}, 25 → {dl}, 3 → {acehoqt}, 34 → {aeh}, 4 → {aefhpr}, 45 → {f}, 5 → {bdfglqst}. Conditional transposed tables such as TT|{1}, TT|{12} and TT|{123} guide the enumeration.]
Pruning Method 1

•   Rows that appear in every tuple of a conditional transposed table can be removed without affecting the results.

    TT|{2,3}   C        ~C
    a          1,2,3    4
    e          2,3      4
    h          2,3      4

r4 has 100% support in the conditional table of {r2, r3} (it appears in every tuple), so the branch {r2, r3} -> {r2, r3, r4} yields the same pattern {aeh} and is pruned.
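The 100%-support check behind this pruning can be sketched as follows (a simplified illustration, not the CARPENTER implementation; `tt_23` hard-codes the conditional table above):

```python
# Conditional transposed table TT|{2,3}: item -> all rows containing the item,
# for the items shared by rows 2 and 3 (a, e, h).
tt_23 = {"a": {1, 2, 3, 4}, "e": {2, 3, 4}, "h": {2, 3, 4}}

def fully_supported_rows(cond_tt, node):
    """Rows outside the node that appear in every tuple of the conditional table."""
    common = set.intersection(*cond_tt.values())
    return common - node

extra = fully_supported_rows(tt_23, {2, 3})
# Row 4 has 100% support in TT|{2,3}, so the branch {2,3} -> {2,3,4} is pruned.
```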

95
Pruning Method 2

• If a rule group has already been discovered, we can prune the enumeration below the current node:
  – all rules below this node have been discovered before;
  – for example, at node {3,4} the itemset is {aeh}; since {aeh} has already been found (at node {2,3}), all branches below {3,4} can be pruned.

    TT|{3,4}   C        ~C
    a          1,2,3    4
    e          2,3      4
    h          2,3      4

[Figure: the row enumeration tree as before, with the subtree below node {3,4} pruned.]

Pruning Method 3: Minimum Support

• Example: from TT|{1} we can see that the support of any pattern below node {1} is at most 5 rows (the distinct rows appearing in the conditional table). If this bound falls below the minimum support threshold, the branch is pruned.

    TT|{1}   C        ~C
    a        1,2,3    4
    b        1        5
    c        1,3
    l        1,2      5
    o        1,3
    s        1        5
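The bound follows because any pattern enumerated below {1} can only be supported by rows that appear somewhere in TT|{1}; a small sketch:

```python
# TT|{1}: items of row 1, each with every row that contains it.
tt_1 = {"a": {1, 2, 3, 4}, "b": {1, 5}, "c": {1, 3},
        "l": {1, 2, 5}, "o": {1, 3}, "s": {1, 5}}

def support_upper_bound(cond_tt):
    """No pattern enumerated below this node can cover more rows than this."""
    return len(set().union(*cond_tt.values()))

# The bound is 5 here; if it drops below minsup, the whole branch is pruned.
```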

96

From CARPENTER to FARMER
• What if class labels exist? What more can we do?
• Pruning with interestingness measures
  – minimum confidence
  – minimum chi-square
• Generate lower bounds for classification/prediction


Interesting Rule Groups
• Concept of a rule group (equivalence class):
  – rules supported by exactly the same set of rows are grouped together.
• Example: the following rules are all derived from rows 2, 3 and 4, each with 66% confidence:

    upper bound:    aeh --> C (66%)
                    ae --> C (66%)    ah --> C (66%)    eh --> C (66%)
    lower bounds:   e --> C (66%)     h --> C (66%)

  a --> C, however, is not in the group (it is also supported by row 1).

    i    ri                 Class
    1    a,b,c,l,o,s        C
    2    a,d,e,h,p,l,r      C
    3    a,c,e,h,o,q,t      C
    4    a,e,f,h,p,r        ~C
    5    b,d,f,g,l,q,s,t    ~C
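Rule groups can be recovered by grouping antecedents on their exact supporting row set; an illustrative brute-force sketch over the items {a, e, h} (not the FARMER algorithm):

```python
from itertools import combinations

rows = {1: ({"a", "b", "c", "l", "o", "s"}, "C"),
        2: ({"a", "d", "e", "h", "p", "l", "r"}, "C"),
        3: ({"a", "c", "e", "h", "o", "q", "t"}, "C"),
        4: ({"a", "e", "f", "h", "p", "r"}, "~C"),
        5: ({"b", "d", "f", "g", "l", "q", "s", "t"}, "~C")}

def support(itemset):
    """Rows whose item sets contain the antecedent."""
    return frozenset(r for r, (items, _) in rows.items() if itemset <= items)

# Group every antecedent over {a, e, h} by its exact supporting row set.
groups = {}
for k in (1, 2, 3):
    for combo in combinations("aeh", k):
        groups.setdefault(support(set(combo)), []).append("".join(combo))

# e, h, ae, ah, eh and aeh share support {2, 3, 4} -> one rule group;
# a is supported by {1, 2, 3, 4}, so a --> C is not in that group.
```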

97

Pruning by Interestingness Measure
• In addition, find only interesting rule groups (IRGs) based on some measures:
  – minconf: the rules in the rule group can predict the class on the RHS with high confidence
  – minchi: there is high correlation between the LHS and RHS of the rules based on a chi-square test
• Other measures such as lift, entropy gain, conviction, etc. can be handled similarly


Ordering of Rows: All Class C before ~C

[Figure: the transposed table and the row enumeration tree as before, with the rows ordered so that all class-C rows (1, 2, 3) come before the ~C rows (4, 5).]
98

Pruning Method: Minimum Confidence

• Example: in TT|{2,3} on the right, the maximum confidence of any rule below node {2,3} is at most 4/5.

    TT|{2,3}   C          ~C
    a          1,2,3,6    4,5
    e          2,3,7      4,9
    h          2,3        4
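One plausible way to arrive at the 4/5 bound stated above (assuming rows are ordered class C first, so only rows after the node can join the support, while every item in the table forces its common ~C rows into any support):

```python
node = {2, 3}                       # current enumeration node (class-C rows)
# TT|{2,3}: item -> (C rows, ~C rows)
tt = {"a": ({1, 2, 3, 6}, {4, 5}),
      "e": ({2, 3, 7},    {4, 9}),
      "h": ({2, 3},       {4})}

last = max(node)
# Best case: every class-C row ordered after the node joins the support...
pos = set(node) | {r for c_rows, _ in tt.values() for r in c_rows if r > last}
# ...while the ~C rows shared by every item are unavoidable.
neg = set.intersection(*(nc for _, nc in tt.values()))
bound = len(pos) / (len(pos) + len(neg))   # 4 / (4 + 1) = 0.8
```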


Pruning Method: Minimum Chi-Square

Same as in computing the maximum confidence: plug the extreme cell counts into the contingency table, keeping the column totals constant.

    TT|{2,3}   C          ~C
    a          1,2,3,6    4,5
    e          2,3,7      4,9
    h          2,3        4

              C           ~C          Total
    A         max = 5     min = 1     computed
    ~A        computed    computed    computed
    Total     constant    constant    constant
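The bound can be evaluated with the ordinary 2x2 chi-square statistic; a sketch, where the column totals nC = 7 and nNC = 5 are hypothetical values chosen only for illustration:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic of the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Column totals are constant at a node; plugging the extreme cell counts from
# the slide (max 5 C rows, min 1 ~C row) with the assumed totals gives an
# upper bound on the chi-square of any rule below the node.
n_c, n_nc = 7, 5                      # assumed column totals, for illustration
max_chi = chi_square_2x2(5, 1, n_c - 5, n_nc - 1)
```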

99

Finding Lower Bounds: MineLB

• Example: an upper bound rule with antecedent A = abcde and two rows (r1: abcf) and (r2: cdeg).
  – Initialize the lower bounds to the single items {a, b, c, d, e}.
  – Add r1 = abcf: a, b and c are contained in r1, so the lower bounds become {d, e}.
  – Add r2 = cdeg: d and e are contained in r2, giving candidate lower bounds ad, ae, bd, be, cd, ce. cd and ce are removed since they are still contained in cdeg; ad, ae, bd and be are kept since no existing lower bound overrides them.

[Figure: the lattice over {a, b, c, d, e}, highlighting the rows abc and cde and the surviving candidates.]


Implementation

• In general, CARPENTER/FARMER can be implemented in many ways:
  – FP-tree
  – vertical format
• For our case, we assume the dataset fits in main memory and use a pointer-based algorithm similar to BUC.

[The transposed table TT of the running example is shown again on the right.]

100

Experimental studies

• Efficiency of FARMER
  – On five real-life datasets:
    • lung cancer (LC), breast cancer (BC), prostate cancer (PC), ALL-AML leukemia (ALL), colon tumor (CT)
  – Varying minsup, minconf, minchi
  – Benchmarked against:
    • CHARM [ZaHs02] ICDM'02
    • Bayardo's algorithm (ColumnE) [BaAg99] SIGKDD'99
• Usefulness of IRGs
  – Classification


Example results -- Prostate

[Figure: runtime (log scale, 1 to 100,000) versus minimum support (3 to 9) on the prostate cancer dataset, comparing FARMER, ColumnE and CHARM.]

101

Example results -- Prostate

[Figure: results (0 to 1200) versus minimum confidence (0% to 99%) for FARMER with minsup=1, with and without minchi=10.]


Top-k Covering Rule Groups

• Rank rule groups (upper bounds) according to:
  – confidence
  – support
• Top-k covering rule groups for row r:
  – the k highest-ranking rule groups that have row r in their support, with support above the minimum support
• Top-k covering rule groups = union of the TopkRGS of each row

102

Usefulness of Rule Groups

•   Rules for every row
•   The top-1 covering rule group is sufficient to build a CBA classifier
•   No minimum confidence threshold, only minimum support
•   #TopKRGS <= k x #rows


Top-k covering rule groups

• For each row, we find the k most significant rule groups:
  – based on confidence first
  – then support

    Class    Items
    C1       a,b,c
    C1       a,b,c,d
    C1       c,d,e
    C2       c,d,e

• Given minsup = 1, the top-1 rule groups are:
  – row 1: abc --> C1 (sup = 2, conf = 100%)
  – row 2: abc --> C1
           abcd --> C1 (sup = 1, conf = 100%)
  – row 3: cd --> C1 (sup = 2, conf = 66.7%)
  – row 4: cde --> C2 (sup = 1, conf = 50%)
• If minconf = 80%, which rule groups survive?

103

Main advantages of top-k covering rule groups

• The number of rule groups is bounded by the product of k and the number of samples
• Treats each sample equally, providing a complete (yet small) description for each row
• The minimum confidence parameter is replaced by k
• Sufficient to build classifiers while avoiding excessive computation


Top-k pruning
• At node X, the maximal set of rows covered by rules to be discovered below X consists of the rows containing X and the rows ordered after X.
  – minconf <- the minimum confidence of the discovered TopkRGs over all rows in the above set
  – minsup <- the corresponding minimum support
• Pruning
  – if the estimated upper bound of confidence below X < minconf, prune
  – if the confidence is the same and the support is smaller, prune
• Optimizations

104

Classification based on association rules
• Step 1: generate the complete set of association rules for each class (given minimum support and minimum confidence).
  – The CBA algorithm adopts an Apriori-like algorithm, which fails at this step on microarray data.
• Step 2: sort the set of generated rules.
• Step 3: select a subset of rules from the sorted rule sets to form the classifier.


Features of RCBT classifiers

    Problem                                  RCBT
    Discovering, storing, retrieving         Mine only the rules to be used for
    and sorting a large number of rules      classification, e.g. the top-1 rule
                                             group is sufficient to build a CBA
                                             classifier

    Default class not convincing             Main classifier plus some back-up
    for biologists                           classifiers

    How to integrate rules with the          Use a subset of lower bound rules,
    same discriminating ability?             integrated with a score considering
                                             both confidence and support.
                                             (Upper bound rules: specific;
                                             lower bound rules: general.)

105

Experimental studies
• Datasets: 4 real-life datasets
• Efficiency of top-k rule mining
  – Benchmarks: FARMER, CHARM, CLOSET+
• Classification methods:
  – CBA (built using the top-1 rule group)
  – RCBT (our proposed method)
  – IRG classifier
  – Decision trees (single, bagging, boosting)
  – SVM


Runtime vs. minimum support on the ALL-AML dataset

[Figure: runtime in seconds (log scale, 0.01 to 10,000) versus minimum support (17 to 25) for FARMER, FARMER (minconf=0.9), FARMER+prefix (minconf=0.9), TOP1 and TOP100.]

106

Scalability with k

[Figure: runtime in seconds (log scale, 0.1 to 100) versus k (100 to 1000) on the PC and ALL datasets.]


Biological meaning -- Prostate Cancer Data

[Figure: frequency of occurrence (0 to 1800) versus gene rank (0 to 1600). Highly ranked genes include W72186, AF017418, AI635895, X14487, AB014519, M61916 and Y13323.]

107

Classification results

[Figures: classification accuracy results on the four datasets.]

108

References
•       Anthony K. H. Tung, Rui Zhang, Nick Koudas, Beng Chin Ooi. "Similarity Search: A Matching Based Approach". VLDB'06.
•       H. V. Jagadish, Raymond T. Ng, Beng Chin Ooi, Anthony K. H. Tung. "ItCompress: An Iterative Semantic Compression Algorithm". ICDE'04, Boston, 2004.
•       Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, Anthony K. H. Tung. "Similarity Search on Bregman Divergence: Towards Non-Metric Indexing". VLDB'09, Lyon, France, August 24-28, 2009.
•       Jia Xu, Zhenjie Zhang, Anthony K. H. Tung, Ge Yu. "Efficient and Effective Similarity Search over Probabilistic Data Based on Earth Mover's Distance". VLDB 2010; a preliminary version in Technical Report TRA5-10, National University of Singapore.
•       Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD'05, Baltimore, Maryland, 2005.
•       Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao. "What is Unequal among the Equals? Ranking Equivalent Rules from Gene Expression Data". Accepted in TKDE.


Optional References:
•     Feng Pan, Gao Cong, Anthony K. H. Tung, Jiong Yang, Mohammed Zaki. "CARPENTER: Finding Closed Patterns in Long Biological Datasets". KDD'03, Washington, DC, USA, August 24-27, 2003.
•     Gao Cong, Anthony K. H. Tung, Xin Xu, Feng Pan, Jiong Yang. "FARMER: Finding Interesting Rule Groups in Microarray Datasets". SIGMOD'04, June 13-18, 2004, Paris, France.
•     Feng Pan, Anthony K. H. Tung, Gao Cong, Xin Xu. "COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery". SSDBM 2004, Santorini Island, Greece.
•     Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Feng Pan. "Mining Frequent Closed Patterns in Microarray Data". ICDM 2004.
•     Xin Xu, Ying Lu, Anthony K. H. Tung, Wei Wang. "Mining Shifting-and-Scaling Co-Regulation Patterns on Gene Expression Profiles". ICDE 2006.


109
Mining and Searching Complex Structures                         Chapter 3 Similarity Search on Sequences

Searching and Mining Complex
Structures
Similarity Search on Sequences
Anthony K. H. Tung(鄧锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung

Types of sequences
Symbolic vs numeric
We only touch discrete symbols here. Sequences of numbers are called time series, which is a huge topic by itself!
Single-dimensional vs multi-dimensional
Example: Yueguo Chen, Shouxu Jiang, Beng Chin Ooi, Anthony K. H. Tung. "Querying Complex Spatial-Temporal Sequences in Human Motion Databases". ICDE 2008.
Single long sequence vs multiple sequences

2010-7-31

110

Outline
• Searching based on a disk based suffix tree
• Approximate Matching Using Inverted List (Vgrams)
• Approximate Matching Based on B+ Tree (BED Tree)


Suffix
Suffixes of acacag$:

1.   acacag$
2.   cacag$
3.   acag$
4.   cag$
5.   ag$
6.   g$
7.   $
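The enumeration above is trivial to reproduce:

```python
def suffixes(s):
    """All suffixes of s, from the longest (the whole string) down to '$'."""
    return [s[i:] for i in range(len(s))]

sufs = suffixes("acacag$")
# 7 suffixes, matching the numbering on the slide.
```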

111

Suffix Trie

E.g. consider the string S = acacag$.
The suffix trie is a trie of all possible suffixes of S:

    Suffix
    1     acacag$
    2     cacag$
    3     acag$
    4     cag$
    5     ag$
    6     g$
    7     $

[Figure: the suffix trie of acacag$, with leaves labeled 1-7 by suffix start position.]

Suffix Tree (I)
Suffix tree for S = acacag$: merge nodes with only one child.

      1 2 3 4 5 6 7
  S = a c a c a g $

[Figure: the suffix tree of acacag$. The path label of a node v, denoted α(v), is the concatenation of edge labels from the root to v, e.g. α(v) = "aca"; "ca" is an edge label; an edge ending at a leaf is a leaf edge.]

112

Suffix Tree (II)
The suffix tree has exactly n leaves and O(n) edges.
The label of each edge can be represented using 2 indices into S.
Thus, the suffix tree can be represented using O(n log n) bits.

      1 2 3 4 5 6 7
  S = a c a c a g $

[Figure: the suffix tree of acacag$ with each edge label replaced by an index pair, e.g. (2,3) for "ca" and (6,7) for "g$".]

Note: the end index of every leaf edge is 7, the last index of S. Thus, for leaf edges, we only need to store the start index.


Generalized suffix tree
Build a suffix tree for two or more strings.
E.g. S1 = acgat#, S2 = cgt$

[Figure: the generalized suffix tree of acgat# and cgt$, with leaves labeled by string and suffix start position.]

113

Straightforward construction of a suffix tree

Consider S = s1s2...sn where sn = $.

Algorithm:
  Initialize the tree with only a root
  For i = n down to 1
    Insert S[i..n] into the tree

Time: O(n^2)
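The O(n^2) algorithm above, realized as an (uncompressed) suffix trie with dict nodes; merging unary chains would yield the suffix tree:

```python
def build_suffix_trie(s):
    """Insert the suffixes S[i..n] for i = n down to 1 (O(n^2) total work).

    Nodes are plain dicts mapping a character to the child node."""
    root = {}
    for i in range(len(s) - 1, -1, -1):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def count_leaves(node):
    """Leaves are empty dicts; every '$'-terminated suffix ends at one."""
    return 1 if not node else sum(count_leaves(c) for c in node.values())

trie = build_suffix_trie("acca$")
# One leaf per suffix: $, a$, ca$, cca$, acca$.
```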


Example of construction
S = acca$

[Figure: the five intermediate trees I5, I4, I3, I2, I1 obtained by inserting the suffixes $, a$, ca$, cca$ and acca$ in turn.]
114

Construction of a generalized suffix tree
S' = c#

[Figure: starting from the suffix tree I1 of acca$, the trees J2 and J1 are obtained by inserting the suffixes # and c# of S'.]

Property of suffix trees
Fact: for any internal node v in the suffix tree, if the path label of v is α(v) = ap (a character a followed by a string p), then there exists another node w in the suffix tree such that α(w) = p.

Proof: skipped.

For any internal node v, define its suffix link sl(v) = w.

115

S = acacag$

[Figure: the suffix tree of acacag$, illustrating the suffix links between internal nodes.]


Can we construct a suffix tree in O(n) time?
Yes. We can construct it in O(n) time and O(n) space:
  Weiner's algorithm [1973]
    Linear time for a constant-size alphabet, but much space
  McCreight's algorithm [JACM 1976]
    Linear time for a constant-size alphabet, quadratic space
  Ukkonen's algorithm [Algorithmica, 1995]
    Online algorithm, linear time for a constant-size alphabet, less space
  Farach's algorithm [FOCS 1997]
    Linear time for a general alphabet
  Hon, Sadakane, and Sung's algorithm [FOCS 2003]
    O(n)-bit space, O(n log^e n) time for 0 < e < 1
    O(n)-bit space, O(n) time for suffix array construction
But these are all in-memory algorithms that do not guarantee locality of processing.

116

Trellis Algorithm
A novel disk-based suffix tree construction algorithm designed specifically for DNA sequences:
  Scales gracefully to very large genome sequences (e.g. the human genome)
  Unlike existing algorithms, Trellis exhibits no data skew problem
  Trellis has fast construction and query times
Trellis is a 4-step algorithm.

Trellis: Algorithm Overview

[Figure: (1) variable-length prefixes, e.g. AA, ACA, ACC, ..., are chosen; (2) the input sequence S is split into partitions R0, R1, ..., Rr-1 and a suffix tree TRi is built for each partition, with the prefixed suffix subtrees TRi,Pj stored on disk; (3) for each prefix Pi, the subtrees TR0,Pi, ..., TRr-1,Pi are merged into a single tree TPi.]

117

1. Variable-length Prefix Creation

Goal: separate the complete suffix tree by prefixes of suffixes, such that each subtree can reside entirely in the available memory.

Main idea: expand prefixes only as needed.

[Figure: frequency of length-2 prefixes for the human genome (0 to 300,000,000). AA and TT are the most frequent prefixes; CG and GC are the least frequent.]


2. Suffix Tree Partitioning

For each partition Ri, build its suffix tree TRi and write the prefixed suffix subtrees to disk:
• Use Ukkonen's method because of its efficiency: O(n) time and space.
• Store enough information so that a subtree can be rebuilt quickly, e.g. edge starting index, edge length, node parent, etc.

118

3. Suffix Tree Merging

[Figure: for each prefix Pi, the prefixed subtrees TR0,Pi, ..., TRr-1,Pi stored on disk are merged into a single tree TPi.]

Merge Algorithm

Case 1: No common prefix between the top-level edges of T1 (A, C) and T2 (G, T): the edges simply become siblings in the merged tree.

119

[Figure: after the Case 1 merge, the edges A, C, G and T hang off a single root.]


Case 2: The top-level edges of T1 (CAAT) and T2 (CAGGC) share a common prefix.

120

In Case 2, the edges are split at the common prefix CA, creating an internal node with children AT and GGC.
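The two cases can be sketched for a single pair of edges (a simplified illustration of the base cases, not the full recursive Trellis merge):

```python
def common_prefix(a, b):
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def merge_edges(label1, sub1, label2, sub2):
    """Merge two edges of compressed tries (dict: edge label -> subtree).

    Covers only the two base cases on the slides; the full algorithm
    recurses into the subtrees when a remainder is empty."""
    p = common_prefix(label1, label2)
    if not p:                                   # Case 1: no common prefix
        return {label1: sub1, label2: sub2}     #   -> edges become siblings
    r1, r2 = label1[len(p):], label2[len(p):]   # Case 2: split at the prefix
    return {p: {r1: sub1, r2: sub2}}            #   -> new internal node

merged = merge_edges("CAAT", {}, "CAGGC", {})
# {'CA': {'AT': {}, 'GGC': {}}}
```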


4. Suffix Link Recovery
Some internal nodes have suffix links from Ukkonen's algorithm in step 2.
Some internal nodes are created in the merging step and do not have suffix links.
Suffix links stored on disk do not help speed this step up, so they are discarded to simplify the algorithm.
Should suffix links be required, the suffix link recovery algorithm rebuilds them.

121

For each prefixed suffix tree, recursively call this function from the tree's root.
  x: an internal node
  L: the edge label between x and parent(x)

RECOVER(x, L)
  if (x == root) sl(x) <- x;
  else {
    1. p = parent(x);
    2. q = sl(p);  // get the suffix link of p, and load the prefixed tree
       containing q from disk if it is not in memory
    3. Skip/count using L to locate sl(x) under q;
  }
  for (each internal child y of x)
    RECOVER(y, edge-label(x, y));

Experimental Results

[Figure (left): construction time, Trellis vs TOP-Q and DynaCluster; time in minutes (log scale, 1 to 1000) versus sequence length (0 to 120 Mbp). Memory: 512 MB; TOP-Q and DynaCluster parameters were set as recommended in their papers.]

[Figure (right): construction and link recovery time, Trellis vs TDD; time in minutes (0 to 400) versus sequence length (200 to 1000 Mbp). Memory: 512 MB. For the human genome suffix tree (~3 Gbp, using 2 GB of memory): TDD took 12.8 hr, Trellis 5.9 hr without link recovery.]

122

Experimental Results (cont.)
Disk Space Usage

[Figure: disk-based suffix tree size, Trellis vs TDD; size in GB (0 to 30) versus sequence length (200 to 1000 Mbp).]

Trellis remains at 27 bytes per character indexed, even for the human genome. For the human genome, TDD uses less space but requires a 64-bit environment to index larger sequences.

    Human Genome:   Trellis 72 GB    TDD 54 GB

Experimental Results (cont.)

[Figure: query times on the human genome suffix tree, Trellis vs TDD; query time in seconds (0.000 to 0.200) for query lengths 40 to 8000 bp.]

TDD:
• smaller suffix trees
• edge length must be determined by examining all children nodes
• each internal node only has a pointer to its first child, i.e. children must be linearly scanned during a query search

Trellis:
• larger suffix trees
• edge length stored locally with its respective node
• all children locations stored locally, so each child can be accessed in constant time, i.e. no linear scan needed

Hence, faster query time!
123

Experimental Results (cont.)

[Figure: searching a query of length 100 within S[1..150]. The search uses suffix links to move across the tree from a node v to sl(v) when advancing to the next query position, mimicking the behavior of exact-match anchor search during a genome alignment.]
2010-7-31

Experimental Results (cont.)
(Figure: query times on the human genome suffix tree using suffix links,
for query lengths from 40 to 8000 bp; all queries finish within about
0.05 seconds.)


Summary
Trellis builds a disk-based suffix tree based on
A partitioning method via variable-length prefixes
A suffix subtree merging algorithm
Trellis is both time and space efficient
Faster than existing leading methods in both
construction and query time


Outline
• Searching based on a disk based suffix tree
• Approximate Matching Using Inverted List (Vgrams)
• Approximate Matching Based on B+ Tree (BED Tree)


Example 1: a movie database

Find movies starring Samuel Jackson

Star             Title                                           Year   Genre
Keanu Reeves     The Matrix                                      1999   Sci-Fi
Samuel Jackson   Star Wars: Episode III - Revenge of the Sith    2005   Sci-Fi
Schwarzenegger   The Terminator                                  1984   Sci-Fi
Samuel Jackson   Goodfellas                                      1990   Drama
…                …                                               …      …

The user doesn’t know the exact spelling!

(Same movie table as above.)


Relax Condition

Find movies with a star “similar to” Schwarrzenger.
(Same movie table as above; “Schwarzenegger” is the intended match.)

Edit Distance
Given two strings A and B, edit A to B with the
minimum number of edit operations:
Replace a letter with another letter
Insert a letter
Delete a letter
E.g.
A = interestings                         _i__nterestings
B = bioinformatics                       bioinformatic_s
101101101100110
Edit distance = 9

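The minimum-operation definition above maps directly onto the classic dynamic program. A minimal sketch in Python (names are ours, for illustration):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-letter
    replacements, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))               # distances from empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                                # deleting i letters of a
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + (ca != cb)))  # replace or match
        prev = cur
    return prev[-1]

print(edit_distance("interestings", "bioinformatics"))  # → 9, as above
```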


Edit Distance Computation
Instead of minimizing the number of edit operations, we
can associate a cost with each operation and
minimize the total cost. This minimum total cost is the edit distance.
For the previous example, the cost function is as follows:
A = _i__nterestings
B = bioinformatic_s
    101101101100110
Edit distance = 9

     _   A   C   G   T
_        1   1   1   1
A    1   0   1   1   1
C    1   1   0   1   1
G    1   1   1   0   1
T    1   1   1   1   0

Needleman-Wunsch algorithm (I)
Consider two strings S[1..n] and T[1..m].
Define V(i, j) to be the score of the optimal
alignment between S[1..i] and T[1..j]
Basis:
V(0, 0) = 0
V(0, j) = V(0, j-1) + δ(_, T[j])     (insert j times)
V(i, 0) = V(i-1, 0) + δ(S[i], _)     (delete i times)


Needleman-Wunsch algorithm (II)

Recurrence: For i>0, j>0

V(i, j) = max of
  V(i-1, j-1) + δ(S[i], T[j])    (match/mismatch)
  V(i-1, j)   + δ(S[i], _)       (delete)
  V(i, j-1)   + δ(_, T[j])       (insert)

In the alignment, the last pair must be either a
match/mismatch, a delete, or an insert:
xxx…xx            xxx…xx            xxx…x_
xxx…yy            yyy…y_            yyy…yy
match/mismatch    delete            insert

Example (I)

     _    A    G    C    A    T    G    C
_    0   -1   -2   -3   -4   -5   -6   -7
A   -1
C   -2
A   -3
A   -4
T   -5
C   -6
C   -7


Example (II)

     _    A    G    C    A    T    G    C
_    0   -1   -2   -3   -4   -5   -6   -7
A   -1    2    1    0   -1   -2   -3   -4
C   -2    1    1    ?
A   -3
A   -4
T   -5
C   -6
C   -7

The "?" cell (row C, column C) = max(1 + 2, 0 - 1, 1 - 1) = 3,
using the scores implied by the table: +2 match, -1 mismatch/gap.

Example (III)

     _    A    G    C    A    T    G    C
_    0   -1   -2   -3   -4   -5   -6   -7
A   -1    2    1    0   -1   -2   -3   -4
C   -2    1    1    3    2    1    0   -1
A   -3    0    0    2    5    4    3    2
A   -4   -1   -1    1    4    4    3    2
T   -5   -2   -2    0    3    6    5    4
C   -6   -3   -3    0    2    5    5    7
C   -7   -4   -4   -1    1    4    4    7
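The filled table can be reproduced by implementing the recurrence directly. A minimal Python sketch, using the scores the example tables imply (+2 for a match, -1 for a mismatch or gap; our reading, not stated explicitly on the slides):

```python
def needleman_wunsch(S: str, T: str,
                     match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Score of the optimal global alignment of S and T."""
    n, m = len(S), len(T)
    # V[i][j] = best score aligning S[:i] with T[:j]
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + gap              # delete i times
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + gap              # insert j times
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if S[i - 1] == T[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,  # match/mismatch
                          V[i - 1][j] + gap,      # delete S[i]
                          V[i][j - 1] + gap)      # insert T[j]
    return V[n][m]

print(needleman_wunsch("ACAATCC", "AGCATGC"))  # → 7, the bottom-right cell
```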


“q-grams” of strings

universal → 2-grams: un, ni, iv, ve, er, rs, sa, al

q-gram inverted lists

id   string
0    rich
1    stick
2    stich
3    stuck
4    static

2-gram   inverted list
at       4
ch       0, 2
ck       1, 3
ic       0, 1, 2, 4
ri       0
st       1, 2, 3, 4
ta       4
ti       1, 2, 4
tu       3
uc       3
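Building such inverted lists is a single pass over the collection; a minimal sketch in Python:

```python
from collections import defaultdict

def build_inverted_lists(strings, q=2):
    """Map each q-gram to the sorted list of ids of strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            index[s[i:i + q]].add(sid)
    return {gram: sorted(ids) for gram, ids in index.items()}

index = build_inverted_lists(["rich", "stick", "stich", "stuck", "static"])
print(index["ic"])  # → [0, 1, 2, 4]
print(index["st"])  # → [1, 2, 3, 4]
```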


Searching using inverted lists
Query: “shtick”, ED(shtick, ?)≤1
2-grams of the query: sh, ht, ti, ic, ck
Count filter: # of common grams >= 3
(Same string table and 2-gram inverted lists as above; only id 1,
“stick”, shares 3 grams with the query.)
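The count-filter search above can be sketched as follows; the threshold is the fixed-length lower bound (len(query) - q + 1) - k·q discussed later in this chapter:

```python
from collections import defaultdict

def build_inverted_lists(strings, q=2):
    """q-gram → set of ids of strings containing that gram."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            index[s[i:i + q]].add(sid)
    return index

def candidates(query, strings, q=2, k=1):
    """Ids sharing at least (len(query)-q+1) - k*q grams with the query."""
    index = build_inverted_lists(strings, q)
    threshold = (len(query) - q + 1) - k * q
    counts = defaultdict(int)
    for i in range(len(query) - q + 1):
        for sid in index.get(query[i:i + q], set()):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= threshold)

data = ["rich", "stick", "stich", "stuck", "static"]
print(candidates("shtick", data))  # → [1]  (only "stick" survives the filter)
```

Only the surviving candidates then need an exact edit distance verification.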

2-grams -> 3-grams?
Query: “shtick”, ED(shtick, ?)≤1
3-grams of the query: sht, hti, tic, ick
Count filter: # of common grams >= 1

id   string          3-gram   inverted list
0    rich            ati      4
1    stick           ich      0, 2
2    stich           ick      1
3    stuck           ric      0
4    static          sta      4
                     sti      1, 2
                     stu      3
                     tat      4
                     tic      1, 2, 4
                     tuc      3
                     uck      3


Observation 1: dilemma of choosing “q”
Increasing q causes:
Longer grams → shorter lists
Smaller # of common grams between similar strings
(Same string table and 2-gram inverted lists as above.)

Observation 2: skewed distributions of gram
frequencies
DBLP: 276,699 article titles
Popular 5-grams: ation (>114K times), tions, ystem, catio


VGRAM: Main idea
Grams with variable lengths (between qmin and qmax), e.g.
(gram frequencies in parentheses):
zebra: ze (123)
corrasion: co (5213), cor (859), corr (171)
Reduces index size ☺
Reduces running time ☺

Challenges
Generating variable-length grams?
Constructing a high-quality gram dictionary?
Relationship between string similarity and
gram-set similarity?


Challenge 1: String → variable-length grams?

Fixed-length 2-grams:
universal → un, ni, iv, ve, er, rs, sa, al

Variable-length grams, with the [2,4]-gram dictionary
{ni, ivr, sal, uni, vers}:
universal → grams of length 2 to 4 drawn from the dictionary

Representing the gram dictionary as a trie

The grams ni, ivr, sal, uni, vers are stored as root-to-leaf paths
of a trie.
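One way to use such a trie is greedy longest-match decomposition: at every position, emit the longest dictionary gram starting there. This simplified rule is our illustration only; VGRAM's actual decomposition algorithm is more refined:

```python
END = "$"  # marks the end of a gram inside the trie

def build_trie(grams):
    """Dict-of-dicts trie storing each gram as a root-to-leaf path."""
    root = {}
    for g in grams:
        node = root
        for ch in g:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def decompose(s, trie):
    """At each position, emit the longest dictionary gram starting there
    (greedy rule; an illustration, not VGRAM's exact algorithm)."""
    out = []
    for i in range(len(s)):
        node, longest = trie, None
        for j in range(i, len(s)):
            if s[j] not in node:
                break
            node = node[s[j]]
            if END in node:
                longest = s[i:j + 1]
        if longest:
            out.append(longest)
    return out

trie = build_trie(["ni", "ivr", "sal", "uni", "vers"])
print(decompose("universal", trie))  # → ['uni', 'ni', 'vers', 'sal']
```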


Challenge 2: Constructing the gram
dictionary
Step 1: Collect frequencies of grams with length in [qmin, qmax]
in a gram trie with frequencies, e.g. (gram → ids of strings
containing it):

st      0, 1, 3
sti     0, 1
stu     3
stic    0, 1
stuc    3

Step 2: selecting grams
Prune the trie using a frequency threshold T (e.g., T = 2)


Step 2: selecting grams (cont)

Threshold T = 2
(Figure: the gram trie after pruning with threshold T = 2.)

Final gram dictionary

(Figure: the resulting dictionary of [2,4]-grams.)


Challenge 3: Edit operation’s effect on grams

Fixed length q: an edit operation on "universal" touches at most q
of its q-grams, so k operations can affect at most k * q grams.

Deletion affects variable-length grams

A deletion at position i can only affect grams overlapping the
window [i - qmax + 1, i + qmax - 1]; grams entirely before or after
this window are not affected.


Grams affected by a deletion

Within the window [i - qmax + 1, i + qmax - 1] around a deletion at
position i, not every gram is necessarily affected. E.g. deleting a
character of "universal" under the [2,4]-gram dictionary
{ni, ivr, sal, uni, vers} affects only some of its grams. Using the
trie of grams together with a trie of reversed grams, the affected
grams can be determined exactly.


# of grams affected by each operation

For "universal", the number of grams affected by an edit at each
position (letters: deletion/substitution; underscores: insertion):

0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0
_ u _ n _ i _ v _ e _ r _ s _ a _ l _

Max # of grams affected by k operations

NAG vector (# of affected grams) of s = <2, 4, 6, 8, 9>:
with k edit operations, at most NAG(s,k) grams of s are affected;
e.g. with 2 edit operations, at most 4 grams.
The NAG vector is precomputed and stored for each string.


Summary of VGRAM index
(Figure: the VGRAM index — gram dictionary trie, reversed-gram trie,
and per-string NAG vectors.)

Basic interfaces:
String s → its grams
Strings s1, s2 such that ed(s1,s2) <= k → min # of their
common grams


Lower bound on # of common grams

Fixed length (q), e.g. on "universal":
If ed(s1,s2) <= k, then their # of common grams >=
(|s1| - q + 1) - k * q

Variable lengths: # of grams of s1 - NAG(s1,k)
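Both bounds are one-line computations; a small sketch (the NAG vector comes from the precomputed index, and the 9-gram count below is a hypothetical value for illustration):

```python
def fixed_length_bound(s1: str, q: int, k: int) -> int:
    """Min # of q-grams shared by s1 and any s2 with ed(s1, s2) <= k."""
    return (len(s1) - q + 1) - k * q

def vgram_bound(num_grams_s1: int, nag_s1, k: int) -> int:
    """VGRAM bound: # of grams of s1 minus NAG(s1, k)."""
    return num_grams_s1 - nag_s1[k - 1]   # nag_s1[k-1] holds NAG(s1, k)

print(fixed_length_bound("shtick", q=2, k=1))   # → 3, as in the example
# NAG vector <2,4,6,8,9> from the earlier slide; assume s1 has 9 grams:
print(vgram_bound(9, [2, 4, 6, 8, 9], k=2))     # → 5
```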

Example: algorithm using inverted lists
Query: “shtick”, ED(shtick, ?)≤1

id   string
0    rich
1    stick
2    stich
3    stuck
4    static

Fixed 2-grams of the query: sh, ht, ti, ic, ck; lower bound = 3
Partial 2-gram lists:   ck: 1, 3    ic: 0, 1, 2, 4    ti: 1, 2, 4

Variable [2,4]-grams of the query: sh, ht, tick; lower bound = 1
Partial [2,4]-gram lists:   ck: 1, 3    ic: 1, 4    ich: 0, 2
                            tic: 2, 4   tick: 1


PartEnum + VGRAM
PartEnum, fixed q-grams:
ed(s1,s2) <= k implies
hamming(grams(s1), grams(s2)) <= k * q

VGRAM:
ed(s1,s2) <= k implies
hamming(VG(s1), VG(s2)) <= NAG(s1,k) + NAG(s2,k)

PartEnum + VGRAM (naïve)

Joining relations R and S:
Bm(R) = max over r in R of NAG(r,k)
Bm(S) = max over s in S of NAG(s,k)
• Both are using the same gram dictionary.
• Use Bm(R) + Bm(S) as the new hamming bound.


PartEnum + VGRAM (optimization)

Partition R into groups R1, R2, R3 with bounds Bm(R1), Bm(R2),
Bm(R3); Bm(S) = max(NAG(s,k)) as before.
• Group R based on the NAG(r,k) values
• Join(R1,S) using Bm(R1) + Bm(S)
• Similarly, Join(R2,S), Join(R3,S)
• Tighter local bounds → better signatures generated
• Grouping S is also possible.

Outline
• Searching based on a disk based suffix tree
• Approximate Matching Using Inverted List (Vgrams)
• Approximate Matching Based on B+ Tree (BED Tree)


Approximate String Search
Information Retrieval
Web search query with string “Posgre SQL” instead of
“Postgre SQL”
Data Cleaning
“13 Computing Road” is the same as “#13 Comput’ng Rd”?
Bioinformatics
Find out all protein sequences similar to
“ACBCEEACCDECAAB”


Edit Distance
Edit distance on strings:

13 Computing Drive
  → (3 deletions)     13 Computing Dr
  → (1 replacement)   13 Comput’ng Dr      Edit distance: 5
  → (1 insertion)     #13 Comput’ng Dr

Normalized edit distance:
ED(s1,s2) / MaxLength(s1,s2) = 5 / 18


Existing Solution
Q-Gram
Q=3
Postgre
##P #Po Pos ost stg tgr gre re# e##

Posgre
##P #Po Pos osg sgr gre re# e##

Observation: If ED(s1,s2)=d, they agree on at least
min(|s1|,|s2|)+Q-1-d*(Q+1) grams

Existing Solution
Inverted List
Postgre is indexed; the query Posgre probes the lists of its grams:

##P   #Po   Pos   osg   sgr   gre   re#   e##


Limitations
Inverted List Method
Limited queries supported:

                Range Query   Join Query   Top-K Query   Top-K Join
Edit Distance        Y             Y            N             N
Normalized ED        N             N            N             N

Uncontrollable memory consumption
No concurrency protocol

Our Contributions
Bed-Tree
Wide support for different queries and distances:

                Range Query   Join Query   Top-K Query   Top-K Join
Edit Distance        Y             Y            Y             Y
Normalized ED        Y             Y            Y             Y

Bounded buffer size and low I/O cost
Highly concurrent
Easy to implement
Competitive performance


Basic Index Framework
Bed-Tree Framework
Index construction follows a standard B+ tree:
map all strings to a 1D domain.
Query (e.g. "Posgre"): estimate the minimal distance to the query
and prune B+ tree nodes; refine the surviving candidates by exact
edit distance (result: "Postgre").

String Order Properties

P1: Comparability
Given two strings s1 and s2, we know the order of s1 and s2
under the specified string order
P2: Lower Bounding
Given an interval [L,U] on the string order, we know a
lower bound on the edit distance to the query string
(E.g. query "Posgre": are there candidates in a given sub-tree?)


String Order Properties

P3: Pairwise Lower Bounding
Given two intervals [L,U] and [L’,U’], we know the lower
bound of the edit distance between any s1 from [L,U] and s2 from
[L’,U’]
P4: Length Bounding
Given an interval [L,U] on the string order, we know the
minimal length of the strings in the interval
(Used to prune potential join results.)

String Order Properties
Properties vs. supported queries and distances:

                Range Query    Join Query    Top-K Query    Top-K Join
Edit Distance   P1, P2         P1, P3        P1, P2         P1, P3
Normalized ED   P1, P2, P4     P1, P3, P4    P1, P2, P4     P1, P3, P4

P1: Comparability
P2: Lower Bounding
P3: Pair-wise Lower Bounding
P4: Length Bounding


Dictionary Order
All strings are ordered alphabetically, satisfying P1, P2 and
P3

Search: "Posgre" with ED=1
Insertion: "Postgre" goes between the keys "pose" and "powder".

Dictionary Order
All strings are ordered alphabetically, satisfying P1, P2 and
P3

Search: "Posgre" with ED=1
On nodes with keys such as (pose, powder, sit) or (power, put, sad)
the search prunes almost nothing: pruning happens only when a long
shared prefix exists.


Gram Counting Order
Example: "Jim Gray"
Hash all its grams into 4 buckets, and count the grams of each
bucket in binary, e.g. counts (3, 2, 1, 3) = (11, 10, 01, 11).

Gram Counting Order
Transform the count vector to a bit string with z-order
(bit interleaving): (11, 10, 01, 11) → 11011011.
Order the strings by this signature.
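The signature construction can be sketched as below; the bucket hash and the 2-bit counter cap are our assumptions for illustration:

```python
def interleave(counts, bits=2):
    """Z-order encode: emit bit b of every counter, MSB first."""
    return "".join(str((c >> b) & 1)
                   for b in range(bits - 1, -1, -1)
                   for c in counts)

def gram_signature(s, q=2, buckets=4, bits=2):
    """Hash q-grams into buckets, count (capped to fit `bits` bits),
    then z-order the counters into one bit string."""
    counts = [0] * buckets
    for i in range(len(s) - q + 1):
        counts[hash(s[i:i + q]) % buckets] += 1
    return interleave([min(c, (1 << bits) - 1) for c in counts], bits)

# Counter vector (3, 2, 1, 3) from the slide interleaves to 11011011:
print(interleave([3, 2, 1, 3]))  # → 11011011
```

Note that Python's built-in `hash` for strings is randomized across runs, so `gram_signature` is stable only within one process; a real index would use a fixed hash.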


Gram Counting Order
Lower Bounding                                Query: "Jim Gary"
A sub-tree covering signatures "11011011" to "11011101" shares the
prefix "11011???". Comparing this prefix with the query's bucket
signature (4,1,2,2) yields a minimal edit distance of 1 to any
string in the sub-tree.

Gram Location Order
Extension of Gram Counting Order
Include positional information of the grams
(e.g. "Jim Gray" vs. "Grace Hopper")
Allows better estimation of mismatched grams
Harder to encode


Experiment Settings
Data
(Table of datasets omitted.)

Five Index Schemes
Bed-Tree: BD, BGC, BGL
Inverted List: Flamingo, Mismatch
Default Setting
Q=2, Bucket=4, Page Size=4KB

Empirical Observations
How good is Bed-Tree?
With small thresholds, inverted lists are better;
as the threshold increases, Bed-Tree is no worse.

Empirical Observations
Which string order is better?
Gram counting order is generally better
Gram location order: a tradeoff between gram content
information and position information

Conclusion
A new B+ tree index scheme
All similarity queries supported
Both edit distance and normalized edit distance
General transaction and concurrency protocol
Competitive efficiency


References
Benjarath Phoophakdee, Mohammed J. Zaki. "Genome-scale
disk-based suffix tree indexing". SIGMOD 2007: 833-844.
Chen Li, Bin Wang, Xiaochun Yang. "VGRAM: Improving Performance
of Approximate Queries on String Collections Using Variable-Length
Grams". VLDB 2007.
Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Divesh
Srivastava. "B^{ed}-Tree: An All-Purpose Tree Index for String
Similarity Search on Edit Distance". SIGMOD 2010.

Mining and Searching Complex Structures                     Chapter 4 Similarity Search on Trees

Mining and Searching Complex
Structures
Similarity Search on Trees
Anthony K. H. Tung(鄧锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung

Outline

Importance of Trees
Distance between Trees
Fast Edit Distance Approximation for Trees


Importance of Trees

Between sequences and graphs
Equivalent to a connected acyclic graph
Represents hierarchical structures
Examples
XML documents
Programs
RNA structure

Types of Trees

Is there a root?
Are the nodes labeled?
Are the children of a node ordered?


Outline

Importance of Trees
Distance between Trees
Fast Edit Distance Approximation for Trees


Distance Measure

Many ways to define distance
Convert to standard types and adopt the distance metric there
How many operations to transform one tree to another? (Edit
distance)
Inverse of similarity
dist(S, T) = maxSim – sim(S,T)
Relationship between different definitions?


Operations on Trees

Relabel

Delete

Insert


Remarks on Edit Distance

Ordered trees are tractable
Approach based on dynamic programming
NP-hard for unordered trees
Approach is to impose restrictions so that DP can be used


Edit Script

Edit script(S, T): sequence of operations to transform S to T
Example: relabel f → a, delete c, insert c, relabel e → d
(Figure showing S, T, and the intermediate trees omitted.)

Edit Distance Mapping
Edit distance mapping(S, T): alternative representation of edit
operations
relabel: v → w
delete: v → $
insert: $ → w
Mapping corresponding to the script


Edit Distance for Ordered Trees

Generalize the problem to forests.
C(φ, φ) = 0
C(S, φ) = C(S – v, φ) + cost(v → $)
C(φ, T) = C(φ, T – w) + cost($ → w)
C(S, T) = minimum of
1. C(S – v, T) + cost(v → $)               [deleting v]
2. C(S, T – w) + cost($ → w)               [inserting w]
3. C(S – tree(v), T – tree(w)) +
   C(S(v) – v, T(w) – w) + cost(v → w)     [relabel v → w]

Illustration of Case 3

C(S – tree(v), T – tree(w)) +
C(S(v) – v, T(w) – w) + cost(v → w)        [relabel v → w]

(Figure: S splits into S – tree(v) and the subtree S(v) rooted at v;
T splits into T – tree(w) and the subtree T(w) rooted at w.)
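The forest recurrence can be run directly with memoization on forest tuples. This is a didactic sketch with unit costs, not the efficient algorithm discussed below (it memoizes on whole subforests and so explores far more subproblems):

```python
from functools import lru_cache

# A tree is (label, children); children and forests are tuples of trees.

@lru_cache(maxsize=None)
def forest_dist(F, G):
    """Unit-cost edit distance between ordered forests F and G."""
    if not F and not G:
        return 0
    if not G:                      # delete all remaining nodes of F
        v = F[-1]
        return forest_dist(F[:-1] + v[1], G) + 1
    if not F:                      # insert all remaining nodes of G
        w = G[-1]
        return forest_dist(F, G[:-1] + w[1]) + 1
    v, w = F[-1], G[-1]            # rightmost roots
    return min(
        forest_dist(F[:-1] + v[1], G) + 1,         # delete v
        forest_dist(F, G[:-1] + w[1]) + 1,         # insert w
        forest_dist(F[:-1], G[:-1])                # relabel v -> w
        + forest_dist(v[1], w[1])
        + (0 if v[0] == w[0] else 1))

def tree_edit_distance(s, t):
    return forest_dist((s,), (t,))

t1 = ('a', (('b', ()), ('c', ())))   # a(b, c)
t2 = ('a', (('b', ()), ('d', ())))   # a(b, d)
print(tree_edit_distance(t1, t2))    # → 1 (relabel c -> d)
```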


Algorithm Complexity

Number of subproblems bounded by O(|S|²|T|²)
Zhang and Shasha, 1989 showed that the number of relevant
subproblems is
O(|S||T| min(SD, SL) min(TD, TL)) and space is O(|S||T|),
where SD, SL denote the depth and number of leaves of S
Further improvements require decomposition of a rooted tree
into disjoint paths

Decomposition into Paths

Concept of heavy and light nodes/edges
(Harel and Tarjan, 1984)
The root is light; the child with max subtree size is heavy
Removal of light edges partitions T into disjoint heavy paths
Important property: light depth(v) ≤ log|T| + O(1)
Complexity can be reduced to O(|S|²|T|log|T|)


Unordered Edit Distance

NP-hard
Special cases (in P)
T is a sequence
Number of leaves in T is logarithmic
Disjoint subtrees map to disjoint subtrees


Tree Inclusion

Is there a sequence of deletion operations on S which can
transform it to T?
Special case of edit distance which only allows deletions


Complexity of Tree Inclusion

Ordered trees
Concept of embeddings (restriction of mappings)
O(|S||T|) using the algorithm of
Kilpelainen and Mannila
Unordered trees
NP-complete (what did you expect?)
Special cases


Related Problems on Trees
Tree Alignment (covered in the survey paper)
Robinson-Fould's Distance for leaf labeled trees, where edge =
bipartition of leaves
Tree Pattern Matching
Maximum Agreement Subtree
Largest Common Subtree
Smallest Common Supertree
Many are generalizations of problems on strings


Summary of Tree Distance

Edit distance
Concept of edit mapping
Dynamic programming for ordered trees
Constrained edit distance for unordered trees
Tree inclusion
Special case of edit distance
Specialized algorithms are more efficient
Useful for determining embedded trees


Outline

Importance of Trees
Distance between Trees
Fast Edit Distance Approximation for Trees


Similarity Measurement
Edit Distance EDist(T1, T2)
Edit operation e with cost γ(e):
relabel a → b, delete b → λ, insert λ → b

A script si = (ei1, ei2, …, eik) transforms T1 into T2 with
cost(si) = Σj γ(eij)
EDist(T1,T2) = mini cost(si); with unit costs, EDist(T1,T2) = min k

Computational Complexity:
O(|T1| × |T2| × min(depth(T1), leaves(T1)) × min(depth(T2), leaves(T2)))

Edit Operation Mapping

The edit operations mapping M(T1,T2) is
One-to-one
Preserves sibling order
Preserves ancestor order
(Figure: a mapping between example trees T1 and T2.)


Observation

Edit operations do not change many sibling
relationships. E.g. deleting node c (c → λ) only changes the
sibling relations involving c, such as (b,c) → (b,f) and
(c,d) → (i,d).

A node can have a varying number of children, but at most 2
adjacent siblings.

Binary Tree Representation

Binary tree representation: left-child, right-sibling
Normalized binary tree: missing children/siblings are padded with ε
Each tree is summarized by a vector counting its binary branches
(a node together with its left child and right sibling in the
binary representation).
(Figure: the binary trees of T1 and T2, with node positions such as
a(1,8), b(2,3), and their binary branch count vectors.)

BBDist(T1, T2) = Σ_{i=1..|Γ|} |b1i − b2i| = 8 in the example

BBDist satisfies the triangle inequality.
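The left-child/right-sibling transform and the binary branch vector can be sketched as follows (the tree encoding and ε-padding conventions are ours; the deck's exact normalization may differ in detail):

```python
from collections import Counter

EPS = 'ε'  # padding symbol for a missing left child / right sibling

def to_lcrs(tree):
    """(label, children) → (label, left_child, right_sibling) binary tree."""
    label, children = tree

    def convert(kids):
        if not kids:
            return None
        head, *rest = kids
        h_label, h_children = head
        return (h_label, convert(h_children), convert(tuple(rest)))

    return (label, convert(children), None)

def binary_branches(bt, acc=None):
    """Multiset of (node, left child, right sibling) label triples."""
    if acc is None:
        acc = Counter()
    if bt is None:
        return acc
    label, left, right = bt
    acc[(label,
         left[0] if left else EPS,
         right[0] if right else EPS)] += 1
    binary_branches(left, acc)
    binary_branches(right, acc)
    return acc

def bb_dist(t1, t2):
    """L1 distance between binary branch vectors; EDist >= bb_dist / 5."""
    b1, b2 = binary_branches(to_lcrs(t1)), binary_branches(to_lcrs(t2))
    return sum(abs(b1[k] - b2[k]) for k in set(b1) | set(b2))

t = ('a', (('b', ()), ('c', ())))            # a with children b, c
print(to_lcrs(t))  # → ('a', ('b', None, ('c', None, None)), None)
```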


One Edit Operation Effect

(Figure: inserting or deleting a node v among the children
w1 … wl+m of a node v' only changes the binary branches around v.)

Each node appears in at most two binary branches.

Theorem

1 insertion/deletion incurs at most 5 difference on BBDist
1 relabeling incurs at most 4 difference on BBDist
For T, T' with EDist(T, T') = k = ki + kd + kr:
BBDist(T,T') <= 4kr + 5ki + 5kd <= 5k

Hence (1/5)·BBDist is a lower bound of the edit distance.


Positional Binary Branch

Two different trees can share the same binary branch vector, so a
tree T'2 may incur 0 difference for BBDist(T1,T2). The positional
binary branch PosBiB(T(u)) therefore augments each binary branch
with the position of its node, e.g.

PosBiB(T1(e)) = (BiB(e,ε,ε), 8, 7)  ≠  PosBiB(T2(e)) = (BiB(e,ε,ε), 6, 3)

yielding the Positional Binary Branch Distance.
(Figure: trees T1, T2 and T'2 with positional labels such as a(1,8).)

Computational Complexity

D: dataset; |D|: dataset size
Vector construction:
time, space: O(Σ_{i=1..|D|} |Ti|)
(each data tree is traversed once)

Optimistic bound computation:
time: each binary search is O(|Ti| + |Tq|),
in total: O(Σ_{i=1..|D|} (|Ti| + |Tq|) log(min(|Ti|, |Tq|)))
space: O(Σ_{i=1..|D|} (|Ti| + |Tq|))


Generalized Study

Extend the sliding window to q levels
The image vectors give multiple-level binary
branch profiles.
BDist_q(T,T') <= [4*(q-1)+1] * EDist(T,T')
(Figure: the q-level window over the binary tree representation.)

Query Processing Strategy

Filter-and-refine framework
Lower bound distances filter out most objects
The lower bound computation is much cheaper than the real distance
The lower bound distance is a close approximation of the real distance
Remaining objects are validated by the real distance

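The filter-and-refine loop can be sketched as follows. The distance functions here are illustrative stand-ins (a plain string edit distance and its size-difference lower bound), not the tree distances of this chapter; any pair where the lower bound never exceeds the real distance can be plugged in.

```python
def lower_bound(q, t):
    # Cheap lower bound: absolute size difference, which is a valid lower
    # bound for edit-style distances with unit insert/delete costs.
    return abs(len(q) - len(t))

def real_distance(q, t):
    # Expensive exact distance; a stand-in using plain string edit distance.
    m, n = len(q), len(t)
    d = [[i + j if i * j == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (q[i-1] != t[j-1]))
    return d[m][n]

def range_query(query, dataset, tau):
    """Filter-and-refine: prune with the lower bound, verify the rest."""
    candidates = [t for t in dataset if lower_bound(query, t) <= tau]
    return [t for t in candidates if real_distance(query, t) <= tau]
```

With a good lower bound, most objects never reach the expensive verification step, which is exactly the "percentage of data accessed" measured in the experiments.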


Experimental Settings
Compare with histogram methods [KKSS04]
Lower bound: feature vector distance (Leaf Distance Height
histogram vector, Degree histogram vector, Label histogram
vector)
Synthetic dataset:
Tree size, Fanout, Label, Decay factor
Real dataset: DBLP XML document
Performance measure:
Percentage of data accessed:
   (|false positive| + |true positive|) / |dataset| × 100%
CPU time consumed
Space requirement

Sensitivity to the Data Properties
Sensitivity test
[Figure: four panels, Range and KNN queries on N{}N{50,2.0}L8D0.05 (x-axis: fanout, 2-8) and on N{4,0.5}N{}L8D0.05 (x-axis: tree size, 25-125), plotting % of accessed data (BiBranch %, Histo %, Result %) and CPU cost in seconds (BiBranch, Sequ)]
Settings: mean(fanout): 2 to 8 with mean(|T|): 50, or mean(|T|): 25 to 125 with mean(fanout): 4; size(label): 8


Sensitivity test (cont.)
[Figure: Range and KNN queries on N{4,0.5}N{50,2.0}L{}D0.05, plotting % of accessed data (BiBranch %, Histo %, Result %) and CPU cost in seconds (BiBranch, Sequ) against label number (8-64)]
Settings: size(label): 8 to 64; mean(|T|): 50; mean(fanout): 4

Queries with Different Parameters
DBLP data (avg. distance: 5.031)
Range queries (range: 1-10)
KNN queries (k: 5-20)
[Figure: % of accessed data (BiBranch %, Histo %, Result %) and CPU cost in seconds (BiBranch, Sequ) on DBLP, for range queries with range 1-10 and KNN queries with k 5-20]


Pruning Power of Different Levels
Data distribution according to distances
Edit distance
Histogram distance
Binary branch distance: 2, 3, 4 level
[Figure: distribution of DBLP data (counts 0-2000) against distance (1-12) under edit distance, histogram distance, and BiBranch(2), BiBranch(3), BiBranch(4)]

Citations on the Paper

Surprisingly, the paper attracts citations and questions from software
engineering! Expect more impact along the software mining
direction soon.

DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones
L. Jiang, G. Misherghi, Z. Su, S. Glondu - Proceedings of the 29th International
Conference on Software …, 2007
"Detecting code clones has many software engineering applications. Existing
approaches either do not scale to large code bases or are not robust against minor
code modifications. In this paper, we present an efficient ..."

Fast Approximate Matching of Programs for Protecting Libre/Open Source Software
by Using Spatial …
A. J. M. Molina, T. Shinohara - Source Code Analysis and Manipulation (SCAM), 2007
"To encourage open source/libre software development, it is desirable to have
tools that can help to identify open source license violations. This paper
describes the implementation of a tool that matches open source programs ..."


References
• Philip Bille. A survey on tree edit distance and related problems.
Theoretical Computer Science, Volume 337, Issues 1-3 (June 2005)
• Rui Yang, Panos Kalnis, Anthony K. H. Tung: Similarity Evaluation on
Tree-structured Data. SIGMOD 2005.

• Optional References:
• J.-P. Vert. A tree kernel to analyze phylogenetic profiles. Bioinformatics,
2002, Oxford Univ Press

Mining and Searching Complex Structures                     Chapter 5 Graph Similarity Search

Searching and Mining Complex
Structures
Graph Similarity Search
Anthony K. H. Tung(鄧锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung

Outline

• Introduction
• Foundation
• State of the Art on Graph Matching
•Exact Graph Matching
•Error-Tolerant Graph Matching
• Search Graph Databases
•Graph Indexing Methods
• Our Works
•Star Decomposition
•Sorted Index For Graph Similarity Search


Smart Graphs

[Figure: example graphs: chemical compound, protein structure, program flow, coil, image, fingerprint, letter, shape]

Motivation

• Why graph?
•Graph is ubiquitous
•Graph is a general model
•Graph has diversity
•Graph problem is complex and challenging
• Why graph search?
•Manifold application areas
•   2D and 3D image analysis
•   Video analysis
•   Document processing
•   Biological and biomedical applications


Graph Search

• Definition
•Given a graph database D and a query graph Q, find all
graphs in D supporting the users’ requirements:
•   The same as Q
•   Containing Q or contained by Q
•   Similar to Q
•   Similar to a subgraph of Q
• Challenge
•How to efficiently compare two graphs?
•How to reduce the number of pairwise graph comparisons?

How to efficiently compare two graphs?
• The graph matching problem
•Graph matching is the process of finding a
correspondence between the vertices and the edges of two
graphs that satisfies some (more or less stringent)
constraints ensuring that similar substructures in one graph
are mapped to similar substructures in the other.


How to reduce the number of pairwise
graph comparisons?
• Scalability issue
•A full database scan
•Complex graph matching between a pair of graphs

• Index mechanisms are needed

Outline

• Introduction
• Foundation
• State of the Art on Graph Matching
•Exact Graph Matching
•Error-Tolerant Graph Matching
• Search Graph Databases
•Graph Indexing Methods
• Our Works
•Star Decomposition
•Sorted Index For Graph Similarity Search


Categories of Matching

Exact Graph Matching

• Graph Isomorphism
•Two graphs G1=(V1,E1) and G2=(V2,E2) are isomorphic if
there is a bijective function f: V1 → V2 such that for all
u, v ∈ V1: {u,v} ∈ E1 ↔ {f(u),f(v)} ∈ E2

[Figure: two isomorphic graphs G1 and G2]

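The definition above can be checked directly, if very inefficiently, by enumerating all bijections. This brute-force sketch is only practical for tiny graphs and is not one of the algorithms discussed later; it exists purely to make the definition concrete.

```python
from itertools import permutations

def are_isomorphic(edges1, vertices1, edges2, vertices2):
    """Search for a bijection f: V1 -> V2 with {u,v} in E1 <=> {f(u),f(v)} in E2."""
    if len(vertices1) != len(vertices2) or len(edges1) != len(edges2):
        return False
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    for perm in permutations(vertices2):
        f = dict(zip(vertices1, perm))  # candidate bijection
        # check the edge-preservation condition in both directions
        if all((frozenset((f[u], f[v])) in e2) == (frozenset((u, v)) in e1)
               for u in vertices1 for v in vertices1 if u != v):
            return True
    return False
```

The O(|V|!) enumeration is exactly what the tree-search algorithms later in this chapter avoid by pruning.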

Exact Graph Matching

• Induced Subgraph
•A subset of the vertices of a graph together with all edges
whose endpoints are both in this subset

• Subgraph Isomorphism
•An isomorphism holds between one of the two graphs and
an induced subgraph of the other

Graph Similarity Measure

• Graph Edit Distance
•The minimum amount of distortion that is needed to
transform one graph into another
•The edit operations ei can be deletions, insertions, and
substitutions of vertices and edges

[Figure: a step-by-step edit path transforming G1 into G2]


Graph Similarity Measure

• Graph Edit Distance (GED)
•Given two attributed graphs G1 = (V1,E1,Σ,l1) and G2 =
(V2,E2,Σ,l2), the GED between them is defined as

   GED(G1,G2) = min over (e1, …, ek) ∈ T(G1,G2) of Σ_{i=1..k} c(ei)

•where T(G1,G2) denotes the set of edit paths transforming G1
into G2, and c denotes the edit cost function measuring the
cost c(ei) of edit operation ei
• GED provides a general dissimilarity measure for graphs
• Most works on inexact graph matching focus on the
GED computation problem


Outline

• Introduction
• Foundation
• State of the Art on Graph Matching
•Exact Graph Matching
•Error-Tolerant Graph Matching
• Search Graph Databases
•Graph Indexing Methods
• Our Works
•Star Decomposition
•Sorted Index For Graph Similarity Search

Exact Matching Algorithms

• Tree search based algorithms
•Ullmann’s algorithm
•VF and VF2 algorithm

• Other algorithms
•Nauty algorithm


Tree Search based Algorithms

• Basic Idea
•A partial match (initially empty) is iteratively expanded by
adding new pairs of matched vertices
•The pair is selected using some necessary conditions,
usually also some heuristic condition to prune unfruitful
search paths
•The algorithm ends when it finds a complete matching, or no
further vertex pairs may be added (backtracking)
•For attributed graphs, the attributes of vertices and edges can
be used to constrain the desired matching

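A minimal sketch of this partial-match expansion, written as a (non-induced) subgraph isomorphism search with a simple degree-based pruning condition. The adjacency-dict representation is an assumption for illustration; real implementations such as Ullmann's or VF2 use much stronger pruning.

```python
def subgraph_match(g1, g2):
    """Backtracking search: iteratively extend a partial vertex mapping of g1
    into g2, backtracking when no consistent pair can be added.
    Graphs are adjacency dicts: {vertex: set_of_neighbours}."""
    order = list(g1)

    def consistent(u, v, mapping):
        # every already-mapped neighbour of u must map to a neighbour of v
        return all(mapping[w] in g2[v] for w in g1[u] if w in mapping)

    def extend(mapping, used):
        if len(mapping) == len(order):
            return dict(mapping)          # complete matching found
        u = order[len(mapping)]
        for v in g2:
            if v in used or len(g2[v]) < len(g1[u]):
                continue                  # degree condition prunes this pair
            if consistent(u, v, mapping):
                mapping[u] = v
                used.add(v)
                found = extend(mapping, used)
                if found is not None:
                    return found
                del mapping[u]            # backtrack
                used.discard(v)
        return None

    return extend({}, set())
```

The `consistent` check is the "necessary condition" mentioned above, and the degree test is one example of a heuristic that prunes unfruitful search paths.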
The Backtracking Algorithm

• Depth-First Search (DFS):
•progresses by expanding the first child node of the search
tree
•goes deeper and deeper until a goal node is found, or until
it hits a node that has no children
[Figure: DFS visit order on a binary tree: 1; 2, 5, 6; 3, 4, 7, 8]
• Branch and Bound (B&B):
•BFS (breadth-first search)-like search for an optimal solution
•branch: a set of solution candidates is split into two
or more smaller sets
•bound: a procedure computes upper and lower bounds on the
best solution within each set
[Figure: BFS visit order on a binary tree: 1; 2, 3, 4; 5, 6, 7, 8]


Tree Search based Algorithms

• Ullmann's Algorithm (DFS)
•A refinement procedure based on a matrix of possible future
matched vertex pairs to prune unfruitful matches
•A simple enumeration algorithm for the isomorphisms
between a graph G1 and a subgraph of another graph G2, with
adjacency matrices A1 and A2
•An M' matrix with |V1| rows and |V2| columns can be used
to permute the rows and columns of A2 to produce a further
matrix P = M'(M'A2)^T. If (a1_{i,j} = 1) ⇒ (p_{i,j} = 1), then M' specifies an
isomorphism between G1 and a subgraph of G2.

Tree Search based Algorithms

• Ullmann's Algorithm
•Example for a permutation matrix
•The elements of M' are 1's and 0's, such that each row
contains exactly one 1 and each column contains at most one 1

P = M'(M'A2)^T, with

   M' = [1 0 0 0]        A2 = [0 1 0 0]
        [0 0 1 0]             [1 0 1 1]
        [0 1 0 0]             [0 1 0 0]
                              [0 1 0 0]

   (M'A2)^T = [0 0 1]        P = M'(M'A2)^T = [0 0 1]
              [1 1 0]                         [0 0 1]
              [0 0 1]                         [1 1 0]
              [0 0 1]

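The check P = M'(M'A2)^T against A1 can be reproduced mechanically; the matrices below are the slide's example, written with plain nested lists so no library is assumed.

```python
def mat_mul(a, b):
    # textbook matrix product over nested lists
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(row) for row in zip(*a)]

def ullmann_condition(m_prime, a1, a2):
    """M' encodes a subgraph isomorphism iff a1[i][j] == 1 implies p[i][j] == 1,
    where P = M'(M'A2)^T."""
    p = mat_mul(m_prime, transpose(mat_mul(m_prime, a2)))
    ok = all(p[i][j] == 1
             for i in range(len(a1)) for j in range(len(a1))
             if a1[i][j] == 1)
    return p, ok

A2 = [[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0]]
A1 = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
M  = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
P, found = ullmann_condition(M, A1, A2)
```

Running this reproduces P = [0 0 1; 0 0 1; 1 1 0] and confirms the condition holds for the example mapping.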

Tree Search based Algorithms

• Ullmann's Algorithm
•Construction of another matrix M0 with the same size as M'

   m0_{i,j} = 1 if deg(v2_j) >= deg(v1_i), 0 otherwise;   m0_{i,j} ∈ {0,1}

•Generation of all M' by keeping exactly one 1 in each row of M0
•A subgraph isomorphism has been found if (a1_{i,j} = 1) ⇒ (p_{i,j} = 1)

   G2:  A2 = [0 1 0 0]        G1:  A1 = [0 0 1]        M0 = [1 1 1 1]
             [1 0 1 1]                  [0 0 1]             [1 1 1 1]
             [0 1 0 0]                  [1 1 0]             [0 1 0 0]
             [0 1 0 0]

Tree Search based Algorithms

• Ullmann's Algorithm
•An example: starting from

   M0 = [1 1 1 1]
        [1 1 1 1]
        [0 1 0 0]

the search tree fixes one row of M0 per level:

Level 1 (fix row 1):
   [1 0 0 0]   [0 1 0 0]   [0 0 1 0]   [0 0 0 1]
   [1 1 1 1]   [1 1 1 1]   [1 1 1 1]   [1 1 1 1]
   [0 1 0 0]   [0 1 0 0]   [0 1 0 0]   [0 1 0 0]

Level 2 (fix row 2; row 3 is forced to [0 1 0 0]):
   [1 0 0 0]   [1 0 0 0]   [0 0 1 0]   [0 0 1 0]   [0 0 0 1]   [0 0 0 1]
   [0 0 1 0]   [0 0 0 1]   [1 0 0 0]   [0 0 0 1]   [1 0 0 0]   [0 0 1 0]
   [0 1 0 0]   [0 1 0 0]   [0 1 0 0]   [0 1 0 0]   [0 1 0 0]   [0 1 0 0]

For the leaf M' = [1 0 0 0; 0 0 1 0; 0 1 0 0]:

   P = M'(M'A2)^T = [0 0 1]   compared with   A1 = [0 0 1]
                    [0 0 1]                        [0 0 1]
                    [1 1 0]                        [1 1 0]

satisfies (a1_{i,j} = 1) ⇒ (p_{i,j} = 1), so a subgraph isomorphism is found.
[Figure: the search tree with the corresponding candidate mappings of G1's vertices onto G2's vertices]

Tree Search based Algorithms

• Ullmann’s Algorithm
•One of the most widely used algorithms
• VF or VF2
•VF defines a heuristic based on the analysis of vertices
•VF2 reduces the memory requirement from O(n^2) to O(n)
• Other methods: Nauty Algorithm
•Constructs the automorphism group of each of the input
graphs and derives a canonical labeling. The isomorphism
can be checked by verifying the equality of the adjacency
matrices

Exact Graph Matching

• Summary
•The matching problems are all NP-complete except for
graph isomorphism, which is not yet known to be either
NP-complete or solvable in polynomial time.
•Exact isomorphism is very seldom used. Subgraph
isomorphism can be effectively used in many contexts.
•Exact graph matching has exponential time complexity in
the worst case.
•Ullmann's algorithm, the VF2 algorithm and the Nauty algorithm
are the most widely used algorithms. Most modified algorithms
adopt some conditions to prune unfruitful partial matchings.


Error-Tolerant Graph Matching

• GED Computation
•Optimal algorithms
• Exact GED computation requires isomorphism testing
• Tree search based algorithms (A* based algorithms)
•Suboptimal algorithms
• Heuristic algorithms
• Formulated as a BLP problem

A* Algorithm
• A tree search based algorithm
•Similar to isomorphism testing
•Unlike isomorphism testing, the vertices of the source graph
can potentially be mapped to any vertex of the target graph

•The search tree is constructed dynamically by expanding
edges from the current vertex
•A heuristic function is usually used to
determine the vertex for expansion


Exact GED Computation
• Summary
•The complexity is exponential in the number of vertices of
the involved graphs.
•For graphs with unique vertex labels the complexity is linear.
•Exact graph edit distance is feasible for small graphs only.
•Several suboptimal methods have been proposed to speed up
the computation and make GED applicable to large graphs.

Bipartite Matching for GED
• A Heuristic Algorithm
•A new suboptimal procedure for the GED computation
based on Hungarian algorithm (i.e., Munkres’ Algorithm).
•Hungarian algorithm is used as a tree search heuristic.
•Much faster than the exact computation and the other
suboptimal methods
•Applicable to larger graphs


Bipartite Matching for GED
• Assignment Problem
•Find an optimal assignment of n elements in a set S1 =
{u1, …, un} to n elements in a set S2 = {v1, …, vn}
•Let cij be the cost of the assignment (ui → vj)
•The optimal assignment is a permutation P = (p1, …, pn) of
the integers 1, …, n that minimizes Σ_{i=1..n} c_{i,pi}

[Figure: bipartite assignment between S1 and S2 with edge costs c11, c12, c13, …]

Bipartite Matching for GED

• Assignment Problem
•Given the n × n matrix (cij) of the assignment costs
•This problem can be formulated as finding a set of n
independent elements of the cost matrix (one per row and one
per column) with minimum summation
[Figure: a bipartite graph between S1 and S2 with example edge costs]
•Hungarian algorithm finds the minimum cost assignment in
O(n^3) time.

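For intuition, the objective above can be solved by brute force over all permutations (O(n!)); the Hungarian algorithm reaches the same optimum in O(n^3). The cost matrix below is a made-up example.

```python
from itertools import permutations

def brute_force_assignment(cost):
    """Exhaustive solver for the assignment problem: find the permutation p
    minimising sum(cost[i][p[i]]). Illustrates the objective only; use the
    Hungarian algorithm for anything beyond tiny n."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for p in permutations(range(n)):
        c = sum(cost[i][p[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = p, c
    return best_perm, best_cost
```

In practice a library routine such as `scipy.optimize.linear_sum_assignment` would be used to solve the same problem in polynomial time.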

Bipartite Matching for GED

• Main Idea
•Construct a vertex cost matrix Mcv and an edge cost matrix
Mce
•For each open vertex v in the search tree, run Hungarian
algorithm on Mcv and Mce
•The accumulated minimum cost of both assignments serves
as a lower bound for the future costs to reach a leaf node
•h(P) = Hungarian(Mcv) + Hungarian(Mce) is the tree search
heuristic
•Returns a suboptimal solution as an upper bound of GED

Suboptimal Algorithms

• Binary Linear Programming (BLP)
•Use the adjacency matrix representation to formulate a BLP
•Compute GED between G0 and G1

•Edit grid


Binary Linear Programming

• Isomorphisms of G0 on the edit grid

• State vectors

Binary Linear Programming

• Definition:

• Objective Function:


Binary Linear Programming

• Lower Bound: linear program (O(n^7))

• Upper Bound: assignment problem (O(n^3))

Summary

• The complexity of the exact GED computation is
exponential and unacceptable in practice.

•    Suboptimal methods solve the graph matching problem
by quickly returning a suboptimal solution and can be
applied to larger graphs.

• An important application of the graph matching problem
is searching a graph database.


Outline

• Introduction
• Foundation
• State of the Art on Graph Matching
•Exact Graph Matching
•Error-Tolerant Graph Matching
• Search Graph Databases
•Graph Indexing Methods
• Our Works
•Star Decomposition
•Sorted Index For Graph Similarity Search

Graph Search Problem

• Query a graph database
•Given a graph database D and a query graph Q, find all
graphs in D supporting the users’ requirements.
•   Full graph search (all match )
•   Subgraph search (partial match or containment search)
•   Similarity full graph search (based on GED)
•   Similarity subgraph search (based on GED)


Scalability Issue

• On-line searching algorithm
•A full sequential scan (e.g., over 100,000 graphs)
•I/O costs
•Subgraph isomorphism testing (GED computation)
• An indexing mechanism is needed

Indexing Graphs

• Indexing is crucial
[Figure: filtering with an index: 100,000 database graphs are filtered down to about 100 candidates, which are checked to yield about 10 answers]


Indexing Strategy

• Filter-and-refine framework based on features

Step 1. Index Construction
Enumerate smaller units (features)
in the database; build an index
between units and graphs
Step 2. Query Processing
Enumerate smaller units in the
query graph
Use the index to first filter out
non-candidates, then check the
remaining candidates

Indexing Strategy

• Feature-based Indexing methods
•Break the database graphs into smaller units like paths, trees,
and subgraphs, and use them as filtering features
•Build inverted index between the smaller units and the
database graphs
•Filter graphs based on the number of smaller units or their
locality information

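A minimal sketch of such an inverted index, assuming each graph has already been decomposed into a set of feature strings (paths, trees, or subgraphs); the feature extraction itself is omitted, and the graph IDs are illustrative.

```python
from collections import defaultdict

def build_index(db_features):
    """db_features: {graph_id: set_of_features}. Returns the inverted index
    {feature: set_of_graph_ids}."""
    index = defaultdict(set)
    for gid, feats in db_features.items():
        for f in feats:
            index[f].add(gid)
    return index

def filter_candidates(index, query_features, db_features, max_missing=0):
    """Keep graphs missing at most `max_missing` of the query's features;
    with max_missing=0 this behaves like containment-style filtering, and a
    positive max_missing loosely mimics edit-distance-based relaxation."""
    candidates = []
    for gid in db_features:
        missing = sum(1 for f in query_features if gid not in index.get(f, set()))
        if missing <= max_missing:
            candidates.append(gid)
    return candidates
```

Only the surviving candidates go on to the expensive pairwise graph comparison.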

Feature-based Indexing Systems

System      Smaller units   Query
GraphGrep   path            Contain (containment search)
SING        path            Contain
gIndex      graph           Contain + edge relaxation
FGIndex     graph           Contain
TREE∆       tree+graph      Contain
TreePi      tree            Contain
κ-AT        tree            Full similarity search
CTree       -               Contain + edge relaxation

Path-based Algorithms

[Figure from http://ews.uiuc.edu/~xyan/tutorial/kdd06_graph.htm]


Path-based Algorithms

[Figure from http://ews.uiuc.edu/~xyan/tutorial/kdd06_graph.htm]

Path-based Algorithms: problem

[Figure from http://ews.uiuc.edu/~xyan/tutorial/kdd06_graph.htm]


Feature-based Methods: limitation

• Problem:
•For similarity search, filtering is done by inferring the edit
distance bound through the smaller units that exactly match
the query structure
• A rough bound
• Not effective for large graphs (because features that may be
rare in small graphs are likely to be found in enormous graphs
just by chance)

Outline

• Introduction
• Foundation
• State of the Art on Graph Matching
•Exact Graph Matching
•Error-Tolerant Graph Matching
• Search Graph Databases
•Graph Indexing Methods
• Our Works
•Star Decomposition
•Sorted Index For Graph Similarity Search


Graph Similarity Search Problem

• Definition
•Given a graph database D and a query structure Q, similarity search is
to find all the graphs in D that are similar to Q based on GED.

• Two challenges in the filter-and-refine framework:
•How to efficiently compute more effective edit distance
bounds between two graphs for filtering?
•How to reduce the number of pairwise graph dissimilarity
computations to speed up the graph search?

Our Solutions

• Work 1: Star decomposition
•Break each graph into a multiset of stars
•Propose new, effective and efficient lower and upper GED
bounds by finding a mapping between the star multisets of
two graphs using the Hungarian algorithm

• Work 2: Sorted index for graph similarity search
•Propose a novel indexing and query processing framework
•Deploy a filtering strategy based on TA and CA methods to
reduce the number of pairwise graph dissimilarity
computations


Outline

• Introduction
• Foundation
• State of the Art on Graph Matching
•Exact Graph Matching
•Error-Tolerant Graph Matching
• Search Graph Databases
•Graph Indexing Methods
• Our Works
•Star Decomposition
•Sorted Index For Graph Similarity Search

Comparing Stars: On
Approximating Graph Edit
Distance

Zhiping Zeng
Anthony K.H. Tung
Jianyong Wang
Jianhua Feng
Lizhu Zhou


Star Decomposition

• Star structure
•A star structure s is an attributed, single-level, rooted tree
which can be represented by a 3-tuple s=(r, L, l), where r is
the root vertex, L is the set of leaves and l is a labeling
function.
• Star representation for graph
•A graph can be broken into a multiset of star structures

(Figure: a labelled graph G1 and the multiset of star structures obtained from it.)

Star Decomposition

• Star edit distance
•Given two star structures s1 and s2,
•      λ(s1, s2) = T(r1, r2) + d(L1, L2)
•Where T(r1, r2) = 0 if l(r1) = l(r2); otherwise T(r1, r2) = 1
•     d(L1, L2) = ||L1| − |L2|| + M(L1, L2)
•     M(L1, L2) = max{| ΨL1|, | ΨL2|} − |ΨL1∩ΨL2|

Example: s1 has root a with leaves {b, c, c}; s2 has root d with leaves {c, c}.
T(r1, r2) = 1, as l(a) ≠ l(d);
d(L1, L2) = |3 − 2| + (3 − 2) = 2;
λ(s1, s2) = 1 + 2 = 3.
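The star edit distance defined above can be sketched in a few lines; the function name and the pair representation (root label, leaf labels) are illustrative choices, not part of the original work:

```python
from collections import Counter

def star_edit_distance(s1, s2):
    """lambda(s1, s2) = T(r1, r2) + d(L1, L2), as defined on this slide.

    Each star is a pair (root_label, leaf_labels)."""
    r1, L1 = s1
    r2, L2 = s2
    T = 0 if r1 == r2 else 1           # root relabelling cost
    c1, c2 = Counter(L1), Counter(L2)
    inter = sum((c1 & c2).values())    # |Psi_L1 ∩ Psi_L2| as multisets
    M = max(len(L1), len(L2)) - inter
    d = abs(len(L1) - len(L2)) + M
    return T + d

# The slide example: s1 = (a; b, c, c), s2 = (d; c, c)
print(star_edit_distance(("a", ["b", "c", "c"]), ("d", ["c", "c"])))  # 3
```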


Star Decomposition

• Mapping distance
•Given two multisets of star structures S(G1) and S(G2) from
two graphs G1 and G2 with the same cardinality, assume
P: S(G1) → S(G2) is a bijection. The mapping distance
between G1 and G2 is
•     μ(G1, G2) = min over all bijections P of Σ λ(si, P(si)), where si ranges over S(G1)

•This problem can be formulated as the assignment problem.
Given a distance cost matrix between the two star multisets, the
mapping distance can be computed using the Hungarian
algorithm.
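As a sketch of the assignment-problem view, the snippet below finds the minimum-cost bijection by brute force over permutations; the Hungarian algorithm computes the same result in O(n³). The cost matrix and function name are hypothetical:

```python
from itertools import permutations

def mapping_distance(cost):
    """Minimum-cost bijection between two equal-size star multisets.

    cost[i][j] = star edit distance between star i of S(G1) and
    star j of S(G2). Brute force for illustration only; the
    Hungarian algorithm solves the same problem in O(n^3)."""
    n = len(cost)
    return min(sum(cost[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

# Hypothetical 3x3 star-distance matrix between S(G1) and S(G2)
cost = [[0, 2, 3],
        [2, 1, 4],
        [3, 3, 0]]
print(mapping_distance(cost))  # 1
```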

A Simple Example


Bounds of GED

• Lower Bound
•Let G1 and G2 be two graphs. Then the mapping distance
μ(G1, G2) between them satisfies
•     μ(G1, G2) ≤ max{4, min{δ(G1), δ(G2)} + 1} · λ(G1, G2)

• Based on the above Lemma, μ provides a lower bound
Lm of λ, i.e.,
•     Lm(G1, G2) = μ(G1, G2) / max{4, min{δ(G1), δ(G2)} + 1}

Constructing the cost matrix takes Θ(n³), and running
the Hungarian algorithm takes O(n³).

Bounds of GED

• Upper bound
•The first upper bound τ comes naturally during the
computation of μ
•The output of the Hungarian algorithm used to compute μ
leads to a mapping P’ from V(G1) to V(G2)
•Recall that in the BLP method, the exact GED is computed as
•     λ(G1, G2) = min over all mappings P of C(G1, G2, P)
•Therefore, τ(G1, G2) = C(G1, G2, P’) is a natural upper bound

The mapping P’ might not be optimal, so τ(G1, G2) ≥ λ(G1, G2).
C(G1, G2, P’) can be computed in Θ(n²) time; therefore, τ can be computed in
Θ(n³) time.


Bounds of GED

• Refined upper bound ρ: main idea
•Given any two vertices v1 and v2 in G1 and their
corresponding mappings f(v1) and f(v2) in G2 (assuming f is
the mapping function corresponding to P’), we swap f(v1)
and f(v2) if this reduces the edit distance.

(Figure: swapping the mapped vertices of two small example graphs; ε denotes a dummy vertex.)

The new mappings obtained might lead to better or worse
bounds. Refining to obtain a better bound takes O(n⁶) time.

Filtering Strategy

• Integrating all the GED bounds into a filter-and-refine
framework
•Filtering features: Lm ≤ λ ≤ ρ ≤ τ.
•Filtering orders: bounds with lower computation complexity
are deployed first.


Full Graph Similarity Search

• Problem
•Given a graph database D and a query structure Q, find all
the graphs Gi in D with λ(Q, Gi) ≤ d (d is a threshold).

• AppFULL algorithm:
•if Lm(Q, Gi) > d, Gi can be safely filtered;
•if τ(Q, Gi) ≤ d, Gi can be reported as a result directly;
•if ρ(Q, Gi) ≤ d, Gi can be reported as a result directly;
•otherwise, λ(Q, Gi) must be computed.
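The AppFULL cascade above might be sketched as follows, with the bound functions passed in as callables; all names are illustrative, and cheaper bounds are applied first:

```python
def app_full(query, db, d, lower, upper_tau, upper_rho, exact_ged):
    """Sketch of the AppFULL filter-and-refine loop described above.

    lower, upper_tau, upper_rho and exact_ged stand for Lm, tau,
    rho and the exact GED; exact GED runs only for survivors."""
    results = []
    for g in db:
        if lower(query, g) > d:        # Lm filter: safely prune
            continue
        if upper_tau(query, g) <= d:   # tau: accept without exact GED
            results.append(g)
            continue
        if upper_rho(query, g) <= d:   # rho: refined upper bound
            results.append(g)
            continue
        if exact_ged(query, g) <= d:   # fall back to exact computation
            results.append(g)
    return results
```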

Subgraph exact Search

• Lemma
•Given two graphs G1 and G2, if no vertex relabelling is
allowed in the edit operations, then μ’(G1, G2) ≤ 4 · λ’(G1, G2),
where μ’ and λ’ are computed without vertex relabelling.
•(This Lemma can be used in subgraph search, because if a
graph is subgraph-isomorphic to another graph, no vertex
relabelling happens.)
• AppSUB algorithm:
•Filtering based on the lower bound μ’(Q, Gi) / 4.


Experimental Results

• Compare with the exact algorithm

1,000 graphs were generated, D = 1k,T = 10,V = 4.
Randomly select 10 seed graphs to form D; a seed has 10 vertices.
6 query groups. Each group has 10 graphs. Graphs in the same
group have the same number of vertices.

Experimental Results

• Compare with the BLP method


Experimental Results

• Scalability over real datasets

Experimental Results

• Scalability over synthetic datasets


Experimental Results

• Performance of AppFULL

Experimental Results

• Performance of AppSUB


Outline

• Introduction
• Foundation
• State of the Art on Graph Matching
•Exact Graph Matching
•Error-Tolerant Graph Matching
• Search Graph Databases
•Graph Indexing Methods
• Our Works
•Star Decomposition
•Sorted Index For Graph Similarity Search

SEGOS: SEarch similar
Graphs On Star index

Xiaoli Wang
Xiaofeng Ding
Anthony K.H. Tung
Shanshan Ying
Hai Jin


Our Solutions

• Work 1: Scalability issue
•The star-decomposition approach requires a full database scan
•An index mechanism is needed
• Existing indexing methods: limited filtering power
•Rough bounds with poor filtering power

• Work 2: Sorted index for graph similarity search
•Propose a novel indexing and query processing framework
•Deploy a filtering strategy based on the TA and CA methods
•All existing lower and upper GED bounds can be directly
integrated into our filtering framework

TA Method on the Top-k Query

• The database model used in TA

Object ID   A1     A2          Sorted L1    Sorted L2
a           0.9    0.85        (a, 0.9)     (d, 0.9)
b           0.8    0.7         (b, 0.8)     (a, 0.85)
c           0.72   0.2         (c, 0.72)    (b, 0.7)
d           0.6    0.9         …            …
…           …      …           (d, 0.6)     (c, 0.2)

(N objects, each with M attribute scores; each sorted list Li is in descending score order.)


TA method on the top-k query

• A simple query
•Find the top-2 objects for the ‘query’ on ‘A1 & A2’
•To answer this query, the TA method combines the scores of
A1 and A2 by an aggregation function such as sum(A1, A2)

Aggregation function:
a function that gives each object an overall score based on its attribute
scores
examples: sum, min
The aggregation function must be monotone!

Monotonicity on TA (Halting Condition)

• Main idea
•How do we know that the scores of objects seen so far are higher than
those of unseen objects? Predict the maximum possible score of unseen objects:

L1: a: 0.9, b: 0.8, c: 0.72 (seen so far), …, f: 0.65, …, d: 0.6
L2: d: 0.9, a: 0.85, b: 0.7 (seen so far), …, f: 0.6, …, c: 0.2

Threshold value ω = sum(0.72, 0.7) = 1.42: no unseen object (e.g. f) can have an overall score above ω.


A Top-2 Query Example
• Given 2 sorted lists for attributes A1 and A2:

L1: (a, 0.9), (b, 0.8), (c, 0.72), …, (d, 0.6)
L2: (d, 0.9), (a, 0.85), (b, 0.7), …, (c, 0.2)

• Step 1
•Sorted access to attributes from every sorted list in parallel
•For each object seen:
• get all of its scores by random access
• determine sum(A1, A2)
• amongst the 2 highest seen? keep it in the buffer
• Step 2
•Determine the threshold value from the scores last seen under
sorted access: ω = sum(L1, L2)
•2 buffered objects with overall score ≥ threshold value ω? Stop
•else go to the next entry position in the sorted lists and go to step 1

Running the example:
•Depth 1: sorted access sees a (in L1) and d (in L2); random accesses give
sum(a) = 0.9 + 0.85 = 1.75 and sum(d) = 0.6 + 0.9 = 1.5. Threshold
ω = sum(0.9, 0.9) = 1.8; no buffered object reaches 1.8, so continue.
•Depth 2: sorted access sees b in both lists; sum(b) = 0.8 + 0.7 = 1.5.
Threshold ω = sum(0.8, 0.85) = 1.65; only a (1.75) reaches it, so continue.
•Depth 3: sorted access sees c; sum(c) = 0.72 + 0.2 = 0.92.
Threshold ω = sum(0.72, 0.7) = 1.42.

Situation at stopping:
ω = sum(0.72, 0.7) = 1.42 < 1.5, so both buffered objects a (1.75) and
d (1.5) meet the halting condition; the top-2 answers are a and d.
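The walkthrough above can be reproduced with a small TA sketch; the function name and data layout are illustrative:

```python
from heapq import nlargest

def ta_top_k(lists, scores, k):
    """Threshold Algorithm for top-k, mirroring the example above.

    lists  - sorted lists of (object, score), descending by score
    scores - random-access table: scores[obj] = tuple of all scores"""
    seen = {}
    for depth in range(len(lists[0])):
        for lst in lists:
            obj, _ = lst[depth]
            if obj not in seen:                # random access all scores
                seen[obj] = sum(scores[obj])
        top = nlargest(k, seen.items(), key=lambda kv: kv[1])
        threshold = sum(lst[depth][1] for lst in lists)
        if len(top) >= k and all(s >= threshold for _, s in top):
            return [obj for obj, _ in top]     # halting condition met
    return [obj for obj, _ in nlargest(k, seen.items(), key=lambda kv: kv[1])]

# The slide example: TA stops at depth 3 with top-2 = a, d
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
scores = {"a": (0.9, 0.85), "b": (0.8, 0.7), "c": (0.72, 0.2), "d": (0.6, 0.9)}
print(ta_top_k([L1, L2], scores, 2))  # ['a', 'd']
```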

TA-based Filtering Strategy for Graph
Search Problem
• Main idea
•Each graph is broken into a multiset of stars
•Each distinct star generated from the database graphs can be
seen as an index attribute in the TA database model
•Each entry in the sorted lists contains the graph identity
(denoted by gi) and its score (denoted by λ) in that star
attribute, the score is defined as the star edit distance between
a star of gi and the index star
•Halting condition: given m sorted lists, TA stops once the
aggregated value ω = sum(λ1,…, λm) ≥ d, where d is the
threshold on the graph mapping distance.


TA-based Filtering Strategy for Graph
Search Problem
• Challenges:
•How do we know that the mapping distances of unseen graphs are larger
than the threshold (so that these graphs can be safely filtered out)?
Predict the minimum possible mapping distance of unseen graphs:

L1: g1: 0, g2: 1, g3: 2 (seen so far), …, g6: 5, …, g4: 6
L2: g4: 0, g1: 1, g2: 3 (seen so far), …, g6: 5, …, g3: 9

Threshold value ω = sum(2, 3) = 5 > d (= 4): every unseen graph has
mapping distance at least ω, so all unseen graphs can be filtered out.

TA-based Filtering Strategy for Graph
Search Problem
• A graph database with a query example

(Figure: three score-sorted lists L1, L2 and L3 built for the query.)


Requirement

• An index structure
•Convenient for score-sorted list construction
• Efficient star search algorithm
•Quickly return similar stars to a query star
• Sorted properties for the halting condition of TA
•The mapping distance of any unseen graph gi satisfies
•     μ(q, gi) ≥ ω = sum(λ1,…, λm) > d (= τ·δ’)
•q is the query graph, τ is the distance threshold, and
•     δ’ = max{4, min{δ(q), δ(D’)} + 1}
•where D’ is the set of all unseen graphs.

Recall that the mapping distance in our previous work satisfies:
•     μ(q, gi) ≤ max{4, min{δ(q), δ(gi)} + 1} · λ(q, gi)
•We denote δ(q, gi) = max{4, min{δ(q), δ(gi)} + 1}; then δ(q, gi) ≤ δ’.
•If μ(q, gi) > τ·δ’, then λ(q, gi) > τ·δ’ / δ(q, gi) ≥ τ,
and this graph can be safely filtered out.


Build Inverted Index Structures based
on the Star Decomposition
• The upper-level index
•Build an inverted index between stars and graphs
•Used to quickly return graph lists
• The lower-level index
•Build an inverted index between labels and stars
•Used to construct the sorted lists for
•the top-k star search based on TA
•the filtering strategy
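A minimal sketch of the two-level inverted index, assuming stars are represented as (root label, sorted leaf labels); all names and the toy database are illustrative:

```python
from collections import defaultdict

def build_indexes(db):
    """Two-level inverted index on the star decomposition.

    db maps graph id -> list of stars (root_label, leaf_labels).
    Upper level: star  -> graphs containing it (fast graph-list lookup).
    Lower level: label -> stars whose root or leaves carry that label
    (used to construct the score-sorted lists for top-k star search)."""
    star_to_graphs = defaultdict(set)
    label_to_stars = defaultdict(set)
    for gid, stars in db.items():
        for root, leaves in stars:
            star = (root, tuple(sorted(leaves)))   # canonical star key
            star_to_graphs[star].add(gid)
            for label in (root, *leaves):
                label_to_stars[label].add(star)
    return star_to_graphs, label_to_stars

# Hypothetical two-graph database
db = {"g1": [("a", ["b", "c"]), ("b", ["a"])],
      "g2": [("a", ["b", "c"])]}
upper, lower = build_indexes(db)
print(sorted(upper[("a", ("b", "c"))]))  # ['g1', 'g2']
```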


Build Inverted Index Structures based
on the Star Decomposition

Top-k Star Search Algorithm

• Construct sorted lists


Graph Score-sorted Lists

• Construct lists based on the top-k results

TA-based Graph Range Query

• Definition
•Given a graph database D and a query q, find all gi ∈ D that
are similar to q with λ(q, gi) ≤ τ. τ is the distance
threshold.
• Steps: given m sorted lists for a query graph q
•Step 1: Perform sorted retrieval in a round-robin schedule over the
sorted lists. For a retrieved graph gi, if Lm(q, gi) > τ, filter out
the graph; if Um(q, gi) ≤ τ, report the graph to the answer set.
•Step 2: For each sorted list SLj, let χj be the distance
last seen under sorted access. If ω = sum(χ1,…, χm) >
τ·δ’, then halt. Otherwise, go to step 1.
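The two steps above might be sketched as follows; the callables `lower` and `upper` stand in for the Lm and Um bounds and `delta_p` for δ’, all hypothetical names:

```python
def ta_range_query(sorted_lists, tau, delta_p, lower, upper, exact):
    """Sketch of the TA-based range query loop described above.

    sorted_lists - m lists of (graph id, star distance), ascending
    lower/upper  - callables standing in for the Lm and Um bounds
    exact        - callable for the exact check on surviving graphs
    Halting test: threshold omega exceeds tau * delta_p."""
    answers, seen = set(), set()
    for depth in range(len(sorted_lists[0])):
        for lst in sorted_lists:                 # round-robin sorted access
            g, _ = lst[depth]
            if g in seen:
                continue
            seen.add(g)
            if lower(g) > tau:                   # Lm filter: prune
                continue
            if upper(g) <= tau or exact(g) <= tau:
                answers.add(g)                   # Um accept or exact verify
        omega = sum(lst[depth][1] for lst in sorted_lists)
        if omega > tau * delta_p:                # unseen graphs are safe
            break
    return answers
```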


CA-based Filtering Strategy

• The difference between TA and CA
•TA computes the mapping distance between two graphs as soon as a
new graph is retrieved through sorted access

•CA defers this work: at every depth h of the sorted scan, for seen but
unprocessed graphs, it first uses estimated mapping distance
bounds to filter graphs; then it uses the incremental
Hungarian algorithm to compute the partial mapping
distances for further filtering

CA-based Filtering Strategy

• Suppose l(g) = {l1,…,ly} ⊆ {1,2,…,m} is the set of lists
in which g has been seen below q. Let χ(g) be the multiset of
distances of the distinct stars of g last seen in the known lists.
•The lower bound, denoted Lμ(q, g), is obtained by substituting
the missing lists j ∈ {1,2,…,m}\l(g) with χj (the distance
last seen under the jth list) in ζ(q, g)
•The upper bound, denoted Uμ(q, g), is computed as
•     Uμ(q, g) = t′(χ(g)) + χ∗ · (|g| − |χ(g)|)
• Theorem: Let g1 and g2 be two graphs. The bounds
obtained as above satisfy
•     ζ(g1, g2) ≤ Lμ(g1, g2) ≤ μ(g1, g2) ≤ Uμ(g1, g2)


CA-based Filtering Strategy

• Dynamic Hungarian algorithm for the partial mapping distance
•Given m sorted lists for q, suppose S′(g) ⊆ S(g) is the
multiset of stars of g seen so far under the lists. Then we have
μ(S(q), S′(g)) ≤ μ(q, g)

CA-based Graph Range Query

• Steps: given m sorted lists for a query graph q
•Perform sorted retrieval in a round-robin schedule over the
sorted lists. At each depth h of the lists:
• Maintain the lowest values χ1, …, χm encountered in the
lists. Maintain a distance accumulator ζ(q, gi) and a multiset
of retrieved stars S′(gi) ⊆ S(gi) for each gi seen under the lists.
• For each gi that is retrieved but unprocessed: if ζ(q, gi) > τ·δgi,
filter it out; if Lμ(q, gi) > τ·δgi, filter it out; if Uμ(q, gi) ≤ τ·δgi,
add the graph to the candidate set; otherwise, if μ(S(q), S′(gi))
> τ·δgi, filter out the graph. Finally, run the dynamic
Hungarian algorithm to obtain Lm(q, gi) and Um(q, gi) for filtering.
•Whenever a distance is updated, compute a new ω. If ω =
t′(χ) > τ·δ′, then halt. Otherwise, go to step 1.


Experimental Results: Sensitivity test

Experimental Results: Index construction


Experimental Results: compare with other
works varying distance thresholds

Experimental Results: compare with other
works varying dataset sizes


References
• D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years
of graph matching in pattern recognition. IJPRAI, 2004.
• P. Foggia, C. Sansone and M. Vento. A performance
comparison of five algorithms for graph isomorphism. In 3rd
IAPR-TC15 workshop on graph-based representations in
pattern recognition, 2001.
• K. Riesen, M. Neuhaus, and H. Bunke. Bipartite graph
matching for computing the edit distance of graphs. In GBRPR,
2007.
• P. Hart, N. Nilsson, and B. Raphael. A formal basis for the
heuristic determination of minimum cost paths. IEEE Trans.
SSC, 1966.

References
• D. Justice. A binary linear programming formulation of the
graph edit distance. IEEE TPAMI, 2006.
• R. Giugno and D. Shasha. Graphgrep: A fast and universal
method for querying graphs. In ICPR, 2002.
• R. D. Natale, A. Ferro, R. Giugno, M. Mongiovì, A. Pulvirenti,
and D. Shasha. SING: subgraph search in non-homogeneous
graphs. BMC Bioinformatics, 2010.
• X. Yan, P.S. Yu, and J. Han. Graph indexing: a frequent
structure-based approach. In SIGMOD, 2005.
• J. Cheng, Y. Ke, W. Ng, and A. Lu. Fg-index: towards
verification-free query processing on graph databases. In
SIGMOD, 2007.


References
• D.W. Williams, J. Huan, and W. Wang. Graph database
indexing using structured graph decomposition. In ICDE, 2007.
• S. Zhang, M. Hu, and J. Yang. Treepi: a novel graph indexing
method. In ICDE, 2007.
• P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: tree + delta
>= graph. In VLDB, 2007.
• G. Wang, B. Wang, X. Yang, and G. Yu. Efficiently indexing
large sparse graphs for similarity search. IEEE TKDE, 2010.

Mining and Searching Complex Structures                            Chapter 6 Massive Graph Mining

Searching and Mining Complex
Structures
Massive Graph Mining
Anthony K. H. Tung(鄧锦浩)
School of Computing
National University of Singapore
www.comp.nus.edu.sg/~atung

Graph applications: everywhere

And often, they are huge and messy.

(Figures: a social network, a bio pathway, and a co-authorship network.)


Knowledge: NOWHERE

Unless we manage to find where they hide.
Too many clues is like no clue.

Part 1 (1.5 hrs)
•Graph Mining Primer
•Recent advances in Massive Graph Mining
Part 2 (1.5 hrs)
•CSV: cohesive subgraph mining
•DN-graph mining: a triangle-based approach


• Graph Mining Primer
•   Data mining vs. Graph mining
•   Massive graph mining domain
•   Types of graph patterns
•   Properties of large graph structure
• Recent advances in Massive Graph Mining
• CSV: cohesive subgraph mining
• DN-graph mining: a triangle-based approach

From Data Mining to Graph Mining

Data Mining
• Classification
• Clustering
• Association rule learning

Graph Mining
• Captures more complicated entity relationships.
• Output: patterns, which are smaller subgraphs with
interpretable meanings.


Massive graph mining domains
•   Financial data analysis
•   Bioinformatics networks
•   User profiling for customized search
•   Identifying financial crime

Financial data analysis
In the stock market, correlations among stocks help in making profit.
Mining stock correlation graphs helps in predicting stocks' price
changes for estimating future returns, allocating portfolios,
controlling risks, etc.

(Figures: stock correlations in tabular form; stock correlation patterns highlighting highly correlated stock sets.)


Bioinformatics network

•Protein-protein interactions
• Fundamental to the activities of numerous living cells.
• A dense graph pattern indicates that these proteins
have similar functionalities.

(Figure: one representation of an assembled NEDD9 network.)


User profiling for customized search
The Internet Movie Database (IMDB)
Registered users can comment on movies of their interest.
Mining the comment-sharing network provides insight into users'
interests and thus further facilitates customized search.

(Figure: a movie-centric view of the IMDB review network.)

Identify financial crime

Large classes of financial crimes, such as money laundering, leave
characteristic patterns in transaction and relationship graphs.

(Figures: geospatial information of suspects; a money laundering pattern.)


Dense Graph Patterns
Clique/Quasi-Clique
A clique represents the highest level of internal interaction.
A quasi-clique is an ``almost'' clique with a few missing edges.
High Degree Patterns
Concern the average vertex degree, i.e., the number of
edges incident to a vertex.

Dense graph patterns (cont.)
Dense Bipartite Patterns                 Heavy Patterns

(Figures: bipartite graph of pathways and genes for the AML/ALL dataset; weighted, directed graph of an online citation network, by Rosvall & Bergstrom.)


Properties of large graph structure
Static
•Power-law degree distributions.
•Small-world phenomenon.
•Communities and clusters.
Dynamic
•Shrinking diameters of growing graphs.
•Densification over time.

Power law


Large graph: properties and laws (cont.)
Dynamic
•Shrinking diameters of growing graphs.
•Densification over time.

• Graph Mining Primer
• Large graph: properties and laws
• Approaches in Graph mining
• Pattern based Mining algorithms
• Practical techniques in Massive Graph Mining
• Graph summarization with randomized sampling
•Connectivity based traversal
•MapReduce based
• CSV: cohesive subgraph Mining
• Dngraph mining: a triangle based approach


Pattern based Mining algorithms
Greedy methods
SUBDUE (PWKDD04), GBI (JAI94)
Apriori-based approaches (details in the next few slides)
AGM, FSG, gSpan
Inductive logic programming (ILP) oriented solutions
WARMR, FARMER
Kernel-based solutions
Kernels for graph classification

Apriori-based Graph Mining
Generate candidate subgraphs in a level-wise manner.
Use a lattice structure to count candidate subgraph
sets efficiently.

(Figure: a search lattice for item set mining.)


Apriori-based Graph Mining
Performance bottleneck: candidate subgraph generation.
Solution:
1. Build a lexicographic order among graphs.
2. Search using depth-first strategy.
Very effective in mining large collections of small- to medium-
size graphs.

Graph summarization with randomized
sampling
• Efficient Aggregation for Graph Summarization –
SIGMOD 2008
• Graph Summarization with Bounded Error-SIGMOD
2008
• Mining graph patterns efficiently via randomized
summaries - VLDB 2009


Efficient Aggregation for Graph
Summarization
As graph size increases, summarization becomes
crucial for visualizing the whole graph.
Criteria for an efficient summarization solution:
Able to produce meaningful summaries for real
applications.
Scalable to large graphs.
The choice: graph aggregation.

Graph Aggregation
1. Summarization based on user-selected node attributes and
relationships.
2. Produces summaries with controllable resolutions:
“drill-down” and “roll-up” abilities to navigate.
Two aggregation operations are proposed: SNAP and k-SNAP.


Operation SNAP
Group nodes by user-selected node attributes & relationships
Nodes in each group are homogenous (in terms of attributes
and relationships).
Goal: minimum # of groups

How does SNAP work?

Top down approach
Initial Step: Use user selected attributes to group nodes.
Iterative Step:
If a group is not homogeneous w.r.t. relationships, split the
group based on its relationships with other groups.
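The top-down SNAP iteration above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `attr` maps each node to one user-selected attribute value, edges are undirected, and a group's "relationship signature" is simply the set of groups its members connect to.

```python
from collections import defaultdict

def snap(nodes, attr, edges):
    """Minimal SNAP sketch: returns a group label per node.
    nodes: node ids; attr: dict node -> attribute value;
    edges: set of frozenset({u, v}) undirected edges."""
    # Initial step: group nodes by the user-selected attribute.
    group = {v: attr[v] for v in nodes}
    adj = defaultdict(set)
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:
        changed = False
        # Signature of a node = the set of groups it has edges to.
        sig = {v: frozenset(group[n] for n in adj[v]) for v in nodes}
        for g in set(group.values()):
            members = [v for v in nodes if group[v] == g]
            if len({sig[v] for v in members}) > 1:
                # Group not homogeneous w.r.t. relationships: split it.
                for v in members:
                    group[v] = (g, sig[v])
                changed = True
    return group
```

Splitting stops exactly when every group is homogeneous with respect to its relationships, which is the SNAP termination condition.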


SNAP limitation
Homogeneity requirement for relationships
Noise and uncertainty

Users have no control over the resolutions of summaries
SNAP operation can result in a large number of small groups

Operation k-SNAP
The entities inside a group are not necessarily
homogenous in terms of relationships with other
groups.
Users can control resolution by specifying k (#
groups).
Varying k provides “drill-down” and “roll-up”
abilities.


Assess quality of summarization

Determined by the sum of noisy relations, Δ.
When the relationship between two groups is strong
(participation > 50%), count the missing participants.
When the relationship between two groups is weak
(participation <= 50%), count the extra participants.

k-SNAP goal
Find the summary of size k with the best quality,
i.e. minimal Δ.
The exact solution minimizing Δ is NP-complete, so heuristics are used:
Top down: split one group into 2 at each iteration;
choose the group with the worst quality and split it.
Bottom up: merge 2 groups into 1;
choose groups with the same attribute values, similar neighbors, or similar
participation ratios.


Major results

[Figure: double-blind review's effect on LP authors]
[Figure: k-SNAP, top down vs. bottom up]


Graph Summarization with Bounded
Error
Large graph data needs compression;
compression can reduce size to 1/10 (web graph).
Graph compression vs. clustering:

Compression: uses URLs and node labels for compression;
the result lacks meaning.
Clustering: works for generic graphs;
gives no compression for space saving.

Solution: MDL Based
Compression for Graphs

Intuition
Many nodes have similar
neighborhoods
• Communities in social networks.
Collapse such nodes into supernodes (clusters)
and the edges into superedges:
• a bipartite subgraph becomes two
supernodes and a superedge;
• a clique becomes a supernode with a
“self-edge”.

[Figure: graph on nodes a..g summarized into supernodes
X = {d,e,f,g} and Y = {a,b,c}]


How to choose vertex sets to compress

[Figure: graph on nodes a..j; stored directly, cost = 14 edges]

MDL based compression:
S is a high-level summary graph;
C is a set of edge corrections;
minimize the cost of S + C.
Novel approximate representation:
reconstructs the graph with bounded error
(ε); results in better compression.

[Figure: summary with supernodes X = {d,e,f,g} and Y = {a,b,c};
corrections +(a,h), +(c,i), +(c,j), -(a,d); cost = 5
(1 superedge + 4 corrections)]

Compress (cont.)

Summary S(VS, ES):
each supernode v represents a set of nodes Av;
each superedge (u,v) represents
all pairs of edges πuv = Au x Av.
Corrections C: {(a,b): a and b are nodes of G},
e.g. C = {+(a,h), +(c,i), +(c,j), -(a,d)}.
Supernodes are the key choice;
superedges and corrections then follow easily.
Let Auv be the actual edges of G between Au and Av:
cost with (u,v) = 1 + |πuv - Auv|
cost without (u,v) = |Auv|
Choosing the minimum decides whether edge
(u,v) is in S.
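The per-superedge cost rule can be made concrete. A hedched sketch (the names `Au`, `Av`, `edges` are illustrative): keep superedge (u,v) when one superedge plus negative corrections costs less than listing the actual edges as positive corrections.

```python
def superedge_cost(Au, Av, edges):
    """Decide whether to keep superedge (u, v) in the summary S.
    Au, Av: sets of original nodes in the two supernodes.
    edges: set of frozenset({a, b}) edges of the original graph G."""
    pi_uv = {frozenset((a, b)) for a in Au for b in Av if a != b}  # all pairs
    A_uv = {e for e in pi_uv if e in edges}   # actual edges of G
    cost_with = 1 + len(pi_uv - A_uv)   # 1 superedge + negative corrections
    cost_without = len(A_uv)            # positive corrections only
    keep = cost_with < cost_without
    return keep, min(cost_with, cost_without)
```

For a fully connected pair of supernodes the superedge wins outright; for a sparse pair the corrections alone are cheaper.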


Reconstruct

Reconstructing the graph from R:
for all superedges (u,v) in S, insert all pairs of edges πuv;
for all +ve corrections +(a,b), insert edge (a,b);
for all -ve corrections -(a,b), delete edge (a,b).
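The three reconstruction rules translate directly into code; a small sketch under the slide's notation (supernode sets Av, corrections of the form +(a,b) and -(a,b)):

```python
def reconstruct(summary_edges, supernode, corrections):
    """Rebuild the edge set of G from representation R = (S, C).
    summary_edges: superedges as pairs of supernode ids.
    supernode: dict supernode id -> set of original nodes (Av).
    corrections: list of ('+' or '-', a, b)."""
    g = set()
    for u, v in summary_edges:
        # A superedge (u, v) stands for all pairs in Au x Av.
        for a in supernode[u]:
            for b in supernode[v]:
                if a != b:
                    g.add(frozenset((a, b)))
    for sign, a, b in corrections:
        e = frozenset((a, b))
        if sign == '+':
            g.add(e)        # positive correction: insert edge
        else:
            g.discard(e)    # negative correction: delete edge
    return g
```

On the slide's example (X = {d,e,f,g}, Y = {a,b,c}, corrections +(a,h), +(c,i), +(c,j), -(a,d)) this yields the original 14 edges.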

Approximate Representation Rε

Recreating the input graph exactly is not always necessary;
a reasonable approximation is enough to compute communities,
anomalous traffic patterns, etc.
Use the approximation leeway to get further cost reduction.
Generic neighbor query:
given node v, find its neighbors Nv in G.
The approximate neighbor set N'v estimates Nv with ε-accuracy.
Bounded error: error(v) = |N'v - Nv| + |Nv - N'v| < ε |Nv|,
i.e. the number of neighbors added or deleted is at most an
ε-fraction of the true neighbors.
Intuition for computing Rε:
if correction (a,d) is deleted, it adds error for both a and d.
From the exact representation R for G, remove the maximum number of
corrections s.t. the ε-error guarantees still hold.

[Figure: X = {d,e,f,g}, Y = {a,b}, C = {-(a,d), -(a,f)};
for ε = .5, we can remove one correction of a]


Main Results: cost reduction

Reduces the cost to about 40% of the original.

The cost of GREEDY is 20%
lower than RANDOMIZED.

RANDOMIZED is 60%
faster than GREEDY.

Comparison with other schemes

Techniques give much
better compression


Approximate-Representation

The cost reduces linearly
as ε is increased;
with ε = .1, 10% cost
reduction over R.

Mining graph patterns efficiently via
randomized summaries
Motivation
In a graph with a large number of identically labeled vertices,
graph isomorphism becomes a bottleneck.
How to avoid enumerating identical patterns?

[Figure: 3 (triangle) × 4 (square) = 12 embeddings in total]


Solution framework

Summarization -> Mining -> Verification
(Raw DB -> Summarized DB -> Raw DB)

Reduce false positives
• Technique 1: merge vertices that are far away from each
other, measured by
•the length of the shortest path, or
•the probability of a random walk.
• Technique 2: merge vertices whose neighborhoods overlap.
•Cosine, Chi^2, Lift, Coherence
• Technique 3: go back to the raw database to do verification;
this guarantees that there are no false positives.

[Figure: summarization may cause false positives via false embeddings]


Summarization: Reduce false negatives

[Figure: summarization may miss embeddings, causing false negatives]

Technique 1: for a raw database with frequency threshold min_sup,
adopt a lower frequency threshold pseudo min_sup for the
summarized database.
Technique 2: iterate the mining steps T times and combine the
results generated each time.
It is NOT guaranteed that there are no false negatives, but the
probability is bounded.

Connectivity based traversal
CSV: Cohesive Subgraph Mining –SIGMOD 2008
(Discussed in detail in Part II)
Progressive Clustering of Networks using
Structure-Connected Order of Traversal –ICDE 2010


Progressive clustering of networks using structure-
connected order of traversal
SCAN Algorithm
•Similar to DBSCAN: connectivity-based.
•Average O(n) time.
•Uses a structural similarity measure, minimum cluster size mu, and
minimum similarity epsilon.
•Finds outliers and hubs.
Problems
•No automated way to find a good epsilon.
•Must rerun the algorithm for each candidate epsilon.
•Epsilon is a global threshold:
• no hierarchical clusters;
• no variation in cluster subtlety.

Solution

• Structure-Connected Order of Traversal (SCOT)
•Contains all possible epsilon-clusterings
• Efficient method to find global epsilon
• New Contiguous Subinterval Heap structure
(ContigHeap)
• New Progressive Mean Heap Clustering (ProClust)
•Epsilon-free
•Hierarchical
• Refinement by Gap Constraint (GapMerge)


[Figure: original network and its SCOT plot]

Optimal Global Epsilon

The SCAN paper only contains a supervised
sampling method:
sample points, find k-NN similarities, sort,
plot, and find the knee visually;
O(nd log n) time.

Our solution:
the knee hypothesis implies an approximately concave
plot, so the optimal epsilon minimizes the obtuse angle
between segments.
A modified histogram and binary search give O(n)
time.


ContigHeap

BuildContigHeap produces heap containing
all contiguous subintervals from SCOT
output in O(n) time, and integrates with
SCOT
Example:

GapMerge: Gap Constraint Refinement
Merges chained clusters (heap branches with single children).
Does not merge across pruned heap nodes (local maxima boundaries).
The gap constraint prevents clusters whose left or right boundaries differ by more
than mu from being merged;
such clusters are not redundant relative to the minimum interesting cluster size.
Steps
1. Identify chains that meet the gap constraint.
2. When a node has more than one child or violates the gap constraint, begin a new chain.
3. Within each chain, calculate the significance of each cluster in both the up and down
directions.
4. Beginning with the most redundant node, merge nodes in the direction of least significance.
5. After each merge, recalculate significances.
6. Continue until the chain contains one node, or no merging is possible under the gap constraint.


MapReduce based approach
PEGASUS: A Peta-Scale Graph Mining System –ICDM
2009
Pregel: a system for large-scale graph processing SIGMOD
2010

PEGASUS: A Peta-Scale
Graph Mining System
Deals with real graphs such as the Yahoo! web graph with up to 6.7
billion edges.
A Hadoop based graph mining package.
Targets primitive matrix operations such as matrix-vector
multiplication (GIM-V).


Motivation
Many Graph mining tasks require matrix
multiplication
PageRank,
Random Walk with Restart(RWR),
Diameter estimation, and
Connected components …
MapReduce provides a simplified programming
concept for large-scale data processing:
details of data distribution, replication, and load balancing are
taken care of,
and it provides a familiar programming structure, i.e. functional
programming.

GIM-V: Generalized Iterative Matrix-
Vector multiplication
Intuition: matrix-vector multiplication M ×G v = v', where
v'_i = Σ_{j=1..n} m_{i,j} × v_j,
expressed as three steps:
combine2(m_{i,j}, v_j): combine a matrix entry with a vector entry;
combineAll(x_1, ..., x_n): combine the partial results for row i;
assign(v_i, v'_i): decide the new value of v_i.
The operator ×G is matrix-vector multiplication expressed by these 3
steps; ×G is iteratively carried out until v converges.
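A minimal, single-machine sketch of the GIM-V operator may help. The real PEGASUS runs these steps as Hadoop MapReduce jobs; here `M` is just an in-memory sparse matrix (a dict of nonzero entries) and the three step functions are parameters.

```python
def gim_v(M, v, combine2, combine_all, assign, max_iter=100, tol=1e-9):
    """Generic GIM-V sketch:
    v'_i = assign(v_i, combineAll([combine2(m_ij, v_j) for each j])).
    M: dict (i, j) -> m_ij (sparse matrix); v: list of values."""
    n = len(v)
    for _ in range(max_iter):
        partial = [[] for _ in range(n)]
        for (i, j), m_ij in M.items():          # combine2 on nonzero entries
            partial[i].append(combine2(m_ij, v[j]))
        v_new = [assign(v[i], combine_all(partial[i])) for i in range(n)]
        if max(abs(a - b) for a, b in zip(v, v_new)) < tol:
            return v_new                        # converged
        v = v_new
    return v
```

Plain matrix-vector multiplication is the instance combine2(m, x) = m*x, combineAll = sum, assign(old, new) = new.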


×G and SQL
The matrix-vector multiplication operation can be expressed by
an SQL query. If the graph is viewed as two tables,
an edge table E(sid, did, val) and
a vector table V(id, val),
×G becomes:

SELECT E.sid, combineAll(combine2(E.val, V.val))
FROM E, V
WHERE E.did = V.id
GROUP BY E.sid

Generalize ×G

Vary the definitions of the three steps to generalize ×G.
Example (PageRank):
p = (cE^T + (1 - c)U)p,
where U is the matrix with all elements = 1/n
and c is the damping factor (0.85).


Generalize ×G
Vary the definitions of the three steps to obtain
PageRank via ×G:
p = (cE^T + (1 - c)U)p
combine2(m_{i,j}, v_j) = c × m_{i,j} × v_j
combineAll(x_1, ..., x_n) = (1 - c)/n + Σ_{j=1..n} x_j

Generalize (cont.)
By altering three functions, GIM-V adapts to
• Random Walk with Restart
• Diameter Estimation
• Connected Components
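As one worked instance, PageRank drops into the three-step form: combine2 scales each contribution by the damping factor c, and combineAll adds the (1 - c)/n teleport term. A single-machine sketch (it assumes every node has at least one out-edge, i.e. no dangling nodes):

```python
def pagerank_gimv(edges, n, c=0.85, iters=50):
    """PageRank expressed with GIM-V-style combine2/combineAll.
    edges: list of (src, dst); the matrix is the column-normalized E^T."""
    out_deg = [0] * n
    for s, _ in edges:
        out_deg[s] += 1
    combine2 = lambda m_ij, v_j: c * m_ij * v_j          # damped entry
    combine_all = lambda xs: (1.0 - c) / n + sum(xs)     # teleport + sum
    p = [1.0 / n] * n
    for _ in range(iters):
        partial = [[] for _ in range(n)]
        for s, d in edges:
            # m_{d,s} = 1 / out_deg(s) for edge s -> d
            partial[d].append(combine2(1.0 / out_deg[s], p[s]))
        p = [combine_all(partial[i]) for i in range(n)]
    return p
```

On a two-node cycle this converges to the uniform vector [0.5, 0.5], as expected by symmetry.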


GIM-V: How
Stage 1: combine2.
Map V with key = id and E with key = source id;
join on the key and apply combine2.
Stage 2: combineAll & assign.

Bottleneck: shuffling and disk I/O.

GIM-V Block Multiplication (BL)

Saves on sorting.
Data compression.
Clustered edges.


[Figure: clustered edges]

GIM-V DI: Diagonal Block Iteration

Intuition
Increase the number of multiplications
inside an iteration to
reduce the number of iterations.
How
Reach local convergence
within a block first before
iterating.                         [Figure: comparing GIM-V BL and DI]


Main Results
Scalability:
GIM-V BL-DI is ~5 times faster than GIM-V Base.

Main Results (cont.)

The distribution of connected
components is stable after a
‘gelling’ point in 2003.


Main Results (cont.)

Pregel: A System for Large-Scale
Graph Processing
A scalable and fault-tolerant platform with an API that is
sufficiently flexible to express arbitrary graph algorithms.
Model of computing:
Vertex centric, synchronized iterative model


Graph Algorithms Implementation in
Pregel
Graph data stay on their respective machines; only messages are passed,
NO graph state is shipped.

Pregel C++ API
• Compute() - executed at each active vertex in every
superstep. It can:
•query information about the current vertex and its edges;
•send messages to other vertices;
•inspect or modify the value associated with its vertex/out-
edges.
•State updates are visible immediately; there are no data races on
concurrent value access from different vertices.
• Limiting the graph state managed by the framework to a
single value per vertex or edge simplifies the main
computation cycle, graph distribution, and failure
recovery.
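The vertex-centric superstep loop can be illustrated with the classic maximum-value-propagation example. This is a toy synchronous sketch of the computing model, not Google's C++ API: no partitioning, combiners, or fault tolerance, and "vote to halt" is modeled by simply not sending messages.

```python
def pregel_max(init, adj):
    """Propagate the maximum vertex value, Pregel-style.
    init: dict vertex -> initial value; adj: dict vertex -> out-neighbors."""
    value = dict(init)
    # Superstep 0: every vertex sends its value along its out-edges.
    msgs = {v: [] for v in value}
    for v in value:
        for w in adj.get(v, ()):
            msgs[w].append(value[v])
    # Later supersteps: a vertex wakes up only when it has messages,
    # updates its value if a larger one arrived, and re-sends on change.
    while any(msgs.values()):
        nxt = {v: [] for v in value}
        for v, inbox in msgs.items():
            if not inbox:
                continue                     # halted this superstep
            best = max(inbox)
            if best > value[v]:              # state changed: stay active
                value[v] = best
                for w in adj.get(v, ()):
                    nxt[w].append(best)
            # else: vote to halt (no outgoing messages)
        msgs = nxt
    return value
```

Termination falls out of the model: the loop ends when no messages are in flight and every vertex has halted.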


Pregel C++ API (cont.)
• Message Passing
•No guaranteed order, but it will be delivered and no
duplication.
• Combiners
•Combine several messages to reduce overhead
• Aggregators
•Mechanism for global communication, monitoring, and data.
•A number of predefined aggregators, such as min, max, or
sum operations
• Topology mutation
•Change the graph topology; resolve conflicts when individual
vertices send conflicting messages.

Pregel C++ API (cont.)
• Input and output


Pregel implementation
• Designed for the Google cluster architecture
•Each cluster consists of thousands of commodity PCs.
• Persistent data
•Stored in files on distributed file systems such as GFS or
BigTable.
• Temporary data
•Stored as buffered messages on local disk.

• Divides graph vertices into partitions assigned to
different machines
•Controllable by users; default method: hash.
• In the absence of faults:
•One master, many workers on a cluster of machines.
• The master assigns partitions, coordinates I/O, and instructs
workers through the supersteps.
• Fault tolerance:
•Checkpointing; the master pings the workers.
•Confined recovery (in progress): workers log outgoing
messages.


Graph Applications
PageRank
Shortest Paths
Bipartite Matching
Semi-Clustering

Pregel: Main Result


Reference (partial)
Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. J.
Leskovec, J. Kleinberg, C. Faloutsos. KDD 2005.
Substructure Discovery in the SUBDUE System. L. B. Holder, D. J. Cook and S. Djoko.
PWKDD 1994.
Efficient Aggregation for Graph Summarization. Yuanyuan Tian, Richard A. Hankins, Jignesh M.
Patel. SIGMOD 2008.
Graph Summarization with Bounded Error. Saket Navlakha, Rajeev Rastogi, Nisheeth Shrivastava.
SIGMOD 2008.
Mining graph patterns efficiently via randomized summaries. Chen Chen, Cindy X. Lin, Matt
Fredrikson, Mihai Christodorescu, Xifeng Yan, Jiawei Han. VLDB 2009.
Progressive Clustering of Networks using Structure-Connected Order of Traversal. Dustin Bortner,
Jiawei Han. ICDE 2010.
PEGASUS: A Peta-Scale Graph Mining System. U. Kang, Charalampos E. Tsourakakis,
Christos Faloutsos. ICDM 2009.
Graph based induction as a unified learning framework. K. Yoshida, H. Motoda, and N. Indurkhya.
Applied Intelligence, volume 4, 1994.
Complete mining of frequent patterns from graphs: Mining graph data. A. Inokuchi, T. Washio, and
H. Motoda. Mach. Learn., 50(3):321-354, 2003.

Reference (cont.)

Frequent subgraph discovery. M. Kuramochi and G. Karypis. ICDM 2001, pages 313-320.
gSpan: Graph-based substructure pattern mining. X. Yan and J. Han. ICDM 2002.
WARMR: Discovery of frequent datalog patterns. L. Dehaspe and H. Toivonen. Data Mining and
Knowledge Discovery, 3:7-36, 1999.
FARMAR: Fast association rules for multiple relations. S. Nijssen and J. Kok. Data Mining and
Knowledge Discovery, 3:7-36, 1999.


Part 1 (1.5 hrs)
Graph Mining Primer
Recent advances in Massive Graph Mining
Part 2 (1.5 hrs)
CSV: cohesive subgraph mining
DN-graph mining: a triangle based approach

CSV

1. Cohesive sub-graph mining, with visualization
2. Existing approaches
3. CSV provides effective visual solution
– Algorithm principle
– Connectivity Estimation
4. Experimental Study


Existing solutions

1. Current state-of-the-art for abstracting information from huge
graphs:
1. Graph partition algorithms (information: yes; structure: no).
Spectral clustering [Ng01]: high computational cost.
METIS [Karypis96]: favors balanced partitions.
2. Graph pattern mining algorithms.
CODENSE [Hu05], CLAN [Zeng06]: exponential running time.
2. Graph layout tools (information: no; structure: yes):
Osprey [Breitkreutz03], VisANT [Mellor04]: no mining
capability.
We want structured information.

CSV: General Approach

• Separate the vertices of the graph into VISITED and UNVISITED.
• Start: pick a vertex and add it to VISITED.
• Repeat until UNVISITED is empty:
–among all vertices in UNVISITED, pick the vertex V
most highly connected to VISITED;
–plot V's connectivity.

But how do we measure connectivity?
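The outer loop reads naturally as a best-first traversal over a heap of unvisited neighbors. A sketch with the connectivity measure left as a pluggable parameter (the actual CSV estimate, based on approximate clique bounds, comes on the following slides); stale heap entries are simply skipped:

```python
import heapq

def csv_traversal(adj, connectivity):
    """CSV outer-loop sketch: repeatedly visit the unvisited vertex most
    highly connected to the VISITED set and record its connectivity.
    connectivity(v, visited) is a pluggable estimate."""
    start = next(iter(adj))
    visited, plot = set(), []
    heap = [(-connectivity(start, visited), start)]   # max-heap via negation
    while heap:
        neg_c, v = heapq.heappop(heap)
        if v in visited:
            continue                 # stale entry: a fresher one was pushed
        visited.add(v)
        plot.append((v, -neg_c))     # one point of the CSV plot
        for w in adj[v]:
            if w not in visited:
                heapq.heappush(heap, (-connectivity(w, visited), w))
    return plot
```

Peaks in the returned `plot` sequence are what the slides interpret as cohesive sub-graphs.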


Connectivity measurement

Connectivity measurement is closely related to clique (fully connected sub-
graph) size.

The connectivity between two vertices in
a graph (ηmax) is defined to be the size of the
biggest clique in the graph such that
both are members of the clique.
The “connectivity” of a vertex
(ζmax) is similarly defined
as the size of the biggest clique it
can participate in.

[Figure: example graphs on vertices a..e with
ηmax(a, d) = 0, ηmax(a, c) = 4, ζmax(a) = 5]

CSV: Step by Step

[Figure: from graph to plot; a heap orders the unvisited neighbors,
and each visited vertex contributes one point to the connectivity plot]

Start from A, explore A's neighbor B.
Calculate ζmax(A) = 2 and output it.


CSV algorithm on a synthetic graph

[Figure: from graph to plot; the heap now holds C, F, H]

Mark A visited; from B, explore B's
immediate neighbors C, F, H.
Calculate ηmax(AB) = 2 and output it.

CSV algorithm on a synthetic graph

[Figure: from graph to plot; the heap now holds F, H, C, G, D]

Mark B visited; choose the closely
connected C as the next visiting vertex. From
C, explore C's immediate neighbors D, F, G, H,
and update ηmax where necessary.
Calculate ηmax(BC) = 4 and output it.


CSV algorithm on a synthetic graph

[Figure: from graph to plot; full traversal order A B C H F G D E I J]

Visit every vertex accordingly to produce a
plot.

Peaks represent cohesive sub-graphs.

Important Theorem


Connectivity computation is a hard
problem

However, when graphs are huge and massive, exact computation of
connectivity is prohibitive: direct computation is costly.

•The exact algorithm relies on
clique detection (NP-hard).
•Even approximation is hard.
•Solution Part 1: Spatial
Mapping
•Pick k pivots.
•Map the graph into k-
dimensional space based on
each vertex's shortest distance to the
pivots.
•A clique will map into the
same grid cell.

[Figure: vertices A..J mapped onto a grid by their distances to
pivots P0 = A and P1 = I]


Connectivity computation

•Solution Part 2: approximate
upper bounds for ζmax(v) and
ηmax(v, v').
•Each vertex in a clique of size k
must have
•degree >= k-1, and
•k-1 neighbors with degree >= k-1.

Example: estimate ηmax(a, f).
•For each vertex v, find its immediate
neighbors in the same grid cell and
construct a sub-graph.
Locate the immediate neighborhood of a
and f, {a, b, c, d, e, f, g}. After sorting the
degree array in descending order, we have
6(a), 6(f), 5(d), 4(b), 4(c), 4(e), 3(g).
•Iteratively readjust the estimate of the
clique size: = 7? = 6? = 5?
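The degree-based readjustment is easy to state in code: a k-clique needs k vertices whose degree (within the candidate sub-graph) is at least k-1, so scan the sorted degree array downward until that holds. On the slide's example array 6, 6, 5, 4, 4, 4, 3 this settles at 5.

```python
def clique_upper_bound(degrees):
    """Upper bound on the largest clique in a sub-graph from its degree
    sequence alone: a k-clique needs k vertices of degree >= k-1."""
    d = sorted(degrees, reverse=True)
    k = len(d)
    while k > 0 and d[k - 1] < k - 1:
        k -= 1          # iteratively readjust the estimate downward
    return k
```

This is only an upper bound, which is exactly what CSV needs to prune the expensive exact clique computations.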

Experimental study on real datasets
DBLP: co-authorship graphs
(v = 2819, e = 54990).

[Figure: two groups of German researchers]

Peaks in the DBLP CSV plot represent different research groups.

SMD: Stock Market Data

[Figure: CSV plot with a bridging vertex between two partial cliques]

Peaks in the SMD CSV plot
represent highly cohesive
stocks.

DIP: Database of interacting proteins

[Figure: cohesive protein sub-graphs found by CSV, involving
LSM2-LSM8, PAT1, PRP4, PRP6, PRP8, SMD3, PFS2, RNA14, FIP1,
DCP1, LUC7, REF2, SMX2, SNP1, CFT2, STO1, MPE1, NAM8, GLC7,
PAP1, SNU71, PRP31, PTA1, YHC1, YSH1, PRP40, YTH1, PTI1,
MUD1, SNU56, and others]

Structure of a nucleotide-bound Clp1-Pcf11.
Christian G. Noble, Barbara Beuth, and Ian
A. Taylor. Nucleic Acids Res. 2007 January;
35(1): 87-99.

“CPF is also required in both the cleavage
and polyadenylation reactions. It contains a
core of eight subunits Cft1, Cft2, Ysh1, Pta1,
Mpe1, Pfs2, Fip1 and Yth1”


Experimental Study

CSV as a pre-selection step
How?
•Apply CSV to identify potential
cohesive sub-graphs first.
•Run the exact algorithm CLAN on
these candidates.
Result
•Gets the same exact cohesive sub-graphs as
running CLAN alone.
•Saves 28-84% of the time compared
to running CLAN alone.

[Figure: CSV as a pre-selection method]

DN-graph mining: A triangle based approach

• Mining dense patterns out of an extremely large graph:
•when the graph is extremely large, it is difficult even to mine
dense patterns.
• An iterative-improvement mining approach is more desirable:
•users are able to obtain the most up-to-date results on demand.
• Dense patterns have a strong connection with the triangles inside a
graph.
• This has already been observed and explained by the preferential
attachment property of large-scale graphs.


DN-graph mining: A triangle based approach

• What makes a pattern dense? Intuitively,
•a collection of vertices with high relevance:
•they share a large number of common neighbors.

• With that we propose the definition of the
DN-graph:
•a DN-graph is the largest sub-graph sharing
the most neighbors;
•it requires each connected vertex pair to share at
least λ neighbors.

[Figure: example graph G with λ(G) = 3 and λ(GA') = 0]

Compare DN-graph with other dense pattern
definitions
• Two interesting patterns:
• a 4-clique and a Turan graph T(14, 4) [14 vertices, 4 groups, fully
connected between groups].
• Mining quasi-cliques may end up discovering only 1 pattern, as in
(d).
• Searching for closed cliques may only find (e).


DN-graph mining: challenge

• Finding the common neighbors of every connected vertex pair is
expensive:
•it requires O(E) join operations;
•it needs random disk access.
•In fact, finding a DN-graph is an NP-hard problem.
• Solution
•Use the triangles two vertices participate in to approximate
their common-neighbor count.
•Iteratively refine the approximation following the locality of the graph's edges.

DN-graph mining: How

1. Initially: count the number of triangles each edge participates in.
•Sort the vertices and their neighbors in descending order of degree.
•Scan the graph to get the triangle count for every vertex.
•The triangle counts set the initial values of λ.
2. Next, iteratively refine λ for every vertex:
•using streams of triangles,
•iteratively refine λcur.


Triangle Counting: how?
1. Sort the vertices and their neighbors in descending order of
their degrees.

[Figure: adjacency lists before and after sorting by degree:
a: b,d,e        becomes   e: d,b,a,c,g,f
b: a,c,d,e                d: e,a,c,g,h
c: b,d,e                  b: e,d,a,c
d: a,c,e,g,h              a: e,d,b
e: a,b,c,d,g,f            c: e,d,b
f: e,g                    g: e,d,f
g: d,e,f                  f: e,g
h: d                      h: d]

Triangle counting (cont.)
1. Sort the vertices and their neighbors in
descending order of their degrees.
2. Join the neighborhoods of an edge's endpoints to get the
triangle count for every edge.
• The two vertices exhibit locality, due to the
reordering and the preferential attachment
property of large graphs.

[Figure: sorted adjacency lists; joining the lists of e and d
gives 3 triangles on edge (e, d)]


Triangle counting (cont.)
1. Sort the vertices and their neighbors in
descending order of their degrees.
2. Join the neighborhoods for the triangle count of
every edge.
3. Use that as the initial λ value for every
edge/vertex.
• A vertex's λ value is the maximal λ value of the edges
it participates in.
•e.g. λcur(e) = 3.

[Table: vertex -> λcur, e.g. e: 3, d: 3, ...]
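Steps 1-3 amount to intersecting neighbor lists per edge and taking per-vertex maxima. A plain-Python sketch (it omits the degree reordering and triangle streaming that the actual algorithm uses for locality):

```python
def edge_triangles(adj):
    """For every undirected edge, count the triangles it participates in
    by intersecting the endpoints' neighbor sets (the join step)."""
    nbr = {v: set(ws) for v, ws in adj.items()}
    counts = {}
    for v in nbr:
        for w in nbr[v]:
            if v < w:                      # visit each undirected edge once
                counts[(v, w)] = len(nbr[v] & nbr[w])
    return counts

def vertex_lambda(counts):
    """Initial lambda of a vertex: the maximal triangle count over the
    edges it participates in (the starting value of lambda_cur)."""
    lam = {}
    for (v, w), c in counts.items():
        lam[v] = max(lam.get(v, 0), c)
        lam[w] = max(lam.get(w, 0), c)
    return lam
```

On a 4-clique every edge lies in 2 triangles, so every vertex starts with λcur = 2.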

DNgraph mining: How (cont.)

• Initially: count # triangles each edge participates.
• Next, Iteratively refine λ for every vertex
•Using streams of triangles.
•Iterative refine λcur.


Triangle stream

•Follow the same order of visiting the graph as during triangle
counting.
•Triangles are not materialized, saving storage.

[Figure: triangles (a, b, n1), (a, b, n2), ..., (a, b, nx) streamed
past edge (a, b) with λ = k]

Iteratively refine λ

•Follow the same order of visiting the graph as during triangle
counting.
•Triangles are not materialized, saving storage.
•For every vertex v, when its triangles arrive, bound λcur(v)
using the λcur values of the two other vertices.


Iteratively refine λ (cont.)

• Initially: count the number of triangles each edge participates in.
• Next, iteratively refine λ for every vertex
  • using streams of triangles
  • iteratively refining λcur
• Until all vertices' λcur values have converged

[Figure: example graph on vertices a–h with the refined vertex λcur table, e.g. λcur(e) = 3, λcur(b) = 3]
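A simplified sketch of the refinement loop follows. The bounding rule used here, capping λcur(v) at the best min(λcur(u), λcur(w)) over v's triangles, is an assumption for illustration; the actual DN-graph rule is given in [DNgraph10]. Since values only decrease, the loop must terminate:

```python
def triangles(adj):
    """Enumerate each triangle once per edge that closes it (helper)."""
    for a in adj:
        for b in adj[a]:
            if a < b:  # visit each undirected edge once
                for n in adj[a] & adj[b]:
                    yield a, b, n

def refine_until_converged(adj, vertex_lambda):
    """Replay the triangle stream each pass, bounding every endpoint's
    lambda_cur by the other two endpoints' values, until stable."""
    lam = dict(vertex_lambda)
    iterations = 0
    while True:
        iterations += 1
        bound = {v: 0 for v in lam}
        for a, b, n in triangles(adj):
            # each triangle lets an endpoint claim at most the smaller
            # lambda_cur of the other two endpoints
            bound[a] = max(bound[a], min(lam[b], lam[n]))
            bound[b] = max(bound[b], min(lam[a], lam[n]))
            bound[n] = max(bound[n], min(lam[a], lam[b]))
        new_lam = {v: min(lam[v], bound[v]) for v in lam}
        if new_lam == lam:  # all vertices' lambda_cur have converged
            return lam, iterations
        lam = new_lam
```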

DNgraph: Experiment
• Large-scale graph
  • Flickr dataset with 1,715,255 vertices and 22,613,982 edges
  • One iteration takes about 1 hour on a workstation with a Quad-Core AMD Opteron 8356 processor, 128GB RAM and a 700GB hard disk
  • Converges in 66 iterations; almost stable after 35 iterations


• Abstraction within the triangulation algorithm
  • The abstraction ensures our approach's extensibility to different input settings.
• Iteratively refined results
  • The estimation of common neighborhoods improves with every iteration, so users can obtain the most up-to-date results on demand.
• Pre-collection of statistics to support effective buffer management
• The process can easily be mapped to key→value pairs for further distributed processing.
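The last point can be illustrated with a MapReduce-style sketch, where the neighborhood join emits one key→value pair per triangle per edge. The names `map_phase` and `reduce_phase` are hypothetical; this is not the report's actual distributed design:

```python
from collections import Counter

def map_phase(adj):
    """Emit (edge, 1) for every triangle the edge participates in."""
    for a in adj:
        for b in adj[a]:
            if a < b:  # emit each undirected edge once
                for n in adj[a] & adj[b]:
                    yield frozenset((a, b)), 1

def reduce_phase(pairs):
    """Sum the values per key to get the per-edge triangle support."""
    support = Counter()
    for key, value in pairs:
        support[key] += value
    return support
```

Keying by edge lets the reduce phase run independently per edge, which is what makes the step easy to distribute.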

Reference (partial)
[Hu05] H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou. Mining coherent dense subgraphs across
massive biological networks for functional discovery. Bioinformatics, 21(1):213--221, 2005.
[Ng01] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm.
Advances in Neural Information Processing Systems, volume 14, 2001.
[Karypis96] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular
graphs. Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on
Supercomputing (CDROM), page 35, Washington, DC, USA, 1996. IEEE Computer Society.
[Breitkreutz03] B. J. Breitkreutz, C. Stark, and M. Tyers. Osprey: a network visualization system.
Genome Biology, 4, 2003.
[Mellor04] Z. Hu, J. Mellor, J. Wu, and C. DeLisi. An online visualization and analysis tool for
biological interaction data. BMC Bioinformatics, 5:17--24, 2004.
[Zeng06] J. Wang, Z. Zeng, and L. Zhou. CLAN: An algorithm for mining closed cliques from large
dense graph databases. Proceedings of the International Conference on Data Engineering,
page 73, 2006.
[Turan41] P. Turan. On an extremal problem in graph theory. Mat. Fiz. Lapok, 48:436--452, 1941.
[Ankerst99] M. Ankerst, M. Breunig, H. P. Kriegel, and J. Sander. OPTICS: Ordering points to
identify the clustering structure. Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data
(SIGMOD'99), pages 49--60, Philadelphia, PA, June 1999.
[DNgraph10] On Triangle-based DNgraph Mining. NUS Technical Report TRB4/10.


```