VLDB Database School (China) 2010

August 3-7, 2010, Shenyang









Lecture Notes

Part 1





Mining and Searching Complex

Structures







Anthony K.H. Tung(邓锦浩)

School of Computing

National University of Singapore

www.comp.nus.edu.sg/~atung

Mining and Searching Complex Structures







Contents





Chapter 1: Introduction
Chapter 2: High Dimensional Data
Chapter 3: Similarity Search on Sequences
Chapter 4: Similarity Search on Trees
Chapter 5: Graph Similarity Search
Chapter 6: Massive Graph Mining

Mining and Searching Complex Structures Chapter 1 Introduction









Mining and Searching Complex

Structures

Introduction

Anthony K. H. Tung(鄧锦浩)

School of Computing

National University of Singapore

www.comp.nus.edu.sg/~atung









Research Group Link: http://nusdm.comp.nus.edu.sg/index.html

Social Network Link: http://www.renren.com/profile.do?id=313870900









What is data mining?

Really nothing different from what scientists had been doing for

[Figure labels: Real World; Collect data and verify or construct model of real world; Correct, useful model; Nobel Prize; Feed in data; Generate data; Output most likely model based on some statistical measure]

What’s new?

Systematically and efficiently test many statistical models










Components of data mining

• Structure of model
–geneA=high and geneB=low ===> cancer
–geneA, geneB and geneC exhibit strong correlation
• Statistical score for the model
–Accuracy of rule 1 is 90%
• Similarity function: is there a sufficiently similar group of records that supports a certain model or hypothesis?
• Search method for the correct model parameters
–Given 200 genes, there could be 2^200 rules. Which rule gives the best prediction power?
• Database access method
–Given 1 million records, how do we quickly find the relevant records to compute the accuracy of a rule?









The Apriori Algorithm



• Bottom-up, breadth-first search
• Only reads are performed on the database
• Store candidates in memory to simulate the lattice search
• Iteratively follow the two steps:
–generate candidates
–count and get actual frequent items

Itemset lattice (searched bottom-up from the empty set):

a,b,c,e
a,b,c   a,b,e   a,c,e   b,c,e
a,b   a,c   a,e   b,c   b,e   c,e
a   b   c   e
start {}
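The two-step loop above (generate candidates, then count them against the database) can be sketched as follows. This is a minimal illustration rather than the lecture's own code: the function name `apriori`, the transaction database `db` and the support threshold are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Counting step: one read-only pass over the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets
        # and keep only those whose k-subsets are all frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical transaction database, not taken from the slides.
db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = apriori(db, min_support=2)
for itemset, support in sorted(result.items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Candidates are kept in memory as sets, which plays the role of the lattice levels; the database itself is only scanned, never modified.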










The K-Means Clustering Method



• Given k, the k-means algorithm is implemented in 4 steps:
–Partition objects into k nonempty subsets
–Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
–Assign each object to the cluster with the nearest seed point.
–Go back to Step 2; stop when no assignment changes.
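The four steps can be written down almost literally. The sketch below is an assumption-laden illustration: the round-robin initial partition, the function name `kmeans` and the sample points are choices made here, not part of the slides.

```python
def kmeans(points, k, max_iter=100):
    """Return (labels, centroids) following the four steps listed above."""
    # Step 1: partition objects into k nonempty subsets (round-robin here,
    # an arbitrary choice made for this sketch; needs len(points) >= k).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical 2-D points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

The result depends on the initial partition; a different starting partition may converge to a different local optimum of the squared error.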


















The K-Means Clustering Method

• Example

[Figure: four scatter-plot panels, each with x and y axes running from 0 to 10]









Training Dataset (Decision Tree)

Day  Outlook   Temp  Humid   Wind    PlayTennis
D1   Sunny     Hot   High    Weak    No
D2   Sunny     Hot   High    Strong  No
D3   Overcast  Hot   High    Weak    Yes
D4   Rain      Mild  High    Weak    Yes
D5   Rain      Cool  Normal  Weak    Yes
D6   Rain      Cool  Normal  Strong  No
D7   Overcast  Cool  Normal  Strong  Yes
D8   Sunny     Mild  High    Weak    No
D9   Sunny     Cool  Normal  Weak    Yes
D10  Rain      Mild  Normal  Weak    Yes
D11  Sunny     Mild  Normal  Strong  Yes
D12  Overcast  Mild  High    Strong  Yes
D13  Overcast  Hot   Normal  Weak    Yes
D14  Rain      Mild  High    Strong  No









Selecting the Next Attribute

Split on Humidity: S=[9+,5-], E=0.940
  High:   [3+, 4-], E=0.985
  Normal: [6+, 1-], E=0.592
  Gain(S,Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151

Split on Wind: S=[9+,5-], E=0.940
  Weak:   [6+, 2-], E=0.811
  Strong: [3+, 3-], E=1.0
  Gain(S,Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048


















Selecting the Next Attribute

Split on Outlook: S=[9+,5-], E=0.940
  Sunny:    [2+, 3-], E=0.971
  Overcast: [4+, 0-], E=0.0
  Rain:     [3+, 2-], E=0.971
  Gain(S,Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
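The entropy and gain figures for Humidity, Wind and Outlook can be recomputed directly from the 14 training rows. The sketch below is an illustration only; the names `ROWS`, `ATTRS`, `entropy` and `gain` are chosen here, and the rows are transcribed from the PlayTennis training table.

```python
from collections import Counter
from math import log2

# (Outlook, Temp, Humidity, Wind, PlayTennis), rows D1..D14 of the training set.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}


def entropy(rows):
    """Entropy of the class label (last column) over the given rows."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())


def gain(rows, attr):
    """Information gain of splitting the rows on the named attribute."""
    i = ATTRS[attr]
    remainder = 0.0
    for value in {r[i] for r in rows}:
        subset = [r for r in rows if r[i] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


gains = {a: gain(ROWS, a) for a in ATTRS}
best = max(gains, key=gains.get)
for a, g in gains.items():
    print(f"Gain(S,{a}) = {g:.3f}")
print("best attribute:", best)
```

Small differences in the third decimal place against the slide values come from the slides rounding the intermediate entropies before combining them.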









ID3 Algorithm

Outlook: [D1,D2,…,D14], [9+,5-]
  Sunny:    Ssunny=[D1,D2,D8,D9,D11], [2+,3-], ?
  Overcast: [D3,D7,D12,D13], [4+,0-], Yes
  Rain:     [D4,D5,D6,D10,D14], [3+,2-], ?

Gain(Ssunny, Humidity) = 0.970 - (3/5)0.0 - (2/5)0.0 = 0.970
Gain(Ssunny, Temp.) = 0.970 - (2/5)0.0 - (2/5)1.0 - (1/5)0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)1.0 - (3/5)0.918 = 0.019


















Decision Tree for PlayTennis



Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes









Can we fit what we learn into the framework?



(columns: Apriori | K-means | ID3)

task: rule pattern discovery | clustering | classification
structure of the model or pattern: association rules | clusters | decision tree
search space: lattice of all possible combinations of items, size = 2^m | choice of any k points as centers, size = infinity | all possible combinations of decision trees, size = potentially infinity
score function: support, confidence | square error | accuracy, information gain
search/optimization method: breadth first with pruning | gradient descent | greedy
data management technique: TBD | TBD | TBD


















Components of data mining(II)







Models Enumeration

Algorithm





Statistical Score Function

Similarity/Search Function

Database Access Method







Database









Background knowledge

• We assume you have some basic knowledge of data mining; the slides below will be very useful for this purpose

• Association Rule Mining

http://www.comp.nus.edu.sg/~atung/Renmin56.pdf

• Classification and Regression

http://www.comp.nus.edu.sg/~atung/Renmin67.pdf

• Clustering

http://www.comp.nus.edu.sg/~atung/Renmin78.pdf


















IT Trend

Processors are cheap and will become cheaper (multi-core processors, graphics cards)
Storage will be cheap but might not be fast
Bandwidth will keep growing
What can we do with this?
Play more realistic games!
Not exactly a joke, since any technology that speeds up games can speed up scientific simulation
Smarter (more intensive) computation
Can store more personal semantics/ontology
People can collaborate more over the Internet (Flickr, Wikipedia) to make things more intelligent
The AI dream now has the support of much better hardware
Essentially, data mining can be made much simpler for the man on the street
Data mining should be human-centered, not machine-centered

2010-7-31

15









What is complex data?

What is “simple” data?
Regular tabular table, with small number of attributes (of the same type), no missing values.

What are complex data?

Test1   Gene1   Progress comments
Pos     2.0     Fever ……
Neg     -0.3    Unconscious
N.A     5.7



High dimensional data: Lots of attributes with different data types with missing values









Sequences/ time series Trees Graphs


















Why complex data?

They come naturally in many applications. Bring research nearer to the real world
Lots of challenges, which means more fun!
Some fundamental challenges:
How do you compare complex objects effectively and efficiently?
How do you find special subsets in the data that are interesting?
What new types of models and score functions must you use?
How do you handle noise and error?

[Figure: the example table (Test1, Gene1, Progress comments) with missing values, and two labeled trees T1 and T2 over the node labels a, b, c, d, e]









Personalized Semantic for Personal Data Management

everyone will own terabytes of data soon

improve query/search interface by mining and extracting personalized semantics like

entities and their relationship etc. by comparing them against high quality tagged databases









[Figure: query interfaces (query by documents, query by audio/music, query by video, query by photographs/images) sit on top of a data semantic layer (singers, authors, actor/actress, songs, papers, places, movies). The layer is built from high quality data sources such as Wikipedia and from personal data (documents, audio/music, video, photographs/images, webpages/blogs/bookmarks)]


















Integrated Approach to Mining Software Engineering Data

Software engineering data (code base, change history, bug reports, runtime traces) is integrated into a data warehouse to support decision making and mining
Example: Which code module should I modify to create a new function? Which module needs maintenance?





[Figure: layered architecture. Top: software engineering tasks helped by data mining (programming, defect detection, testing, debugging, maintenance, …). Middle: mining techniques (association/patterns, classification, clustering, …) over a Data Warehouse. Bottom: software engineering data (code bases, change history, program states, structural entities, bug reports/nl, …)]









WikiScience

Web 2.0: Facebook for scientists

Collaborative platform for scientists to build scientific models/hypotheses and share data and applications

[Figure: Model A and Model B feed a centralized hybrid Model C constructed by the system. Annotation on Model A: “This is my model of the solar system based on my supporting dataset” (supporting dataset tagged to Model A). Annotation on Model B: “Based on some articles, I make some changes to Model A to create Model B” (supporting articles tagged to Model B)]


















Hey, why not Cloud Computing, Map/Reduce?

• These are platforms for scaling up services to a large number of users on a large amount of data
• But what exactly do you want to scale up?
• Services that provide useful and semantically correct information to the users
• We have too many scalable data mining algorithms that find nothing or too many things
• Let’s focus on finding useful things first (assuming we have lots of processing power) and then try to scale it up









Schedule of the Course

Date/Time Content

Lesson 1 Introduction

Lesson 2 Mining and Search High Dimensional Data I

Lesson 3 Mining and Search High Dimensional Data II

Lesson 4 Mining and Search High Dimensional Data III

Lesson 5 Similarity Search for Sequences and Trees I

Lesson 6 Similarity Search for Sequences and Trees III

Lesson 7 Similarity Search for Graph I

Lesson 8 Similarity Search for Graph II

Lesson 9 Similarity Search for Graph III

Lesson 10 Mining Massive Graph I

Lesson 11 Mining Massive Graph II

Lesson 12 Mining Massive Graph III


















Focus of the course

• Techniques that can handle high dimensional, complex structures
–Providing semantics to similarity search
–Shotgun and Assembly: Column/Feature Wise Processing using Inverted Index
–Row-wise Enumeration
–Using local properties to infer global properties
• Throughout the course, please try to think of how these techniques are applicable across different types of complex structures









Databases Queries



To start off, we will consider something very basic called ranking queries, since we need ranking in any similarity search (usually from most similar to most dissimilar)

In relational database, SQL returns all results at one go

How many tuples can be fitted in one screen?

How many tuples can you remember?

Options:

Summarize the results

Display representative tuples

How to select representative tuples?


















Retrieve Relevant Information



Search videos related to Shanghai Expo

Too many results: as long as you click “next”, there are 20

more new results

Are we interested in all results?

No, only most relevant ones

Search engines have to rank the results, which is also what they make money from









Question: How to Select a Small Result Set



Selecting the most representative or most interesting results

is not trivial

Find an apartment with rental cheaper than 1000, the

cheaper the better

The result tuples can be sorted in the ascending order of rental prices,

those in front are more favorable

Find an apartment with rental cheaper than 1000 near NEU,

the lower the better, the nearer the better

Apartment with lower rent may not be near, nearer one may not be

cheap

Order by prices? Order by distances?


















Top-k Queries

Define a scoring function, which maps a tuple to a real

number, as a score

The higher the score is, the more favorable the tuple is

Define an integer k

Answer: k objects with highest scores

Different scoring functions may give different top-k results

             Price   Distance to NEU
Apartment A  $800    500 meters
Apartment B  $1200   200 meters

Given k = 1, with smaller price and distance preferred: if the score function is based on the sum of price and distance, the first tuple is better (1300 vs. 1400); if it is based on the product, the second tuple is better (240,000 vs. 400,000)
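The apartment example can be checked with a few lines of brute-force top-k code. This is only a sketch: the names `top_k`, `score_sum` and `score_product` are invented here, and the scores are negated on the assumption that cheaper and nearer are better while a higher score should still mean a more favorable tuple.

```python
# (price in dollars, distance to NEU in meters), as in the table above.
apartments = {"A": (800, 500), "B": (1200, 200)}


def score_sum(price, distance):
    # Negated so that a higher score means a more favorable apartment.
    return -(price + distance)


def score_product(price, distance):
    return -(price * distance)


def top_k(items, score, k):
    """Brute force: score every tuple, sort by descending score, keep k."""
    ranked = sorted(items, key=lambda name: score(*items[name]), reverse=True)
    return ranked[:k]


print(top_k(apartments, score_sum, 1))      # sum-based score
print(top_k(apartments, score_product, 1))  # product-based score
```

Scoring and sorting every tuple is exactly the brute-force approach; its cost is what the algorithms later in the chapter try to avoid.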









Brute Force Top-k



Compute scores for each result tuple

Sort the tuples according to the descending order of the

scores

Select the first k tuples

What if the number of tuples is unlimited? Search engines

can give unlimited number of results

Even if the number of tuples is limited, it is too slow to

compute score for each tuple

We have to do it efficiently


















Outline



Two well-known top-k algorithms

Fagin's Algorithm (FA)

The Threshold Algorithm (TA)

Take random access into consideration

No Random Access Algorithm (NRA)

The Combined Algorithm (CA)









Monotonicity



A score function f is monotone if f(x1,x2,...,xm)≤f(y1,y2,...,ym)

whenever xi≤yi for every i

Select top-3 students with highest total score in mathematics,

physics and computer science:





select name, math+phys+comp as score

from student

order by score desc limit 3



sum(x.math,x.phys,x.comp)≤sum(y.math,y.phys,y.comp) if

x.math≤y.math and x.phys≤y.phys and x.comp≤y.comp
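As a quick sanity check of the definition, monotonicity can be tested exhaustively on a small grid of values. This is only an illustrative sketch: `is_monotone`, `total` and `math_minus_phys` are names invented here, and passing the check on a finite grid is evidence, not a proof.

```python
from itertools import product


def is_monotone(f, m, values):
    """Check f(x) <= f(y) for all grid points with x_i <= y_i for every i."""
    grid = list(product(values, repeat=m))
    return all(
        f(*x) <= f(*y)
        for x in grid
        for y in grid
        if all(a <= b for a, b in zip(x, y))
    )


def total(math, phys, comp):
    # The score used in the SQL query above.
    return math + phys + comp


def math_minus_phys(math, phys, comp):
    # Hypothetical non-monotone score: a higher physics mark lowers it.
    return math - phys


print(is_monotone(total, 3, [0, 50, 100]))
print(is_monotone(math_minus_phys, 3, [0, 50, 100]))
```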


















Sorted Lists



We shall think of a database consisting of m sorted lists L1,

L2, … Lm



Lmath Lphys Lcomp

Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...


















Fagin's Algorithm (I)



Do sequential access until there are at least k matches



Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...





Sequential accesses are stopped when 3 students are seen, i.e.

Ann, Hugh and Kurt









Fagin's Algorithm (II)



For each object that has been seen, do random accesses on

other lists to compute its score



Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...





Random accesses need to be done for Ben, Carl, Jane and

Ryan


















Fagin's Algorithm (III)



Select the k objects with highest score as top-k result





Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...
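The three phases of FA can be sketched as below. The slides elide the tails of the lists, so the marks not shown there (for example Ben's physics mark) are made-up values chosen to be consistent with the visible rows; `SCORES`, `build_sorted_lists` and `fagin_top_k` are names introduced for this illustration.

```python
# (math, phys, comp); values not visible on the slides are hypothetical.
SCORES = {
    "Ann": (98, 92, 95), "Hugh": (91, 97, 92), "Kurt": (93, 91, 96),
    "Ben": (96, 85, 93), "Ryan": (88, 94, 90), "Jane": (86, 89, 95),
    "Carl": (90, 84, 88),
}


def build_sorted_lists(scores):
    """One list per attribute, sorted by value in descending order."""
    m = len(next(iter(scores.values())))
    return [
        sorted(((name, vals[i]) for name, vals in scores.items()),
               key=lambda entry: -entry[1])
        for i in range(m)
    ]


def fagin_top_k(lists, k):
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    seen_in = {}                            # object -> lists it was seen in
    depth = 0
    # Phase 1: sequential access until k objects are seen on all lists.
    while depth < len(lists[0]):
        for i, lst in enumerate(lists):
            seen_in.setdefault(lst[depth][0], set()).add(i)
        depth += 1
        if sum(len(s) == m for s in seen_in.values()) >= k:
            break
    # Phase 2: random access to complete the score of every seen object
    # (for simplicity every attribute is looked up, not just missing ones).
    scores = {obj: sum(lookup[i][obj] for i in range(m)) for obj in seen_in}
    # Phase 3: the k highest scores are the answer.
    top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return top, depth, sorted(seen_in)


lists = build_sorted_lists(SCORES)
top, depth, seen = fagin_top_k(lists, k=3)
print(top, "stopped at depth", depth, "seen:", seen)
```

With these lists the sequential phase stops after five rows, and Ben, Carl, Jane and Ryan are the objects whose scores have to be completed by random access.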









Why is FA correct? (I)



There are at least k objects seen on all attributes when

sequential access is stopped

By monotonicity, those objects that are not seen do not have

higher score than the above k objects



Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...


















Why is FA correct? (II)



For the objects that have been seen, either all attributes have been seen, or random accesses are performed to obtain the remaining attributes
The k objects with highest scores are therefore the top-k result



Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...


















The Threshold Algorithm (I)



Do sequential access on all lists. If an object is seen, do

random access to the other lists to compute its score



Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...





Random accesses on Ann, Hugh and Kurt first, then on Ben

and Ryan









The Threshold Algorithm (II)



Remember the k objects with highest scores, together with

their scores

Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...



Score (Ann) = 285

Score (Hugh) = 280

Score (Kurt) = 280


















The Threshold Algorithm (III)



• Let the threshold value τ be the score function applied to the last seen values on all sorted lists
• Halt as soon as at least k objects have been seen with score at least τ

Ann 98 Hugh 97 Kurt 96    τ(1) = 291
Ben 96 Ryan 94 Ann 95     τ(2) = 285
Kurt 93 Ann 92 Jane 95    τ(3) = 280

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...
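The halting rule of TA can be sketched as follows. The marks not visible on the slides are hypothetical fill-ins, and `SCORES`, `build_sorted_lists` and `threshold_top_k` are names introduced here; the sketch also keeps all computed scores in a dictionary for simplicity, whereas TA only needs to remember the current best k.

```python
import heapq

# (math, phys, comp); values not visible on the slides are hypothetical.
SCORES = {
    "Ann": (98, 92, 95), "Hugh": (91, 97, 92), "Kurt": (93, 91, 96),
    "Ben": (96, 85, 93), "Ryan": (88, 94, 90), "Jane": (86, 89, 95),
    "Carl": (90, 84, 88),
}


def build_sorted_lists(scores):
    m = len(next(iter(scores.values())))
    return [
        sorted(((name, vals[i]) for name, vals in scores.items()),
               key=lambda entry: -entry[1])
        for i in range(m)
    ]


def threshold_top_k(lists, k):
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    scores = {}
    top, tau = [], None
    for depth in range(len(lists[0])):
        last = []
        for i, lst in enumerate(lists):
            obj, val = lst[depth]           # sequential access
            last.append(val)
            if obj not in scores:           # random access to the other lists
                scores[obj] = sum(lookup[j][obj] for j in range(m))
        tau = sum(last)                     # score function on last seen values
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= tau:
            return top, depth + 1, tau      # k objects with score >= tau
    return top, len(lists[0]), tau


lists = build_sorted_lists(SCORES)
top, depth, tau = threshold_top_k(lists, k=3)
print(top, "halted at depth", depth, "with tau =", tau)
```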









Why is TA correct?

• By monotonicity, the unseen objects cannot have a score higher than τ
• For the objects that have been seen, random accesses are performed, so the k objects with highest scores are the top-k result

Ann 98 Hugh 97 Kurt 96    τ(1) = 291
Ben 96 Ryan 94 Ann 95     τ(2) = 285
Kurt 93 Ann 92 Jane 95    τ(3) = 280

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...


















Comparing TA with FA



• Number of sequential accesses
At the time FA stops sequential accesses, τ is guaranteed to be no higher than the scores of the k objects seen on all sorted lists, so TA never needs more sequential accesses than FA
• Number of random accesses
TA requires m-1 random accesses for each object
But FA is expected to randomly access more objects
• Size of buffers used
Buffer used by FA can be unbounded
TA only needs to remember k objects with their k scores, and the threshold value τ


















Random Access



Random accesses may be impossible
Text retrieval: sorted lists are results of search engines
Random accesses are expensive
Sequential accesses on disk are orders of magnitude faster than random accesses
We need to consider not using random accesses, or using them as sparingly as possible









No Random Access

Without random access, all we know are the upper bounds



Lmath Lphys Lcomp

Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...





Carl’s scores on physics and computer science are not higher

than 89 and 92 respectively


















Lower and Upper Bounds

If an object has not been seen on one attribute

Lower bound is 0

Upper bound is the last seen value

Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...





The lower bound of Carl’s score on physics is 0

The upper bound of Carl’s score on physics is 89









Worst and Best Scores (I)

W (R): The worst possible score of tuple R

B (R): The best possible score of tuple R



Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...





W (Carl) = 90

B (Carl) = 90 + 89 + 92 = 271


















Worst and Best Scores (II)
W (R) ≤ Score of R ≤ B (R)
W (R) and B (R) are updated as the values of R are sequentially accessed

Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...



Ann Hugh Kurt

W 98 97 96

B 291 291 291









Worst and Best Scores (II)
W (R) ≤ Score of R ≤ B (R)
W (R) and B (R) are updated as the values of R are sequentially accessed

Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...



Ann Hugh Kurt Ben Ryan

W 98→193 97 96 96 94

B 291→287 291→288 291→286 285 285


















Worst and Best Scores (II)
W (R) ≤ Score of R ≤ B (R)
W (R) and B (R) are updated as the values of R are sequentially accessed

Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...



Ann Hugh Kurt Ben Ryan Jane

W 193→285 97 96→189 96 94 95

B 287→285 288→285 286→281 285→283 285→282 280


















No Random Access Algorithm (I)

Maintain the last-seen values x1,x2,…,xm

For every seen object, maintain its worst possible score, its

known attributes and their values



Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...



xmath = 96; xphys = 94; xcomp = 95
Ann: 193: {<math, 98>; <comp, 95>}









No Random Access Algorithm (II)

Why not maintain the best possible score for each object?



Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...



Ann Hugh Kurt Ben Ryan Jane

W 193→285 97 96→189 96 94 95

B 287→285 288→285 286→281 285→283 285→282 280



Too Frequently Updated!
















No Random Access Algorithm (III)

Let M be the kth largest W value

An object R is viable if B (R) ≥ M

Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...



Ann Hugh Kurt Ben Ryan Jane

W 285 97→188 189→280 96→189 94 95 M = 189

B 285 285→281 281→280 283→280 282→278 280→277









No Random Access Algorithm (III)

Let M be the kth largest W value

An object R is viable if B (R) ≥ M



Ann 98 Hugh 97 Kurt 96

Ben 96 Ryan 94 Ann 95

Kurt 93 Ann 92 Jane 95

Hugh 91 Kurt 91 Ben 93

Carl 90 Jane 89 Hugh 92

... ... ... ... ... ...

... ... ... ... ... ...

... ... ... ... ... ...



Ann Hugh Kurt Ben Ryan Jane

W 285 188→280 280 189 94 95→184 M = 280
B 285 281→280 280 280→278 278→276 277→274


















No Random Access Algorithm (IV)

Let set T contain objects with W (R) ≥ M

Halt when

There are at least k objects seen on all sorted lists

No viable objects left outside set T





Ann Hugh Kurt Ben Ryan Jane

W 285 188→280 280 189 94 95→184 M = 280
B 285 281→280 280 280→278 278→276 277→274





T = {Ann, Hugh, Kurt}
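The bookkeeping of W, B, M and T can be sketched as follows. The marks hidden behind the "..." rows are hypothetical fill-ins consistent with the visible rows, and `SCORES`, `build_sorted_lists` and `nra_top_k` are names introduced here. Viability follows the definition used on these slides (B(R) ≥ M), and the sketch also checks the best possible score of objects that have not been seen at all.

```python
# (math, phys, comp); values not visible on the slides are hypothetical.
SCORES = {
    "Ann": (98, 92, 95), "Hugh": (91, 97, 92), "Kurt": (93, 91, 96),
    "Ben": (96, 85, 93), "Ryan": (88, 94, 90), "Jane": (86, 89, 95),
    "Carl": (90, 84, 88),
}


def build_sorted_lists(scores):
    m = len(next(iter(scores.values())))
    return [
        sorted(((name, vals[i]) for name, vals in scores.items()),
               key=lambda entry: -entry[1])
        for i in range(m)
    ]


def nra_top_k(lists, k):
    m = len(lists)
    known = {}                      # object -> {list index: value seen so far}
    T, M = [], None
    for depth in range(len(lists[0])):
        last = [lst[depth][1] for lst in lists]     # last-seen values x1..xm
        for i, lst in enumerate(lists):
            obj, val = lst[depth]                   # sequential access only
            known.setdefault(obj, {})[i] = val
        # W: unseen attributes count as 0. B: they take the last-seen value.
        W = {o: sum(v.values()) for o, v in known.items()}
        B = {o: sum(v.get(i, last[i]) for i in range(m))
             for o, v in known.items()}
        if len(known) < k:
            continue
        ranked = sorted(known, key=lambda o: (-W[o], -B[o], o))
        T, rest = ranked[:k], ranked[k:]
        M = W[T[-1]]                                # k-th largest W value
        unseen_best = sum(last)     # B of any object not seen on any list
        if unseen_best < M and all(B[o] < M for o in rest):
            return T, depth + 1, M  # no viable object left outside T
    return T, len(lists[0]), M


lists = build_sorted_lists(SCORES)
T, depth, M = nra_top_k(lists, k=3)
print(T, "halted at depth", depth, "with M =", M)
```

NRA returns the top-k objects, but in general not their exact scores; here the scores happen to be fully known when the algorithm halts.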









Why is NRA correct?



W (R) ≤ Score of R ≤ B (R) always holds

If an object R is not viable, Score of R ≤ B (R) ≤ M, then

there are at least k objects with scores not lower than R

Therefore, if there is no viable object outside T and T

contains at least k objects, T is the set of top-k result


















Comparing NRA with TA



• Number of sequential accesses

The number of sequential accesses of NRA is at least the last

position of top-k result on all attributes

• Number of random accesses

NRA is obviously 0

• Size of buffers used

TA remembers k objects with k scores, and the threshold

value τ

NRA remembers all viable objects with its scores on all seen

attributes, and the last-seen value on all attributes









How deep can NRA go?

Ann 98 Hugh 97 Kurt 96

Hugh 97 Kurt 96 Ann 95

Ben 60 Ryan 60 Jane 60

Ryan 60 Ben 60 Ben 60

Carl 60 Jane 60 Carl 60

... ... ... ... ... ...

Jane 60 Carl 60 Ryan 60

Kurt 0 Ann 0 Hugh 0





The set T can be identified quickly, but the scores of its members only become certain at the end of the lists
If we allow a relatively small number of random accesses, scanning the entire lists can be avoided


















The Combined Algorithm (I)



CA combines TA and NRA

cR: the cost of a random access

cS: the cost of a sequential access

h = ⌊cR/cS⌋

Run NRA, but every h steps perform random accesses, as in TA
h = ∞ → never do random access; CA is then NRA


















The Combined Algorithm (II)

Ann 98 Hugh 97 Kurt 96

Hugh 97 Kurt 96 Ann 95

Ben 60 Ryan 60 Jane 60

Ryan 60 Ben 60 Ben 60

Carl 60 Jane 60 Carl 60

... ... ... ... ... ...

Jane 60 Carl 60 Ryan 60

Kurt 0 Ann 0 Hugh 0







Random accesses for Ann, Hugh and Kurt quickly find out

the scores for Ann, Hugh and Kurt









The Combined Algorithm (III)



In CA, by doing random accesses, we wish to either

Confirm an object is a top-k result, or

Prune a viable object

As the number of random accesses in CA is limited, various

heuristics can be made to optimize CA in terms of total cost


















Reference

• Ronald Fagin, Amnon Lotem, Moni Naor: Optimal

aggregation algorithms for middleware. J. Comput. Syst.

Sci. 66(4): 614-656 (2003)









Mining and Searching Complex Structures Chapter 2 High Dimensional Data









Mining and Searching Complex

Structures

High Dimensional Data

Anthony K. H. Tung(鄧锦浩)

School of Computing

National University of Singapore

www.comp.nus.edu.sg/~atung









Research Group Link: http://nusdm.comp.nus.edu.sg/index.html

Social Network Link: http://www.renren.com/profile.do?id=313870900









Outline



• Sources of HDD

• Challenges of HDD

• Searching and Mining Mixed Typed Data

–Similarity Function on k-n-match

–ItCompress

• Bregman Divergence: Towards Similarity Search on Non-metric

Distance

• Earth Mover Distance: Similarity Search on Probabilistic Data

• Finding Patterns in High Dimensional Data


















Sources of High Dimensional Data



• Microarray gene expression

• Text documents

• Images

• Features of Sequences, Trees and Graphs

• Audio, Video, Human Motion Database (spatio-temporal as well!)


















Challenges of High Dimensional Data

• Indistinguishable
–The distance between the two nearest points and the two furthest points could be almost the same
• Sparsity
–As a result of the above, the data distribution is very sparse, giving no obvious indication of where the interesting knowledge is
• Large number of combinations
–Efficiency: How to test the number of combinations
–Effectiveness: How do we understand and interpret so many combinations?
















Similarity Search : Traditional Approach



• Objects represented by multidimensional vectors









Elevation Aspect Slope Hillshade (9am) Hillshade (noon) Hillshade (3pm) …



2596 51 3 221 232 148







• The traditional approach to similarity search: kNN query

Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

ID d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 Dist



P1 1.1 1 1.2 1.6 1.1 1.6 1.2 1.2 1 1 0.93



P2 1.4 1.4 1.4 1.5 1.4 1 1.2 1.2 1 1 0.98

P3 1 1 1 1 1 1 2 1 2 2 1.73

P4 20 20 21 20 22 20 20 19 20 20 57.7



P5 19 21 20 20 20 21 18 20 22 20 60.5



P6 21 21 18 19 20 19 21 20 20 20 59.8
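The Dist column is the Euclidean distance to Q, and a kNN query simply ranks the points by it. The sketch below is an illustration under that assumption; `DB`, `Q` and `ranked` are names chosen here, and `math.dist` requires Python 3.8 or later.

```python
from math import dist

Q = (1,) * 10
DB = {
    "P1": (1.1, 1, 1.2, 1.6, 1.1, 1.6, 1.2, 1.2, 1, 1),
    "P2": (1.4, 1.4, 1.4, 1.5, 1.4, 1, 1.2, 1.2, 1, 1),
    "P3": (1, 1, 1, 1, 1, 1, 2, 1, 2, 2),
    "P4": (20, 20, 21, 20, 22, 20, 20, 19, 20, 20),
    "P5": (19, 21, 20, 20, 20, 21, 18, 20, 22, 20),
    "P6": (21, 21, 18, 19, 20, 19, 21, 20, 20, 20),
}

# kNN query: rank all points by Euclidean distance to the query point Q.
ranked = sorted(DB, key=lambda pid: dist(DB[pid], Q))
for pid in ranked:
    print(pid, round(dist(DB[pid], Q), 2))
```

Because every dimension contributes a squared difference to the sum, a single dimension with a very large difference dominates the result, which is the weakness the following slides address.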












Deficiencies of the Traditional Approach



• Deficiencies

–Distance is affected by a few dimensions with high dissimilarity

–Partial similarities can not be discovered



• The traditional approach to similarity search: kNN query
Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

One dimension of each of P1, P2 and P3 is changed from 1 to 100; their distances change from 0.93, 0.98 and 1.73 to 99.0

ID  d1   d2   d3   d4   d5   d6   d7   d8   d9   d10  Dist
P1  1.1  100  1.2  1.6  1.1  1.6  1.2  1.2  1    1    99.0
P2  1.4  1.4  1.4  1.5  1.4  100  1.2  1.2  1    1    99.0
P3  1    1    1    1    1    1    2    100  2    2    99.0
P4  20   20   21   20   22   20   20   19   20   20   57.7
P5  19   21   20   20   20   21   18   20   22   20   60.5
P6  21   21   18   19   20   19   21   20   20   20   59.8


















Thoughts



• Aggregating too many dimensional differences into a single value results in too much information loss. Can we try to reduce that loss?
• While high dimensional data typically gives us problems when it comes to similarity search, can we turn what is against us into an advantage?
• Our approach: Since we have so many dimensions, we can compute more complex statistics over these dimensions to overcome some of the “noise” introduced by scaling of dimensions, outliers etc.









Mining and Searching Complex Structures


















The N-Match Query : Warm-Up



• Description

–Matches between two objects in n dimensions. (n ≤ d)

–The n dimensions are chosen dynamically to make the two objects match best.



• How to define a “match”

–Exact match

–Match with tolerance δ



• The similarity search example

Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

n = 6

ID   d1   d2   d3   d4   d5   d6   d7   d8   d9   d10  Dist
P1   1.1  100  1.2  1.6  1.1  1.6  1.2  1.2  1    1    0.2
P2   1.4  1.4  1.4  1.5  1.4  100  1.2  1.2  1    1    0.4
P3   1    1    1    1    1    1    2    100  2    2    0
P4   20   20   21   20   22   20   20   19   20   20   19
P5   19   21   20   20   20   21   18   20   22   20   19
P6   21   21   18   19   20   19   21   20   20   20   19



Mining and Searching Complex Structures









The N-Match Query : The Definition



• The n-match difference

Given two d-dimensional points P(p1, p2, …, pd) and Q(q1, q2, …, qd), let δi
= |pi - qi|, i=1,…,d. Sort the array {δ1 , …, δd} in increasing order and let
the sorted array be {δ1’, …, δd’}. Then δn’ is the n-match difference
between P and Q.

• The n-match query

Given a d-dimensional database DB, a query point Q and an
integer n (n≤d), find the point P ∈ DB that has the smallest
n-match difference to Q. P is called the n-match of Q.

[Figure: 2-D example with query Q at the origin and points A, B, C, D, E on axes x and y (0 to 10); 1-match = A, 2-match = B]

• The similarity search example

Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

n = 6 (the slide then raises n to 7 and 8; at n = 8 it shows 0.6 for P1 and 1 for P3)

ID   d1   d2   d3   d4   d5   d6   d7   d8   d9   d10  Dist
P1   1.1  100  1.2  1.6  1.1  1.6  1.2  1.2  1    1    0.2
P2   1.4  1.4  1.4  1.5  1.4  100  1.2  1.2  1    1    0.4
P3   1    1    1    1    1    1    2    100  2    2    0
P4   20   20   21   20   22   20   20   19   20   20   19
P5   19   21   20   20   20   21   18   20   22   20   19
P6   21   21   18   19   20   19   21   20   20   20   19
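The definition above can be sketched in a few lines of Python. This is a brute-force illustration rather than the indexed method of the lecture; the function names are hypothetical, and the data are rows P1 to P4 of the example table with the value 100 substituted as shown on the slide.

```python
import math

def euclidean(p, q):
    """Plain Euclidean distance, as used by the traditional kNN query."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def n_match_difference(p, q, n):
    """n-th smallest per-dimension difference |p_i - q_i|, for 1 <= n <= d."""
    diffs = sorted(abs(a - b) for a, b in zip(p, q))
    return diffs[n - 1]

def n_match(db, q, n):
    """ID of the point in db with the smallest n-match difference to q."""
    return min(db, key=lambda pid: n_match_difference(db[pid], q, n))

Q = [1] * 10
DB = {
    "P1": [1.1, 100, 1.2, 1.6, 1.1, 1.6, 1.2, 1.2, 1, 1],
    "P2": [1.4, 1.4, 1.4, 1.5, 1.4, 100, 1.2, 1.2, 1, 1],
    "P3": [1, 1, 1, 1, 1, 1, 2, 100, 2, 2],
    "P4": [20, 20, 21, 20, 22, 20, 20, 19, 20, 20],
}

for pid, point in DB.items():
    # Euclidean distance is dominated by the single outlying dimension;
    # the 6-match difference ignores the four worst dimensions.
    print(pid, round(euclidean(point, Q), 1), round(n_match_difference(point, Q, 6), 2))
print("6-match of Q:", n_match(DB, Q, 6))
```

Sorting the d differences costs O(d log d) per point, so this sketch scans the whole database; it is only meant to make the definition concrete.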



Mining and Searching Complex Structures


















The N-Match Query : Extensions

• The k-n-match query

Given a d-dimensional database DB, a query point Q, an integer k, and an
integer n, find a set S of k points from DB such that for any
point P1 ∈ S and any point P2 ∈ DB-S, P1’s n-match difference is no larger
than P2’s n-match difference. S is called the k-n-match of Q.

• The frequent k-n-match query

Given a d-dimensional database DB, a query point Q, an integer
k, and an integer range [n0, n1] within [1,d], let S0, …, Si (i = n1 - n0) be
the answer sets of the k-n0-match, …, k-n1-match queries, respectively.
Find a set T of k points such that for any point P1 ∈ T and any point
P2 ∈ DB-T, P1’s number of appearances in S0, …, Si is larger
than or equal to P2’s number of appearances in S0, …, Si.

[Figure: 2-D example with query Q at the origin and points A, B, C, D, E; 2-1-match = {A, D}, 2-2-match = {A, B}]

• The similarity search example

Q = ( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)   n = 6

ID   d1   d2   d3   d4   d5   d6   d7   d8   d9   d10  Dist
P1   1.1  100  1.2  1.6  1.1  1.6  1.2  1.2  1    1    0.2
P2   1.4  1.4  1.4  1.5  1.4  100  1.2  1.2  1    1    0.4
P3   1    1    1    1    1    1    2    100  2    2    0
P4   20   20   21   20   22   20   20   19   20   20   19
P5   19   21   20   20   20   21   18   20   22   20   19
P6   21   21   18   19   20   19   21   20   20   20   19
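Both extensions can be written directly from their definitions. The sketch below is a brute-force reading, with hypothetical function names and ties broken by database order (an assumption, since the slide does not say how ties are resolved); the data are the six rows of the example table.

```python
from collections import Counter

def n_match_difference(p, q, n):
    """n-th smallest per-dimension difference between p and q."""
    return sorted(abs(a - b) for a, b in zip(p, q))[n - 1]

def k_n_match(db, q, k, n):
    """IDs of the k points with the smallest n-match difference to q."""
    ranked = sorted(db, key=lambda pid: n_match_difference(db[pid], q, n))
    return ranked[:k]

def frequent_k_n_match(db, q, k, n0, n1):
    """The k IDs that appear most often in the k-n-match answers for n in [n0, n1]."""
    counts = Counter()
    for n in range(n0, n1 + 1):
        counts.update(k_n_match(db, q, k, n))
    return [pid for pid, _ in counts.most_common(k)]

Q = [1] * 10
DB = {
    "P1": [1.1, 100, 1.2, 1.6, 1.1, 1.6, 1.2, 1.2, 1, 1],
    "P2": [1.4, 1.4, 1.4, 1.5, 1.4, 100, 1.2, 1.2, 1, 1],
    "P3": [1, 1, 1, 1, 1, 1, 2, 100, 2, 2],
    "P4": [20, 20, 21, 20, 22, 20, 20, 19, 20, 20],
    "P5": [19, 21, 20, 20, 20, 21, 18, 20, 22, 20],
    "P6": [21, 21, 18, 19, 20, 19, 21, 20, 20, 20],
}

print(k_n_match(DB, Q, k=2, n=6))
print(frequent_k_n_match(DB, Q, k=2, n0=4, n1=8))
```

Counting appearances over a range of n removes the need to pick a single n, which is the point of the frequent variant.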

Mining and Searching Complex Structures









Cost Model



• The multiple system information retrieval model

–Objects are stored in different systems and scored by each system

–Each system can sort the objects according to their scores

–A query retrieves the scores of objects from the different systems and then combines them using some
aggregation function

Q : color=“red” & shape=“round” & texture=“cloud”

Each system lists the objects in ascending order of score:

System 1: Color      System 2: Shape      System 3: Texture
Object ID  Score     Object ID  Score     Object ID  Score
1          0.4       1          1.0       1          1.0
2          2.8       5          1.5       2          2.0
5          3.5       2          5.5       3          5.0
3          6.5       3          7.8       5          8.0
4          9.0       4          9.0       4          9.0

• The cost

–Retrieval of scores: proportional to the number of scores retrieved

• The goal

–To minimize the number of scores retrieved



Mining and Searching Complex Structures


















The AD Algorithm



• The AD algorithm for the k-n-match query

–Locate the query’s attribute value in every dimension

–Retrieve the objects’ attribute values outward from the query’s attribute value, in both directions

–The objects’ attributes are retrieved in Ascending order of their Differences to the query’s attributes. An object becomes an n-match
once it has appeared n times.

Example: 2-2-match of Q : ( 3.0 , 7.0 , 4.0 )

d1                   d2                   d3
Object ID  Attr      Object ID  Attr      Object ID  Attr
1          0.4       1          1.0       1          1.0
2          2.8       5          1.5       2          2.0
  (Q: 3.0)           2          5.5         (Q: 4.0)
5          3.5         (Q: 7.0)           3          5.0
3          6.5       3          7.8       5          8.0
4          9.0       4          9.0       4          9.0

Auxiliary structures

Next attribute to retrieve g[2d], as (object ID , difference), initially:

           d1         d2         d3
below Q    2 , 0.2    2 , 1.5    2 , 2.0
above Q    5 , 0.5    3 , 0.8    3 , 1.0

Attributes are retrieved in the order (2 , 0.2), (5 , 0.5), (3 , 0.8), (3 , 1.0), (2 , 1.5). Each of the first four is replaced in g by the next attribute in the same direction: (1 , 2.6), (3 , 3.5), (4 , 2.0), (5 , 4.0), respectively.

Number of appearances appear[c] when the search stops:

c          1  2  3  4  5
appear[c]  0  2  2  0  1

Answer set S = { 3 , 2 }
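The retrieval order described above maps naturally onto a priority queue. The following is a minimal in-memory sketch under that assumption; the function name, the heap-based bookkeeping in place of the g[2d] array, and the column layout are illustrative choices, not the lecture's implementation.

```python
import bisect
import heapq

def ad_k_n_match(columns, query, k, n):
    """columns[j] is a list of (value, object_id) pairs sorted by value for dimension j.
    Returns the k-n-match answer in the order the objects reach n appearances."""
    heap = []  # entries: (difference, dimension, direction, position)

    def push(j, pos, step):
        # Queue the attribute at `pos` in dimension j, if it exists.
        if 0 <= pos < len(columns[j]):
            value, _ = columns[j][pos]
            heapq.heappush(heap, (abs(value - query[j]), j, step, pos))

    # Locate the query in every dimension and seed both directions.
    for j, col in enumerate(columns):
        values = [v for v, _ in col]
        right = bisect.bisect_left(values, query[j])
        push(j, right - 1, -1)
        push(j, right, +1)

    appear, answer = {}, []
    while heap and len(answer) < k:
        _, j, step, pos = heapq.heappop(heap)   # smallest difference first
        obj = columns[j][pos][1]
        appear[obj] = appear.get(obj, 0) + 1
        if appear[obj] == n:
            answer.append(obj)
        push(j, pos + step, step)               # next attribute in the same direction
    return answer

COLUMNS = [
    [(0.4, 1), (2.8, 2), (3.5, 5), (6.5, 3), (9.0, 4)],  # d1
    [(1.0, 1), (1.5, 5), (5.5, 2), (7.8, 3), (9.0, 4)],  # d2
    [(1.0, 1), (2.0, 2), (5.0, 3), (8.0, 5), (9.0, 4)],  # d3
]
print(ad_k_n_match(COLUMNS, (3.0, 7.0, 4.0), k=2, n=2))
```

On the slide's example only 5 of the 15 attribute values are retrieved before the answer set is complete, which is the saving the cost model is designed to measure.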



Mining and Searching Complex Structures









The AD Algorithm : Extensions



• The AD algorithm for the frequent k-n-match query

–The frequent k-n-match query

• Given an integer range [n0, n1], find the k-n0-match, k-(n0+1)-match, ... , k-n1-match
of the query: S0, S1, ... , Si.

• Find the k objects that appear most frequently in S0, S1, ... , Si.

–This retrieves the same number of attributes as processing a k-n1-match query.



• Disk based solutions for the (frequent) k-n-match query



–Disk based AD algorithm

• Sort each dimension and store them sequentially on the disk

• When reaching the end of a disk page, read the next page from disk



–Existing indexing techniques

• Tree-like structures: R-trees, k-d-trees

• Mapping based indexing: space-filling curves, iDistance

• Sequential scan

• Compression based approach (VA-file)









Mining and Searching Complex Structures


















Experiments : Effectiveness

• Searching by k-n-match

–COIL-100 database

–54 features extracted, such as color histograms, area moments

k-n-match query, k=4

n    Images returned
5    36, 42, 78, 94
10   27, 35, 42, 78
15   3, 38, 42, 78
20   27, 38, 42, 78
25   35, 40, 42, 94
30   10, 35, 42, 94
35   35, 42, 94, 96
40   35, 42, 94, 96
45   35, 42, 94, 96
50   35, 42, 94, 96

kNN query

k    Images returned
10   13, 35, 36, 40, 42, 64, 85, 88, 94, 96

• Searching by frequent k-n-match

–UCI Machine learning repository

–Competitors:

• IGrid

• Human-Computer Interactive NN search (HCINN)

Data sets (d)      IGrid   HCINN   Freq. k-n-match
Ionosphere (34)    80.1%   86%     87.5%
Segmentation (19)  79.9%   83%     87.3%
Wdbc (30)          87.1%   N.A.    92.5%
Glass (9)          58.6%   N.A.    67.8%
Iris (4)           88.9%   N.A.    89.6%

Mining and Searching Complex Structures









Experiments : Efficiency

• Disk based algorithms for the frequent k-n-match query

–Texture dataset (68,040 records); uniform dataset (100,000 records)

–Competitors:

• The AD algorithm

• VA-file

• Sequential scan









Mining and Searching Complex Structures


















Experiments : Efficiency (continued)

• Comparison with other similarity search techniques

–Texture dataset ; synthetic dataset

–Competitors:

• Frequent k-n-match query using the AD algorithm

• IGrid

• Sequential scan









Mining and Searching Complex Structures









Future Work(I)

• We now have a natural way to handle similarity search for
data with both categorical and numerical attributes. Investigation of
k-n-match performance on such mixed-type data is currently
under way

• Likewise, applying k-n-match to data with missing or
uncertain attributes will be interesting

• Query={1,1,1,1,1,1,1,M,No,R}

ID d1 d2 d3 d4 d5 d6 d7 d8 d9 d10



P1 1.1 1 1.2 1.6 1.1 1.6 1.2 M Yes R



P2 1.4 1.4 1.4 1.5 1.4 1 1.2 F No B



P3 1 1 1 1 1 1 2 M No B



P4 20 20 21 20 22 20 20 M Yes G



P5 19 21 20 20 20 21 18 F Yes R



P6 21 21 18 19 20 19 21 F Yes Y



Mining and Searching Complex Structures


















Future Work(I): missing attributes

• The same example with some attribute values missing (the rows below therefore have fewer than ten entries)

• Query={1,1,1,1,1,1,1,M,No,R}

ID d1 d2 d3 d4 d5 d6 d7 d8 d9 d10



P1 1 1.2 1.6 1.1 1.6 1.2 M R



P2 1.4 1.4 1.5 1 1.2 F No B



P3 1 1 1 1 1 2 M No B



P4 20 20 20 22 20 20 M G



P5 19 21 20 20 20 18 Yes R



P6 21 18 20 21 F Yes Y



Mining and Searching Complex Structures









Future Work(II)

• In general, three things affect the result of a similarity search:
noise, scaling and axes orientation. k-n-match reduces the effect of
noise. The ultimate aim is a similarity function that is robust
to noise, scaling and axes orientation

• Eventually we will look at creating mining algorithms using k-n-match









Mining and Searching Complex Structures


















Outline



• Sources of HDD

• Challenges of HDD

• Searching and Mining Mixed Typed Data

–Similarity Function on k-n-match

–ItCompress

• Bregman Divergence: Towards Similarity Search on Non-metric Distance

• Earth Mover Distance: Similarity Search on Probabilistic Data

• Finding Patterns in High Dimensional Data









Mining and Searching Complex Structures









Motivation





[Figure: queries are issued against large data sets, which return results]

Ever-increasing data collection rates of modern
enterprises and the need for effective, guaranteed-quality
approximate answers to queries

Concern: compress as much as possible.





Mining and Searching Complex Structures 22


















Conventional Compression Method

• Try to find the optimal encoding of arbitrary strings for the input data:

–Huffman Coding

–Lempel-Ziv Coding (gzip)

• View the whole table as a large byte string

• Statistical or dictionary based

• Operate at the byte level









Mining and Searching Complex Structures 23









Why not just “syntactic”?



• Does not exploit the complex dependency patterns in the table

• Retrieving an individual tuple is difficult

• Does not make use of lossy compression









Mining and Searching Complex Structures 24


















Semantic compression methods



• Derive a descriptive model M

• Identify the data values that can be derived from M (within
some error tolerance), those that are essential for the derivation, and
those that are outliers

• Derived values need not be stored; only the outliers need to be









Mining and Searching Complex Structures 25









Advantages

• More Complex Analysis

–Example: detect correlation among columns

• Fast Retrieval

–Tuple-wise access

• Query Enhancement

–Possible to answer queries directly from the discovered semantics

–Compress in a way that enhances the answering of some complex
queries, e.g. “Go Green: Recycle and Reuse Frequent Patterns”, C.
Gao, B. C. Ooi, K. L. Tan and A. K. H. Tung. ICDE’2004.

Choose a combination of compression methods
based on semantic and syntactic information

Mining and Searching Complex Structures 26


















Fascicles

• Key observation

–Often, numerous subsets of records in T have similar values for
many attributes

Protocol  Duration  Bytes  Packets
http      12        20K    3
http      16        24K    5
http      15        20K    8
http      19        40K    11
http      26        58K    18
ftp       27        100K   24
ftp       32        300K   35
ftp       18        80K    15

• Compress data by storing representative values (e.g., “centroid”) only once for each
attribute cluster

• Lossy compression: information loss is controlled by
the notion of “similar values” for attributes (user-defined)





Mining and Searching Complex Structures 27









ItCompress: Compression Format

Original Table

age  salary  credit  sex
20   30k     poor    M
25   76k     good    F
30   90k     good    F
40   100k    poor    M
50   110k    good    F
60   50k     good    M
70   35k     poor    F
75   15k     poor    M

Error Tolerance:

age  salary  credit  sex
5    25k     0       0

Representative Rows (Patterns)

RRid  age  salary  credit  sex
1     30   90k     good    F
2     70   35k     poor    M

Compressed Table (one entry per original row, in the same order)

RRid  bitmap  Outlying value
2     0111    20
1     1111
1     1111
1     0100    40, poor, M
1     0111    50
1     0010    60, 50k, M
2     1110    F
2     1111
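The compressed format can be read as follows: each bit says whether the row's attribute is matched by the representative row within the error tolerance, and unmatched attributes are stored as outlying values. Below is a small sketch of that encoding and its lossy decoding. The function names are hypothetical, salaries are written in thousands, categorical attributes are assumed to have tolerance 0 (exact match), and the RRid for each row is taken as given in the slide's compressed table.

```python
def matches(value, rep, tol):
    """Numeric values match within +/- tol; other values must be equal."""
    if isinstance(value, (int, float)):
        return abs(value - rep) <= tol
    return value == rep

def encode_row(row, pattern, tols):
    """Return (bitmap, outliers) for `row` compressed against `pattern`."""
    bits = [matches(v, r, t) for v, r, t in zip(row, pattern, tols)]
    bitmap = "".join("1" if b else "0" for b in bits)
    outliers = [v for v, b in zip(row, bits) if not b]
    return bitmap, outliers

def decode_row(pattern, bitmap, outliers):
    """Rebuild a lossy row: pattern value where the bit is 1, stored outlier otherwise."""
    stored = iter(outliers)
    return tuple(r if bit == "1" else next(stored) for r, bit in zip(pattern, bitmap))

TOLS = (5, 25, 0, 0)  # age, salary (in k), credit, sex
PATTERNS = {1: (30, 90, "good", "F"), 2: (70, 35, "poor", "M")}

bitmap, outliers = encode_row((40, 100, "poor", "M"), PATTERNS[1], TOLS)
print(bitmap, outliers)
print(decode_row(PATTERNS[1], bitmap, outliers))
```

Decoding the fourth row returns a salary of 90k instead of 100k: the difference is within the 25k tolerance, which is exactly the loss the format allows.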

Mining and Searching Complex Structures


















Some definitions



• Error tolerance

–Numeric attributes

• The upper bound on how far the compressed value x’ may be from the actual value x

• x ∈ [ x’-ei, x’+ei ]

–Categorical attributes

• The upper bound on the probability that the compressed
value differs from the actual value

• Given an actual value x and its error tolerance ei, the
compressed value x’ should satisfy: Prob( x=x’ ) ≥ 1 - ei









Mining and Searching Complex Structures 29









Some definitions



• Coverage

–Let R be a row in the table T, and Pi be a pattern

–The coverage of Pi on R:

cov(P_i, R) = number of attributes X_j for which R[X_j] is matched by P_i[X_j]

• Total coverage

–Let P be a set of patterns P1,…,Pk, and let the table T
contain n rows R1,…,Rn

–Let P_{max}(R_i) be the pattern in P with the largest coverage on R_i

totalcov(P, T) = \sum_{i=1}^{n} cov(P_{max}(R_i), R_i)





Mining and Searching Complex Structures


















ItCompress: basic algorithm



• First randomly choose k rows as initial patterns

• Phase 1: scan the table T

–For each row R, compute the coverage of each pattern on it
and find Pmax(R)

–Allocate R to the pattern that covers it most

• Phase 2: after each scan, re-compute all patterns’
attributes, always using the most frequent values

• Iterate until the total coverage no longer increases
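The two phases can be sketched as a single iteration function plus a driver loop. This is an illustrative reading, not the authors' code: the names are hypothetical, salaries are in thousands, categorical tolerance is taken as 0, and "most frequent value" for a numeric attribute is interpreted as the value that matches the most allocated rows within the tolerance, with the first such value winning ties.

```python
def matches(value, rep, tol):
    """Numeric values match within +/- tol; other values must be equal."""
    if isinstance(value, (int, float)):
        return abs(value - rep) <= tol
    return value == rep

def coverage(pattern, row, tols):
    """Number of attributes of `row` matched by `pattern`."""
    return sum(matches(v, r, t) for v, r, t in zip(row, pattern, tols))

def best_representative(values, tol):
    """Value from `values` that matches the most members (first one wins ties)."""
    return max(values, key=lambda c: sum(matches(v, c, tol) for v in values))

def itcompress_iteration(rows, patterns, tols):
    # Phase 1: allocate each row to the pattern with the highest coverage.
    groups = [[] for _ in patterns]
    total = 0
    for row in rows:
        covs = [coverage(p, row, tols) for p in patterns]
        best = covs.index(max(covs))
        groups[best].append(row)
        total += covs[best]
    # Phase 2: re-compute each pattern from the rows allocated to it.
    new_patterns = []
    for pattern, group in zip(patterns, groups):
        if not group:                      # keep a pattern that attracted no rows
            new_patterns.append(pattern)
            continue
        columns = zip(*group)
        new_patterns.append(tuple(best_representative(c, t) for c, t in zip(columns, tols)))
    return new_patterns, total

def itcompress(rows, patterns, tols, max_iter=100):
    last_total = -1
    for _ in range(max_iter):
        new_patterns, total = itcompress_iteration(rows, patterns, tols)
        if total <= last_total:            # total coverage stopped increasing
            break
        last_total, patterns = total, new_patterns
    return patterns

ROWS = [(20, 30, "poor", "M"), (25, 76, "good", "F"), (30, 90, "good", "F"),
        (40, 100, "poor", "M"), (50, 110, "good", "F"), (60, 50, "good", "M"),
        (70, 35, "poor", "F"), (75, 15, "poor", "M")]
TOLS = (5, 25, 0, 0)

print(itcompress_iteration(ROWS, [ROWS[0], ROWS[1]], TOLS))
print(itcompress(ROWS, [ROWS[0], ROWS[1]], TOLS))
```

Starting from the first two rows as patterns, one iteration of this sketch yields the patterns (70, 30k, poor, M) and (25, 90k, good, F), the same values the worked example reaches after its Phase 2.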







Mining and Searching Complex Structures 31









Example: the 1st iteration begins





Table T

age  salary  credit  sex
20   30k     poor    M
25   76k     good    F
30   90k     good    F
40   100k    poor    M
50   110k    good    F
60   50k     good    M
70   35k     poor    F
75   15k     poor    M

Error Tolerance:

age  salary  credit  sex
5    25k     0       0

Initial patterns

RRid  age  salary  credit  sex
1     20   30k     poor    M
2     25   76k     good    F



Mining and Searching Complex Structures


















Example: Phase 1

Patterns

RRid  age  salary  credit  sex
1     20   30k     poor    M
2     25   76k     good    F

Rows allocated to pattern 1

age  salary  credit  sex
20   30k     poor    M
40   100k    poor    M
60   50k     good    M
70   35k     poor    F
75   15k     poor    M

Rows allocated to pattern 2

age  salary  credit  sex
25   76k     good    F
30   90k     good    F
50   110k    good    F

Error Tolerance:

age  salary  credit  sex
5    25k     0       0

Mining and Searching Complex Structures









Example: Phase 2

Re-computed patterns

RRid  age  salary  credit  sex
1     70   30k     poor    M     (age was 20)
2     25   90k     good    F     (salary was 76k)

Rows allocated to pattern 1

age  salary  credit  sex
20   30k     poor    M
40   100k    poor    M
60   50k     good    M
70   35k     poor    F
75   15k     poor    M

Rows allocated to pattern 2

age  salary  credit  sex
25   76k     good    F
30   90k     good    F
50   110k    good    F

Error Tolerance:

age  salary  credit  sex
5    25k     0       0



Mining and Searching Complex Structures


















Convergence(I)

• Phase 1:

–When we assign the rows to the patterns that cover them most:

• For each row, the coverage increases or stays the same

So the total coverage also increases or stays the same

• Phase 2:

–When we re-compute the attribute values for the patterns:

• For each pattern, the coverage increases or stays the same

So the total coverage also increases or stays the same









Mining and Searching Complex Structures 35









Convergence(II)

• In both Phase 1 and Phase 2, the total coverage either increases
or stays the same, and it has an obvious upper bound (covering
the whole table)

The algorithm will converge eventually









Mining and Searching Complex Structures 36


















Complexity

• Phase 1:

–In each of the l iterations, we go through the n rows in the table and
match each row against the k patterns (2m comparisons per match)

The running time complexity is O(kmnl), where m is the
number of attributes

• Phase 2:

–Computing each new pattern Pi requires going through all
the domain values/intervals of each attribute

Assuming the total number of domain values/intervals is d, the
running time complexity is O(kdl)

The total time complexity is O(kmnl+kdl)







Mining and Searching Complex Structures 37









Advantages of ItCompress

• Simplicity and Directness

–Fascicles and SPARTAN use a two-phase process

• Find rules/patterns

• Compress the database using the discovered rules/patterns

–ItCompress optimizes the compression directly, without finding
rules/patterns that may not be useful (a.k.a. the microeconomic approach)

• Fewer constraints

–Does not need patterns to be matched completely or rules that apply
globally

• Easily tuned parameters









Mining and Searching Complex Structures 38


















Performance Comparison

• Algorithms

–ItCompress, ItCompress+gzip

–Fascicles, Fascicles+gzip

–SPARTAN+gzip

• Platform

–ItCompress, Fascicles: AMD Duron 700 MHz, 256 MB memory

–SPARTAN: four 700 MHz Pentium CPUs, 1 GB memory

• Datasets

–Corel: 32 numeric attributes, 35000 rows, 10.5MB

–Census: 7 numeric, 7 categorical, 676000 rows, 28.6MB

–Forest-cover: 10 numeric, 44 categorical, 581000 rows, 75.2MB



Mining and Searching Complex Structures 39









Effectiveness (Corel)









Mining and Searching Complex Structures 40


















Effectiveness (Census)









Mining and Searching Complex Structures 41









Effectiveness (Forest Cover)









Mining and Searching Complex Structures 42


















Efficiency









Mining and Searching Complex Structures 43









Varying k









Mining and Searching Complex Structures 44


















Varying Sample Ratio









Mining and Searching Complex Structures 45









Adding Noises (Census)









Mining and Searching Complex Structures 46


















Effect of Corruption

[Figure: effect of 20% corruption, shown per attribute A1 to A12]



Mining and Searching Complex Structures












Mining and Searching Complex Structures


















Findings

• ItCompress is

–More efficient than SPARTAN

–More effective than Fascicles

–Insensitive to parameter setting

–Robust to noises









Mining and Searching Complex Structures 49









Future work



• Can we perform mining on the compressed datasets using
only the patterns and the bitmap?

–Example: building a Bayesian belief network

• Is ItCompress a good “bootstrap” semantic compression
algorithm?

[Figure: a database is passed through ItCompress to give a compressed database, which is then passed to other semantic compression algorithms]









Mining and Searching Complex Structures


















Outline



• Sources of HDD

• Challenges of HDD

• Searching and Mining Mixed Typed Data

–Similarity Function on k-n-match

–ItCompress

• Bregman Divergence: Towards Similarity Search on Non-metric Distance

• Earth Mover Distance: Similarity Search on Probabilistic Data

• Finding Patterns in High Dimensional Data









Mining and Searching Complex Structures









Metric v.s. Non-Metric

• Euclidean distance dominates DB queries

• Similarity in human perception









• Metric distance is not enough!









2010-7-31 Mining and Searching Complex Structures 52


















Bregman Divergence



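[Figure: a convex function f(x) with the points (p, f(p)) and (q, f(q)); h is the tangent hyperplane at q. The Bregman divergence Df(p,q) is the vertical gap at p between f and h; the Euclidean distance between p and q is measured along the x-axis.]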









2010-7-31 Mining and Searching Complex Structures 53









Bregman Divergence

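• Mathematical Interpretation

–The distance between p and q is defined as the difference
between f(p) and the first order Taylor expansion at q

D_f(p, q) = f(p) - [ f(q) + \langle \nabla f(q), p - q \rangle ]

Here f(p) is f(x) evaluated at p, and the bracketed term is the first order Taylor expansion of f at q, evaluated at p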









2010-7-31 Mining and Searching Complex Structures 54


















Bregman Divergence

• General Properties

–Non-Negativity

• Df(p,q)≥0 for any p, q

–Identity of Indiscernibles

• Df(p,p)=0 for any p

–Symmetry and Triangle Inequality

• Do NOT hold in general









2010-7-31 Mining and Searching Complex Structures 55









Examples



Distance                f(x)               Df(p,q)                        Usage
KL-Divergence           x log x            p log (p/q)                    distribution, color histogram
Itakura-Saito Distance  -log x             p/q - log (p/q) - 1            signal, speech
Squared Euclidean       x^2                (p-q)^2                        Euclidean space
Von Neumann Entropy     tr(X log X – X)    tr(X log X – X log Y – X + Y)  symmetric matrix
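The rows of the table can be checked numerically from the standard definition D_f(p, q) = f(p) - f(q) - <∇f(q), p - q>. The sketch below uses hypothetical helper names and treats f as a sum over coordinates; for KL the result equals Σ p log(p/q) only when p and q have the same total mass, which is assumed in the example.

```python
import math

def bregman(f, grad_f, p, q):
    """D_f(p, q) = f(p) - f(q) - <grad f(q), p - q>."""
    return f(p) - f(q) - sum(g * (a - b) for g, a, b in zip(grad_f(q), p, q))

# Squared Euclidean: f(x) = sum x_i^2
def sq(x):
    return sum(v * v for v in x)

def sq_grad(x):
    return [2 * v for v in x]

# KL divergence: f(x) = sum x_i log x_i, for x_i > 0
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)

def neg_entropy_grad(x):
    return [math.log(v) + 1 for v in x]

# Itakura-Saito: f(x) = -sum log x_i, for x_i > 0
def burg(x):
    return -sum(math.log(v) for v in x)

def burg_grad(x):
    return [-1 / v for v in x]

p, q = (0.5, 0.5), (0.9, 0.1)
print(bregman(sq, sq_grad, p, q))                      # equals |p - q|^2
print(bregman(neg_entropy, neg_entropy_grad, p, q))    # KL(p || q)
print(bregman(neg_entropy, neg_entropy_grad, q, p))    # KL(q || p), a different value
print(bregman(burg, burg_grad, p, q))                  # Itakura-Saito distance
```

The two KL values differ, which illustrates why symmetry (and with it the usual metric-based pruning) cannot be relied on.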









2010-7-31 Mining and Searching Complex Structures 56


















Why in DB system?

• Database application

–Retrieval of similar images, speech signals, or time series

–Optimization on matrices in machine learning

–Efficiency is important!

• Query Types

–Nearest Neighbor Query

–Range Query









2010-7-31 Mining and Searching Complex Structures 57









Euclidean Space

• How to answer the queries

–R-Tree









2010-7-31 Mining and Searching Complex Structures 58


















Euclidean Space

• How to answer the queries

–VA File









2010-7-31 Mining and Searching Complex Structures 59









Our goal

• Re-use the infrastructure of existing DB system to support

Bregman divergence

–Storage management

–Indexing structures

–Query processing algorithms









2010-7-31 Mining and Searching Complex Structures 60


















Basic Solution

• Extended Space

–Convex function f(x) = x^2







point D1 D2 point D1 D2 D3



p 0 1 p+ 0 1 1



q 0.5 0.5 q+ 0.5 0.5 0.5



r 1 0.8 r+ 1 0.8 1.64



t 1.5 0.3 t+ 1.5 0.3 3.15









2010-7-31 Mining and Searching Complex Structures 61









Basic Solution

• After the extension

–Index extended points with R-Tree or VA File

–Re-use existing algorithms with lower and upper bounds on

the rectangles









2010-7-31 Mining and Searching Complex Structures 62


















How to improve?

• Reformulation of Bregman divergence

• Tighter bounds are derived

• No change on index construction or query processing

algorithm









2010-7-31 Mining and Searching Complex Structures 63









A New Formulation



h



h’ query vector vq









Df(p,q)+Δ





q p

D*f(p,q)









2010-7-31 Mining and Searching Complex Structures 64


















Math. Interpretation

• Reformulation of similarity search queries

–k-NN query: query q, data set P, divergence Df

• Find the point p, minimizing







–Range query: query q, threshold θ, data set P

• Return any point p that









2010-7-31 Mining and Searching Complex Structures 65









Naïve Bounds

• Check the corners of the bounding rectangles









2010-7-31 Mining and Searching Complex Structures 66


















Tighter Bounds

• Take the curve f(x) into consideration









2010-7-31 Mining and Searching Complex Structures 67









Query distribution

• Distortion of rectangles

–The difference between maximum and minimum distances

from inside the rectangle to the query









2010-7-31 Mining and Searching Complex Structures 68


















Can we improve it more?

• When Building R-Tree in Euclidean space

–Minimize the volume/edge length of MBRs

–Does it remain valid?









2010-7-31 Mining and Searching Complex Structures 69









Query distribution

• Distortion of bounding rectangles

–Invariant in Euclidean space (triangle inequality)

–Query-dependent for Bregman Divergence









2010-7-31 Mining and Searching Complex Structures 70


















Utilize Query Distribution

• Summarize the query distribution with O(d) real numbers

• Estimate the expected distortion of any bounding
rectangle in O(d) time

• Allows a better index to be constructed for both R-Tree and
VA File









2010-7-31 Mining and Searching Complex Structures 71









Experiments

• Data Sets

–KDD’99 data

• Network data, the proportion of packets in 72 different
TCP/IP connection types

–DBLP data

• Use co-authorship graph to generate the probabilities of the

authors related to 8 different areas









2010-7-31 Mining and Searching Complex Structures 72


















Experiment

• Data Sets

–Uniform Synthetic data

• Generate synthetic data with uniform distribution

–Clustered Synthetic data

• Generate synthetic data with Gaussian Mixture Model









2010-7-31 Mining and Searching Complex Structures 73









Experiments

• Methods to compare





             Basic  Improved Bounds  Query Distribution
R-Tree       R      R-B              R-BQ
VA File      V      V-B              V-BQ
Linear Scan  LS
BB-Tree      BBT









2010-7-31 Mining and Searching Complex Structures 74


















Existing Solution

• BB-Tree (L. Cayton, ICML 2008)

–Memory-based indexing tree

–Construct with k-means clustering

–Hard to update

–Ineffective in high-dimensional space









2010-7-31 Mining and Searching Complex Structures 75









Experiments

• Index Construction Time









2010-7-31 Mining and Searching Complex Structures 76


















Experiments

• Varying dimensionality









2010-7-31 Mining and Searching Complex Structures 77









Experiments

• Varying dimensionality (cont.)









2010-7-31 Mining and Searching Complex Structures 78


















Experiments

• Varying data cardinality









2010-7-31 Mining and Searching Complex Structures 79









Conclusion

• A general technique for similarity search with Bregman divergence

• All techniques are based on the existing infrastructure of
commercial databases

• Extensive experiments compare performance on R-Tree
and VA File with different optimizations









2010-7-31 Mining and Searching Complex Structures 80


















Outline



• Sources of HDD

• Challenges of HDD

• Searching and Mining Mixed Typed Data

–Similarity Function on k-n-match

–ItCompress

• Bregman Divergence: Towards Similarity Search on Non-metric Distance

• Earth Mover Distance: Similarity Search on Probabilistic Data

• Finding Patterns in High Dimensional Data









Mining and Searching Complex Structures









Motivation

• Probabilistic data is ubiquitous

–To represent the data uncertainty (WSN, RFID, moving

object monitoring)

–To compress data (image processing)

• Histogram is a good way to represent the prob. data

–Easy to capture

–Is very useful in image representation

• Colors

• Textures

• Gradient

• Depth









Mining and Searching Complex Structures


















Motivation

• Similarity search is important for managing prob. data

–Given a threshold θ, answer which sensors’ readings
are similar to those of sensor A (range query)

–Answer which k pictures are most similar (top-k query)

• Similarity function for prob. data should be carefully

chosen

–Bin by bin methods

• L1 and L2 norms

• χ2 distance

–Cross-bin methods

• Earth Mover’s Distance (EMD)

• Quadratic form







Mining and Searching Complex Structures









Outline

• Motivation

• Introduction to Earth Mover’s Distance (EMD)

• Related works

• Indexing the probabilistic data based on EMD

• Experimental results

• Conclusion and future work









Mining and Searching Complex Structures


















Introduction to Earth Mover’s Dist

• Bin by bin vs. cross bin

[Figure: pairs of histograms compared bin by bin and across bins]

–Bin-by-bin: not good!

–Cross bin: good! Can handle distribution shift

Mining and Searching Complex Structures









Introduction to Earth Mover’s Dist

• What is EMD?

–Earth (soil)

–Mover (to move, to transport)

–Distance (cost)

–Can be understood as the cost of moving soil

• See an example…









Mining and Searching Complex Structures


















Moving Earth













Mining and Searching Complex Structures






















Mining and Searching Complex Structures





















Mining and Searching Complex Structures












Mining and Searching Complex Structures


















The Difference?

Difference = (amount moved) * (distance moved)



Mining and Searching Complex Structures









Linear programming

[Figure: histogram P with m bins and histogram Q with n bins]

Work = sum over all movements of (distance moved) * (amount moved)









Mining and Searching Complex Structures



























Mining and Searching Complex Structures


















Mining and Searching Complex Structures



























Mining and Searching Complex Structures









Constraints





1. Move “earth” only from P to Q

[Figure: P with m clusters and Q with n clusters; P’ and Q’ mark the clusters involved in a movement]





Mining and Searching Complex Structures


















Constraints





2. Cannot send more “earth” than

P there is



m clusters

P’



Q



n clusters Q’





Mining and Searching Complex Structures









Constraints





3. Q cannot receive more “earth”

P than it can hold



m clusters

P’



Q



n clusters Q’





Mining and Searching Complex Structures









82

Mining and Searching Complex Chapter 2 Structures High Dimensional Data









Constraints





4. As much “earth” as possible

P must be moved



m clusters

P’



Q



n clusters Q’
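The four constraints can be written as a linear program. The block below is a sketch in standard transportation-problem notation, not taken from the slides: p_i and q_j are assumed to be the amounts of earth in cluster i of P and cluster j of Q, d_ij the ground distance between the two clusters, and f_ij the amount moved from i to j.

```latex
\begin{aligned}
\min_{f}\quad & \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, d_{ij} \\
\text{s.t.}\quad
 & f_{ij} \ge 0 && \text{(1) move earth only from } P \text{ to } Q \\
 & \sum_{j=1}^{n} f_{ij} \le p_i && \text{(2) cannot send more than there is} \\
 & \sum_{i=1}^{m} f_{ij} \le q_j && \text{(3) cannot receive more than it can hold} \\
 & \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij} = \min\Bigl(\sum_{i} p_i,\ \sum_{j} q_j\Bigr) && \text{(4) move as much as possible}
\end{aligned}
```

The objective is the total of (distance moved) * (amount moved) over all movements.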





Mining and Searching Complex Structures









The Formal Definition of EMD

• Earth Mover’s Distance (EMD)

–the minimum amount of work needed to change one

histogram into another









• Challenge of EMD

–Exact computation is expensive: O(N^3 log N) for histograms with N bins
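For intuition, there is one special case where no linear program is needed. The sketch below assumes two histograms over the same ordered bins, equal total mass, and ground distance |i - j| between bins i and j; under those assumptions EMD equals the sum of absolute differences of the cumulative sums. The function name `emd_1d` is illustrative only.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two histograms over the same bins 0..n-1,
    assuming equal total mass and ground distance d(i, j) = |i - j|."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have equal total mass")
    # The earth that has to cross the boundary between bin i and bin i+1
    # is the difference of the two cumulative sums at bin i.
    return sum(abs(cp - cq) for cp, cq in zip(accumulate(p), accumulate(q)))


p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
print(emd_1d(p, q))
```

The general case with arbitrary ground distances still requires solving the transportation problem, which is where the cubic cost comes from.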


















Related Works

• Filter-and-refine framework

–[1] Approximation Techniques for

Indexing the Earth Mover's Distance in

Multimedia Databases. ICDE 2006

• Cannot handle high

dimensional histograms



–[2] Efficient EMD-based Similarity

Search in Multimedia Databases via

Flexible Dimensionality Reduction.

SIGMOD 2008

• Based on a scan framework, which

limits scalability

• Uses a scanning scheme to

process queries

–Merit: obtains a good access order

when executing k-NN queries and thus

minimizes the number of candidates

–Demerit: needs to scan the whole dataset

to obtain the order, and thus scales

poorly







Mining and Searching Complex Structures









Related Works

• Related works

–Based on the filter-and-refine framework

–Rely on scanning and therefore scale poorly

• Our work

–Also based on the filter-and-refine method

–But avoids scanning the whole data set

• Uses B+ trees

• And thus achieves high scalability

• Our contributions

–To the best of our knowledge, the first paper to index high

dimensional probabilistic data based on the EMD

–Proposes algorithms for processing similarity queries based on a B+ tree

filter

–Improves the efficiency and scalability of EMD-based similarity search


















Indexing the probabilistic data

based on EMD

• Our intuition:

–primal-dual theory in linear programming





• Primal problem (EMD)









• Dual problem
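The formulas for the two problems did not survive extraction. As a sketch, assuming normalized histograms (both p and q sum to 1) so that the capacity constraints hold with equality, the standard transportation primal and its dual read as follows. Which dual variable is paired with which histogram is an assumption here.

```latex
\begin{aligned}
\textbf{Primal:}\quad & \min_{f}\ \sum_{i}\sum_{j} f_{ij}\, d_{ij}
 \quad\text{s.t.}\quad \sum_{j} f_{ij} = p_i,\quad \sum_{i} f_{ij} = q_j,\quad f_{ij} \ge 0 \\[4pt]
\textbf{Dual:}\quad & \max_{\pi,\phi}\ \sum_{i} \pi_i\, p_i + \sum_{j} \phi_j\, q_j
 \quad\text{s.t.}\quad \pi_i + \phi_j \le d_{ij}\ \ \text{for all } i, j
\end{aligned}
```

By weak duality, the dual objective of any feasible pair is at most the primal optimum, that is, at most EMD(p, q).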









Mining and Searching Complex Structures









Indexing the probabilistic data based on

EMD









• Good properties of dual space

–Constraints of the dual space are independent of the prob. data points (i.e., p and

q in this example)

• Thus, given any feasible solution (π, Ф) in the dual space, we can derive a

lower bound for EMD(p, q)

• The lower bound helps filter out histograms that cannot be hits.

–Given any feasible solution (π, Ф) in the dual space, a histogram p can be

mapped to a single value

• Can index histograms using a B+ tree
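The lower-bound idea can be sketched in a few lines. The example assumes ground distance |i - j| over three bins and a hand-picked feasible dual solution; the function names and the pairing of π with the data histogram and Ф with the query are illustrative assumptions, not the authors' code.

```python
def is_dual_feasible(pi, phi, d):
    """Check pi[i] + phi[j] <= d[i][j] for every pair of bins."""
    return all(pi[i] + phi[j] <= d[i][j] + 1e-12
               for i in range(len(pi)) for j in range(len(phi)))


def key(p, pi):
    """Map histogram p to one number using the pi half of the dual solution."""
    return sum(w * x for w, x in zip(pi, p))


def counter_key(q, phi):
    """Map histogram q to one number using the phi half of the dual solution."""
    return sum(w * x for w, x in zip(phi, q))


n = 3
d = [[abs(i - j) for j in range(n)] for i in range(n)]  # ground distance
pi = [0.0, -1.0, -2.0]   # hypothetical feasible choice: pi[i] = -i
phi = [0.0, 1.0, 2.0]    # phi[j] = j, so pi[i] + phi[j] = j - i <= |i - j|

p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
assert is_dual_feasible(pi, phi, d)
lower_bound = key(p, pi) + counter_key(q, phi)
print(lower_bound)  # never exceeds EMD(p, q), by weak duality
```

Because feasibility depends only on the ground distance, the same (π, Ф) can be reused for every histogram in the database, which is what makes a one-dimensional index possible.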
















Indexing the probabilistic data based on EMD

• 1. Mapping Construction

–Key and counter key









Key Counter key

–Assuming p is a histogram in DB, given a feasible solution

(π, Ф), we calculate the Key for each record in DB

–We can index those keys using B+ tree

–For each feasible solution (π, Ф), a B+ tree can be

constructed







Mining and Searching Complex Structures









Answering Range Query

• Range query based on B+ index

–Given any feasible solution (π, Ф) , we construct a B+ tree

using keys of histograms

–Given a query histogram, we calculate its counter key using

the operation of

–Given a similarity search threshold θ, we have proved that

all candidate histogram’s key can be bounded by







–To further filter the candidates, we use L B+ trees and

intersect their candidate results
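The exact key interval is missing from the slide. The sketch below uses one interval that follows from weak duality under stated assumptions: normalized histograms of equal dimension and a symmetric ground distance. With key(p) = Σ π_i p_i and ckey(p) = Σ Ф_i p_i, EMD(p, q) ≤ θ implies key(p) ≤ θ − ckey(q), and key(p) ≥ min_i(π_i + Ф_i) + key(q) − θ. A sorted list stands in for the B+ tree, and all names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def key(h, w):
    return sum(wi * hi for wi, hi in zip(w, h))


def build_index(db, pi):
    """Sorted keys with their histogram ids; a sorted list stands in for the B+ tree."""
    pairs = sorted((key(p, pi), pid) for pid, p in enumerate(db))
    return [k for k, _ in pairs], [pid for _, pid in pairs]


def range_candidates(keys, ids, q, pi, phi, theta):
    """Ids whose key lies in the interval implied by EMD(p, q) <= theta."""
    hi = theta - key(q, phi)                     # key(p) + ckey(q) <= theta
    slack = min(a + b for a, b in zip(pi, phi))  # key(p) + ckey(p) >= slack
    lo = slack + key(q, pi) - theta              # key(q) + ckey(p) <= theta
    return ids[bisect_left(keys, lo):bisect_right(keys, hi)]


pi = [0.0, -1.0, -2.0]   # hypothetical feasible dual solution for d = |i - j|
phi = [0.0, 1.0, 2.0]
db = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
keys, ids = build_index(db, pi)
q = [0.0, 0.5, 0.5]
print(sorted(range_candidates(keys, ids, q, pi, phi, theta=1.0)))
```

Candidates returned by the filter still need an exact EMD computation in the refine step; with L trees, the candidate sets from each tree would be intersected first.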
















Answering KNN Query



• K-NN query based on B+ index

–Given a query q, we issue search on

each B+ tree Tl with key(q, Фl)

–We create two cursors for each tree and

let them fetch records in opposite

directions (one left and one right)

–Whenever a record r has been

accessed by all B+ trees, it can be output

as a candidate for the k-NN query









Mining and Searching Complex Structures









Experimental Setup

• 3 real data sets

–RETINA1

• an image data set consisting of 3932 feline retina scans labeled

with various antibodies.

–IRMA

• contains 10000 radiography images from the Image Retrieval

in Medical Application (IRMA) project

–DBLP

• With parameter setting


















Experimental Results on

Query CPU Time









Mining and Searching Complex Structures









Experimental Results on

Scalability





[Figure legend: sigmod (the SIGMOD 2008 method [2]) vs. our (this work)]


















Conclusions

• We present a new indexing scheme for the general

purposes of similarity search on Earth Mover's Distance

• Our index method relies on the primal-dual theory to

construct mapping functions from the original

probabilistic space to one-dimensional domain

• Our B+ tree-based index framework has

–High scalability

–High efficiency

–can handle High dimensional data









Mining and Searching Complex Structures









Outline



• Sources of HDD

• Challenges of HDD

• Searching and Mining Mixed Typed Data

–Similarity Function on k-n-match

–ItCompress

• Bregman Divergence: Towards Similarity Search on Non-metric

Distance

• Earth Mover Distance: Similarity Search on Probabilistic Data

• Finding Patterns in High Dimensional Data


















A Microarray Dataset

1000 - 100,000 columns, 100 - 500 rows

            Class     Gene1  Gene2  Gene3  Gene4  Gene5  Gene6  ...
Sample1     Cancer
Sample2     Cancer
...
SampleN-1   ~Cancer
SampleN     ~Cancer



• Find closed patterns which occur frequently among genes.

• Find rules which associate certain combination of the

columns that affect the class of the rows

–Gene1,Gene10,Gene1001 -> Cancer

Mining and Searching Complex Structures









Challenge I

• Large number of patterns/rules

–number of possible column combinations is extremely high

• Solution: Concept of a closed pattern

–Patterns found in exactly the same set of rows are grouped together

and represented by their upper bound

• Example: the following patterns are found in rows 2, 3 and 4

i  ri               Class
1  a,b,c,l,o,s      C
2  a,d,e,h,p,l,r    C
3  a,c,e,h,o,q,t    C
4  a,e,f,h,p,r      ~C
5  b,d,f,g,l,q,s,t  ~C

upper bound (closed pattern): aeh

intermediate patterns: ae, ah, eh

lower bounds: e, h

“a”, however, is not part of the group (it also occurs in row 1)
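The grouping can be checked mechanically on the five-row example table. The sketch below is illustrative only; `rows_of` and `closure` are hypothetical helper names.

```python
ROWS = {
    1: set("abclos"),
    2: set("adehplr"),
    3: set("acehoqt"),
    4: set("aefhpr"),
    5: set("bdfglqst"),
}


def rows_of(pattern):
    """Ids of the rows that contain every item of the pattern."""
    return frozenset(i for i, items in ROWS.items() if set(pattern) <= items)


def closure(pattern):
    """Largest pattern found in exactly the same rows (the upper bound)."""
    rows = rows_of(pattern)
    if not rows:
        return set(pattern)
    return set.intersection(*(ROWS[i] for i in rows))


print(sorted(rows_of("eh")), sorted(closure("e")))
print(sorted(rows_of("a")))
```

Patterns e, h, ae, ah, eh and aeh all map to rows {2, 3, 4} and share the closure aeh, while a maps to rows {1, 2, 3, 4} and therefore belongs to a different group.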










Challenge II

• Most existing frequent pattern discovery algorithms perform

searches in the column/item enumeration space i.e. systematically

testing various combination of columns/items

• For datasets with 1000-100,000 columns, this search space is

enormous

• Instead, we adopt a novel row/sample enumeration algorithm for

this purpose. CARPENTER (SIGKDD’03) is the first

algorithm to adopt this approach









Mining and Searching Complex Structures









Column/Item Enumeration Lattice



• Each node in the lattice represents a combination of columns/items

• An edge exists from node A to B if A is a subset of B and A differs from B by only 1 column/item

• Search can be done breadth first

[Figure: enumeration lattice starting at {} and growing through a, b, c, ... and a,b, a,c, a,e, b,c, ... up to a,b,c,e]

i  ri               Class
1  a,b,c,l,o,s      C
2  a,d,e,h,p,l,r    C
3  a,c,e,h,o,q,t    C
4  a,e,f,h,p,r      ~C
5  b,d,f,g,l,q,s,t  ~C










Column/Item Enumeration Lattice

• Each node in the lattice represents a combination of columns/items

• An edge exists from node A to B if A is a subset of B and A differs from B by only 1 column/item

• Search can be done depth first

• Keep edges from parent to child only if the parent is a prefix of the child

[Figure: enumeration lattice starting at {} and growing through a, b, c, ... and a,b, a,c, a,e, b,c, ... up to a,b,c,e]

i  ri               Class
1  a,b,c,l,o,s      C
2  a,d,e,h,p,l,r    C
3  a,c,e,h,o,q,t    C
4  a,e,f,h,p,r      ~C
5  b,d,f,g,l,q,s,t  ~C
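A depth-first walk over the column enumeration lattice can be sketched as below. The minimum-support pruning is an added assumption (the slide only describes the traversal), and the function names are hypothetical.

```python
ROWS = [set("abclos"), set("adehplr"), set("acehoqt"),
        set("aefhpr"), set("bdfglqst")]
ITEMS = sorted(set().union(*ROWS))


def support(itemset):
    """Number of rows that contain every item of the itemset."""
    return sum(1 for row in ROWS if itemset <= row)


def enumerate_depth_first(minsup, prefix=(), start=0, out=None):
    """Visit item combinations depth first; a child extends its parent
    (the prefix) by one item that comes later in the item order."""
    if out is None:
        out = []
    for k in range(start, len(ITEMS)):
        candidate = prefix + (ITEMS[k],)
        sup = support(set(candidate))
        if sup >= minsup:  # supersets of an infrequent set are infrequent too
            out.append((candidate, sup))
            enumerate_depth_first(minsup, candidate, k + 1, out)
    return out


for itemset, sup in enumerate_depth_first(minsup=3):
    print("".join(itemset), sup)
```

With thousands of columns the number of nodes in this lattice explodes, which motivates enumerating rows instead.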

Mining and Searching Complex Structures









General Framework for Column/Item Enumeration



                              Read-based             Write-based                                    Point-based
Association Mining            Apriori[AgSr94], DIC   Eclat, MaxClique[Zaki01], FPGrowth[HaPe00]     Hmine
Sequential Pattern Discovery  GSP[AgSr96]            SPADE[Zaki98,Zaki01], PrefixSpan[PHPC01]
Iceberg Cube                  Apriori[AgSr94]        BUC[BeRa99], H-Cubing[HPDW01]









Mining and Searching Complex Structures









92

Mining and Searching Complex Chapter 2 Structures High Dimensional Data









A Multidimensional View



[Figure: three axes of the design space]

types of data or knowledge: associative pattern, sequential pattern, iceberg cube, others

pruning method / compression method: constraints, closed/max pattern, other interest measure

lattice traversal / main operations: read, write, point

Mining and Searching Complex Structures









Sample/Row Enumeration Algorithms

• To avoid searching the large column/item enumeration space, our

mining algorithms search for patterns/rules in the sample/row

enumeration space

• Our algorithms do not fit into the column/item enumeration

framework

• They are not YAARMA (Yet Another Association Rules Mining

Algorithm)

• Column/item enumeration algorithms simply do not scale for

microarray datasets









Mining and Searching Complex Structures









93

Mining and Searching Complex Chapter 2 Structures High Dimensional Data









Existing Row/Sample Enumeration Algorithms



• CARPENTER(SIGKDD'03)

–Find closed patterns using row enumeration

• FARMER(SIGMOD’04)

–Find interesting rule groups and building classifiers based on them

• COBBLER(SSDBM'04)

–Combined row and column enumeration for tables with large

number of rows and columns

• Topk-IRG(SIGMOD’05)

–Find top-k covering rules for each sample and build classifier

directly

• Efficiently Finding Lower Bound Rules(TKDE’2010)

–Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao.

What is Unequal among the Equals? Ranking Equivalent Rules from

Gene Expression Data. Accepted in TKDE

Mining and Searching Complex Structures









Concepts of CARPENTER

Example Table

i  ri               Class
1  a,b,c,l,o,s      C
2  a,d,e,h,p,l,r    C
3  a,c,e,h,o,q,t    C
4  a,e,f,h,p,r      ~C
5  b,d,f,g,l,q,s,t  ~C

Transposed Table, TT

ij   R(ij): C   R(ij): ~C
a    1,2,3      4
b    1          5
c    1,3
d    2          5
e    2,3        4
f               4,5
g               5
h    2,3        4
l    1,2        5
o    1,3
p    2          4
q    3          5
r    2          4
s    1          5
t    3          5

TT|{2,3}

ij   R(ij): C   R(ij): ~C
a    1,2,3      4
e    2,3        4
h    2,3        4
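The transposed table and a conditional transposed table such as TT|{2,3} can be derived from the example table as sketched below. The helper names `transpose` and `conditional` are illustrative, and class labels are left out to keep the sketch short.

```python
ROWS = {1: "abclos", 2: "adehplr", 3: "acehoqt", 4: "aefhpr", 5: "bdfglqst"}


def transpose(rows):
    """Transposed table TT: item -> sorted list of ids of rows containing it."""
    tt = {}
    for rid in sorted(rows):
        for item in rows[rid]:
            tt.setdefault(item, []).append(rid)
    return tt


def conditional(tt, x):
    """TT|X: keep only the tuples whose row list contains every row in X."""
    return {item: rids for item, rids in tt.items() if set(x) <= set(rids)}


TT = transpose(ROWS)
print(TT["a"], TT["l"])
print(conditional(TT, {2, 3}))
```

The items left in TT|X are exactly the items common to all rows in X, so TT|{2,3} holds a, e and h.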

Mining and Searching Complex Structures









94

Mining and Searching Complex Chapter 2 Structures High Dimensional Data









Row Enumeration

Row enumeration tree: each node is a combination of rows, shown with the items common to all of them

1 {abclos}
  12 {al}
    123 {a}
      1234 {a}
        12345 {}
      1235 {}
    124 {a}
      1245 {}
    125 {l}
  13 {aco}
    134 {a}
      1345 {}
    135 {}
  14 {a}
    145 {}
  15 {bls}
2 {adehplr}
  23 {aeh}
    234 {aeh}
      2345 {}
    235 {}
  24 {aehpr}
    245 {}
  25 {dl}
3 {acehoqt}
  34 {aeh}
    345 {}
  35 {qt}
4 {aefhpr}
  45 {f}
5 {bdfglqst}

Conditional transposed tables shown beside the tree (item: C rows | ~C rows):

TT|{1}: a: 1,2,3 | 4; b: 1 | 5; c: 1,3; l: 1,2 | 5; o: 1,3; s: 1 | 5

TT|{12}: a: 1,2,3 | 4; l: 1,2 | 5

TT|{123}: a: 1,2,3 | 4
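The enumeration tree can be reproduced with a short depth-first walk over row combinations. This is a sketch only, not CARPENTER itself: it applies none of the pruning methods, apart from not descending below a node whose common item set is already empty (an assumption made here to keep the walk small).

```python
ROWS = {1: set("abclos"), 2: set("adehplr"), 3: set("acehoqt"),
        4: set("aefhpr"), 5: set("bdfglqst")}


def row_enumeration(rows):
    """Depth-first walk over row combinations in increasing row-id order.
    Yields (row combination, items common to all of its rows)."""
    ids = sorted(rows)

    def visit(combo, common, start):
        for k in range(start, len(ids)):
            rid = ids[k]
            items = rows[rid] if common is None else common & rows[rid]
            yield combo + (rid,), items
            if items:  # nothing left to find below an empty item set
                yield from visit(combo + (rid,), items, k + 1)

    yield from visit((), None, 0)


nodes = dict(row_enumeration(ROWS))
print("".join(sorted(nodes[(2, 3)])), "".join(sorted(nodes[(1, 2)])))
```

With five rows the tree has at most 31 nodes, whereas the column lattice over 15 items has 2^15 subsets; the gap widens sharply on microarray data with few rows and many columns.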

Mining and Searching Complex Structures









Pruning Method 1



• Removing rows that appear in all tuples of the transposed table will not affect the results

TT|{2,3}

ij   C       ~C
a    1,2,3   4
e    2,3     4
h    2,3     4

r2 r3 {aeh} --> r2 r3 r4 {aeh}

r4 has 100% support in the conditional table of “r2r3”, therefore branch “r2 r3 r4” will be pruned.









Mining and Searching Complex Structures









95

Mining and Searching Complex Chapter 2 Structures High Dimensional Data









Pruning method 2



• If a rule has been discovered before, we can prune the enumeration below this node

–Because all rules below this node have been discovered before

–For example, at node 34, if we find that {aeh} has already been found, we can prune off all branches below it

TT|{3,4}

ij   C       ~C
a    1,2,3   4
e    2,3     4
h    2,3     4

[Figure: the row enumeration tree, with the branches below node 34 pruned]

Mining and Searching Complex Structures









Pruning Method 3: Minimum Support





• Example: From TT|{1}, we can see that the support of all possible patterns below node {1} will be at most 5 rows.

TT|{1}

ij   C       ~C
a    1,2,3   4
b    1       5
c    1,3
l    1,2     5
o    1,3
s    1       5














From CARPENTER to FARMER

• What if classes exist? What more can we

do?

• Pruning with Interestingness Measure

–Minimum confidence

–Minimum chi-square

• Generate lower bounds for classification/

prediction









Mining and Searching Complex Structures









Interesting Rule Groups

• Concept of a rule group/equivalent class

–rules supported by exactly the same set of rows are grouped together

• Example: the following rules are derived from rows 2, 3 and 4 with

66% confidence

i  ri               Class
1  a,b,c,l,o,s      C
2  a,d,e,h,p,l,r    C
3  a,c,e,h,o,q,t    C
4  a,e,f,h,p,r      ~C
5  b,d,f,g,l,q,s,t  ~C

upper bound: aeh --> C (66%)

ae --> C (66%), ah --> C (66%), eh --> C (66%)

lower bounds: e --> C (66%), h --> C (66%)

a --> C, however, is not in the group
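The support and confidence figures in this example can be recomputed directly from the table. The helper `rule_stats` below is a hypothetical name used only for this sketch.

```python
ROWS = {1: ("abclos", "C"), 2: ("adehplr", "C"), 3: ("acehoqt", "C"),
        4: ("aefhpr", "~C"), 5: ("bdfglqst", "~C")}


def rule_stats(antecedent, label):
    """Supporting rows, support and confidence of the rule antecedent -> label."""
    covered = [rid for rid, (items, _) in ROWS.items()
               if set(antecedent) <= set(items)]
    hits = [rid for rid in covered if ROWS[rid][1] == label]
    confidence = len(hits) / len(covered) if covered else 0.0
    return covered, len(hits), confidence


for antecedent in ("aeh", "ae", "e", "h", "a"):
    print(antecedent, rule_stats(antecedent, "C"))
```

Every rule in the group is matched by rows 2, 3 and 4, two of which are class C, hence the 66% confidence. The rule a --> C is matched by rows 1 to 4 and so falls in a different group with 75% confidence.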












Pruning by Interestingness Measure

• In addition, find only interesting rule groups (IRGs) based

on some measures:

–minconf: the rules in the rule group can predict the class on

the RHS with high confidence

–minchi: there is high correlation between LHS and RHS of

the rules based on chi-square test

• Other measures like lift, entropy gain, conviction etc. can

be handled similarly









Mining and Searching Complex Structures









Ordering of Rows: All Class C before ~C

[Figure: the row enumeration tree and the conditional transposed tables TT|{1}, TT|{12} and TT|{123} of the example table, with the class C rows (1, 2, 3) ordered before the ~C rows (4, 5)]

Mining and Searching Complex Structures









98

Mining and Searching Complex Chapter 2 Structures High Dimensional Data









Pruning Method: Minimum Confidence







• Example: In TT|{2,3} below, the maximum confidence of all rules below node {2,3} is at most 4/5: at most 4 class C rows (those of item a) can support a rule, while row 4 (~C) appears in every tuple

TT|{2,3}

ij   C         ~C
a    1,2,3,6   4,5
e    2,3,7     4,9
h    2,3       4
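One way to read the 4/5 bound is sketched below: the number of class C rows supporting any rule below the node is at most the longest C list in the conditional table, and every ~C row present in all tuples must also match the rule. This reading, and the function name, are assumptions made for illustration.

```python
def max_confidence_bound(cond_tt):
    """cond_tt maps item -> (class C row ids, ~C row ids) in TT|X.
    Returns an upper bound on the confidence of any rule -> C below X."""
    max_c = max(len(c_rows) for c_rows, _ in cond_tt.values())
    # ~C rows present in every tuple match every pattern built from these items
    forced_not_c = set.intersection(
        *(set(n_rows) for _, n_rows in cond_tt.values()))
    return max_c / (max_c + len(forced_not_c))


TT_23 = {
    "a": ([1, 2, 3, 6], [4, 5]),
    "e": ([2, 3, 7], [4, 9]),
    "h": ([2, 3], [4]),
}
print(max_confidence_bound(TT_23))
```

If the bound falls below minconf, the whole subtree below the node can be skipped.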









Mining and Searching Complex Structures









Pruning method: Minimum chi-square





Same as in computing maximum confidence

TT|{2,3}

ij   C         ~C
a    1,2,3,6   4,5
e    2,3,7     4,9
h    2,3       4

Contingency table

        C          ~C         Total
A       max=5      min=1      Computed
~A      Computed   Computed   Computed
Total   Constant   Constant   Constant







Mining and Searching Complex Structures









99

Mining and Searching Complex Chapter 2 Structures High Dimensional Data









Finding Lower Bound, MineLB



–Example: An upper bound rule with antecedent A=abcde and two rows (r1 : abcf) and (r2 : cdeg)

–Initialize lower bounds {a, b, c, d, e}

–Add “abcf”: new lower bounds {d, e}

• Candidate lower bounds: ad, ae, bd, be, cd, ce

• Removed since d, e are still lower bounds

–Add “cdeg”: new lower bounds {ad, bd, ae, be}

• Candidate lower bounds: ad, ae, bd, be

• Kept since no lower bound overrides them

[Figure: lattice over a, b, c, d, e, from the single items up to a,b,c,d,e, marking abc, cde, ad, ae, bd and be]
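The result of this example can be cross-checked by brute force. The sketch below is not the incremental MineLB procedure; it simply searches for the minimal subsets of the antecedent that are not contained in any of the given rows, which is how the example reads. The function name is hypothetical.

```python
from itertools import combinations


def lower_bounds(antecedent, other_rows):
    """Minimal subsets of the antecedent that are not contained in any of the
    given rows. Brute force over all subsets, smallest first."""
    items = sorted(antecedent)
    found = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = set(combo)
            if any(s <= set(row) for row in other_rows):
                continue  # still contained in one of the rows
            if any(lb <= s for lb in found):
                continue  # a smaller lower bound already covers it
            found.append(s)
    return found


print(["".join(sorted(lb)) for lb in lower_bounds("abcde", ["abcf"])])
print(["".join(sorted(lb)) for lb in lower_bounds("abcde", ["abcf", "cdeg"])])
```

The brute force is exponential in the size of the antecedent, which is why an incremental method that processes one row at a time is preferable in practice.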



Mining and Searching Complex Structures









Implementation



• In general, CARPENTER and FARMER can be implemented in many ways:

–FP-tree

–Vertical format

• For our case, we assume the dataset fits into main memory and use a pointer-based algorithm similar to BUC

[Figure: the transposed table TT of the example table]







Mining and Searching Complex Structures









100

Mining and Searching Complex Chapter 2 Structures High Dimensional Data









Experimental studies



• Efficiency of FARMER

–On five real-life datasets

• lung cancer (LC), breast cancer (BC) , prostate cancer (PC), ALL-

AML leukemia (ALL), Colon Tumor(CT)

–Varying minsup, minconf, minchi

–Benchmark against

• CHARM [ZaHs02] ICDM'02

• Bayardo’s algorithm (ColumE) [BaAg99] SIGKDD'99

• Usefulness of IRGs

–Classification









Mining and Searching Complex Structures









Example results--Prostate



[Figure: FARMER vs. ColumnE vs. CHARM on the prostate cancer dataset; x-axis: minimum support (3 to 9); y-axis: logarithmic scale from 1 to 100000]









Mining and Searching Complex Structures









101

Mining and Searching Complex Chapter 2 Structures High Dimensional Data









Example results--Prostate



[Figure: FARMER with minsup=1 and minchi=10 vs. FARMER with minsup=1 on the prostate cancer dataset; x-axis: minimum confidence (%) (0 to 99); y-axis: 0 to 1200]





Mining and Searching Complex Structures









Top k Covering Rule Groups



• Rank rule groups (upper bound) according to

– Confidence

– Support

• Top k Covering Rule Groups for row r

– k highest ranking rule groups that has row r as support and support

> minimum support

• Top k Covering Rule Groups =

TopKRGS for each row









Mining and Searching Complex Structures









102

Mining and Searching Complex Chapter 2 Structures High Dimensional Data









Usefulness of Rule Groups



• Rules for every row

• Top-1 covering rule groups sufficient to build CBA classifier

• No min confidence threshold, only min support

• #TopKRGS = k x #rows









Mining and Searching Complex Structures









Top-k covering rule groups



• For each row, we find the most significant k rule groups:

–based on confidence first

–then support

class  Items
C1     a,b,c
C1     a,b,c,d
C1     c,d,e
C2     c,d,e

• Given minsup=1, Top-1

–row 1: abc -> C1 (sup = 2, conf = 100%)

–row 2: abc -> C1

• abcd -> C1 (sup = 1, conf = 100%)

–row 3: cd -> C1 (sup = 2, conf = 66.7%)

• If minconf = 80%, ?

–row 4: cde -> C2 (sup = 1, conf = 50%)









Mining and Searching Complex Structures









103

Mining and Searching Complex Chapter 2 Structures High Dimensional Data









Main advantages of Top-k coverage rule group



• The number of rule groups is bounded by the product of k and the number

of samples

• Treats each sample equally and provides a complete description

for each row (small)

• The minimum confidence parameter is replaced by k

• Sufficient to build classifiers while avoiding excessive

computation









Mining and Searching Complex Structures









Top-k pruning

• At node X, the maximal set of rows covered by rules to

be discovered below X: rows containing X and rows

ordered after X.

– minconf: the minimum confidence of the discovered TopkRGs for all rows in the above

set

– minsup: the corresponding minsup

• Pruning

–If the estimated upper bound of confidence below X …

… i>0, j>0



$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \delta(S[i], T[j]) & \text{Match/mismatch} \\ V(i-1,j) + \delta(S[i], \_) & \text{Delete} \\ V(i,j-1) + \delta(\_, T[j]) & \text{Insert} \end{cases}$$



In the alignment, the last pair must be either

match/mismatch, delete, insert.

xxx…xx xxx…xx xxx…x_

| | |

xxx…yy yyy…y_ yyy…yy

match/mismatch delete insert

2010-7-31









Example (I)



_ A G C A T G C

_ 0 -1 -2 -3 -4 -5 -6 -7

A -1

C -2

A -3

A -4

T -5

C -6

C -7

2010-7-31









129

Mining and Searching Complex Structures Chapter 3 Similarity Search on Sequences









Example (II)



_ A G C A T G C

_ 0 -1 -2 -3 -4 -5 -6 -7

A -1 2 1 0 -1 -2 -3 -4

C -2 1 1 ?

3 2

A -3

A -4

T -5

C -6

C -7

2010-7-31









Example (III)



_ A G C A T G C

_ 0 -1 -2 -3 -4 -5 -6 -7

A -1 2 1 0 -1 -2 -3 -4

C -2 1 1 3 2 1 0 -1

A -3 0 0 2 5 4 3 2

A -4 -1 -1 1 4 4 3 2

T -5 -2 -2 0 3 6 5 4

C -6 -3 -3 0 2 5 5 7

C -7 -4 -4 -1 1 4 4 7
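The table in Examples (I) to (III) can be reproduced with the recurrence. The scoring values are not stated on the slides; match = +2, mismatch = -1 and gap = -1 are inferred here because they regenerate the numbers shown. The function name is illustrative.

```python
def global_alignment_table(s, t, match=2, mismatch=-1, gap=-1):
    """Fill V where V[i][j] is the best score aligning s[:i] with t[:j]."""
    v = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        v[i][0] = v[i - 1][0] + gap  # delete s[i-1]
    for j in range(1, len(t) + 1):
        v[0][j] = v[0][j - 1] + gap  # insert t[j-1]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            delta = match if s[i - 1] == t[j - 1] else mismatch
            v[i][j] = max(v[i - 1][j - 1] + delta,  # match/mismatch
                          v[i - 1][j] + gap,        # delete
                          v[i][j - 1] + gap)        # insert
    return v


table = global_alignment_table("ACAATCC", "AGCATGC")
for row in table:
    print(row)
```

The bottom-right entry is the optimal global alignment score of the two strings.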

2010-7-31









130

Mining and Searching Complex Structures Chapter 3 Similarity Search on Sequences









“q-grams” of strings







universal









2-grams
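The figure with the grams is missing from this slide. As a minimal sketch, the q-grams of a string are its overlapping substrings of length q; the function name is illustrative.

```python
def qgrams(s, q):
    """All overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
```

A string of length n has n - q + 1 such grams.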









2010-7-31









q-gram inverted lists





id  strings
0   rich
1   stick
2   stich
3   stuck
4   static

2-grams  inverted list (string ids)
at       4
ch       0 2
ck       1 3
ic       0 1 2 4
ri       0
st       1 2 3 4
ta       4
ti       1 2 4
tu       3
uc       3







2010-7-31









131

Mining and Searching Complex Structures Chapter 3 Similarity Search on Sequences









Searching using inverted lists

Query: “shtick”, ED(shtick, ?)≤1

sh ht ti ic ck # of common grams >= 3



id  strings
0   rich
1   stick
2   stich
3   stuck
4   static

2-grams  inverted list (string ids)
at       4
ch       0 2
ck       1 3
ic       0 1 2 4
ri       0
st       1 2 3 4
ta       4
ti       1 2 4
tu       3
uc       3
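The count filter used in this search can be sketched as follows. A string within edit distance k of the query must share at least (|query| - q + 1) - k * q grams with it, so only ids that appear on enough inverted lists survive. The sketch treats grams as sets rather than multisets, which is a simplification, and all names are illustrative.

```python
from collections import Counter, defaultdict

STRINGS = ["rich", "stick", "stich", "stuck", "static"]


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_lists(strings, q):
    """gram -> ids of the strings that contain it."""
    lists = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in sorted(set(qgrams(s, q))):
            lists[g].append(sid)
    return lists


def candidates(query, k, q, lists):
    """Ids sharing at least (|query| - q + 1) - k*q grams with the query."""
    threshold = (len(query) - q + 1) - k * q
    counts = Counter(sid for g in set(qgrams(query, q))
                     for sid in lists.get(g, []))
    return sorted(sid for sid, c in counts.items() if c >= threshold)


index = build_inverted_lists(STRINGS, 2)
print(candidates("shtick", k=1, q=2, lists=index))
```

Candidates still have to be verified with a real edit distance computation; the filter only guarantees that no true answer is dropped.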



2010-7-31









2-grams -> 3-grams?

Query: “shtick”, ED(shtick, ?)≤1

sht hti tic ick # of common grams >= 1



id  strings
0   rich
1   stick
2   stich
3   stuck
4   static

3-grams  inverted list (string ids)
ati      4
ich      0 2
ick      1
ric      0
sta      4
sti      1 2
stu      3
tat      4
tic      1 2 4
tuc      3
uck      3









132

Mining and Searching Complex Structures Chapter 3 Similarity Search on Sequences









Observation 1: dilemma of choosing “q”

Increasing “q” causing:

Longer grams Shorter lists

Smaller # of common grams of similar strings

id  strings
0   rich
1   stick
2   stich
3   stuck
4   static

2-grams  inverted list (string ids)
at       4
ch       0 2
ck       1 3
ic       0 1 2 4
ri       0
st       1 2 3 4
ta       4
ti       1 2 4
tu       3
uc       3

2010-7-31









Observation 2: skew distributions of gram

frequencies

DBLP: 276,699 article titles

Popular 5-grams: ation (>114K times), tions, ystem, catio









2010-7-31









133

Mining and Searching Complex Structures Chapter 3 Similarity Search on Sequences









VGRAM: Main idea

Grams with variable lengths (between qmin and qmax)

zebra

ze(123)

corrasion

co(5213), cor(859), corr(171)

Advantages

Reduce index size ☺

Reducing running time ☺

Adoptable by many algorithms ☺









2010-7-31









Challenges

Generating variable-length grams?

Constructing a high-quality gram dictionary?

Relationship between string similarity and their

gram-set similarity?

Adopting VGRAM in existing algorithms?









2010-7-31









134

Mining and Searching Complex Structures Chapter 3 Similarity Search on Sequences









Challenge 1: String -> Variable-length grams?



Fixed-length 2-grams



universal







Variable-length grams

universal

[2,4]-gram dictionary: ni, ivr, sal, uni, vers
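One way to turn a string into variable-length grams with this dictionary is the greedy scan sketched below: at each position take the longest dictionary gram that matches, fall back to the qmin-gram if none does, and drop a gram that is completely covered by one already chosen. This is a sketch of the idea as understood from the slide, not a verified copy of the VGRAM procedure; the function name is hypothetical.

```python
def vgram_decompose(s, dictionary, qmin, qmax):
    """Greedy variable-length decomposition: at each position take the longest
    dictionary gram that matches (or the qmin-gram if none does) and keep it
    unless an already chosen gram covers it completely."""
    chosen = []  # list of (start position, gram)
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # default when no dictionary gram matches
        for length in range(qmax, qmin - 1, -1):
            piece = s[p:p + length]
            if len(piece) == length and piece in dictionary:
                gram = piece
                break
        end = p + len(gram)
        subsumed = any(start <= p and end <= start + len(g)
                       for start, g in chosen)
        if not subsumed:
            chosen.append((p, gram))
    return chosen


D = {"ni", "ivr", "sal", "uni", "vers"}
print(vgram_decompose("universal", D, qmin=2, qmax=4))
```

The string yields four grams here instead of the eight fixed-length 2-grams, which is where the smaller index comes from.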

2010-7-31









Representing gram dictionary as

a trie





ni

ivr

sal

uni

vers









2010-7-31









135

Mining and Searching Complex Structures Chapter 3 Similarity Search on Sequences









Challenge 2: Constructing gram

dictionary

Step 1: Collecting frequencies of grams with length in [qmin,

qmax]





st 0, 1, 3

sti 0, 1

stu 3

stic 0, 1

stuc 3









Gram trie with frequencies

2010-7-31









Step 2: selecting grams

Pruning trie using a frequency threshold T (e.g., 2)









2010-7-31









136

Mining and Searching Complex Structures Chapter 3 Similarity Search on Sequences









Step 2: selecting grams (cont)



Threshold T = 2









2010-7-31









Final gram dictionary









[2,4]-grams



2010-7-31









137

Mining and Searching Complex Structures Chapter 3 Similarity Search on Sequences









Challenge 3: Edit operation’s effect on grams





Fixed length: q

universal







k operations could affect k * q grams









2010-7-31









Deletion affects variable-length grams





Not affected Affected Not affected







i-qmax+1 i i+qmax- 1

Deletion









2010-7-31









138

Mining and Searching Complex Structures Chapter 3 Similarity Search on Sequences









Grams affected by a deletion



Affected?





i-qmax+1 i i+qmax- 1

Deletion

[2,4]-grams

Deletion ni

ivr

universal

sal

uni

Affected? vers

2010-7-31









Grams affected by a deletion (cont)

Affected?





i-qmax+1 i i+qmax- 1

Deletion









Trie of grams

2010-7-31 Trie of reversed grams









139

Mining and Searching Complex Structures Chapter 3 Similarity Search on Sequences









# of grams affected by each operation





Deletion/substitution Insertion



0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0

_u_n_i_v_e_r_s_a_l_









2010-7-31









Max # of grams affected by k operations



Vector of s =



With 2 edit operations, at most 4 grams can be affected







Called NAG vector (# of affected grams)

Precomputed and stored









2010-7-31









140

Mining and Searching Complex Structures Chapter 3 Similarity Search on Sequences









Summary of VGRAM index









2010-7-31









Challenge 4: adopting VGRAM

Easily adoptable by many algorithms



Basic interfaces:

String s -> grams

Strings s1, s2 such that ed(s1,s2) <= k: lower bound on the # of common grams

Fixed length (q): (|s1|- q + 1) – k * q

Variable lengths: # of grams of s1 – NAG(s1,k)

2010-7-31









Example: algorithm using inverted lists

Query: “shtick”, ED(shtick, ?)≤1

Variable-length grams of the query: sh ht tick

id  strings
0   rich
1   stick
2   stich
3   stuck
4   static

2-grams (lower bound = 3)
…
ck    1 3
ic    0 1 2 4
…
ti    1 2 4
…

2-4 grams (lower bound = 1)
…
ck    1 3
ic    1 4
ich   0 2
…
tic   2 4
tick  1
…









142

Mining and Searching Complex Structures Chapter 3 Similarity Search on Sequences









PartEnum + VGRAM

PartEnum, fixed q-grams:

ed(s1,s2) …

… a->b b->λ λ->b

a a









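Edit script $s_i = (e_{i1}, e_{i2}, \ldots, e_{ik})$ : $T_1 \to T_2$; $\mathrm{cost}(s_i) = \sum_j \gamma(e_{ij})$

$\mathrm{EDist}(T_1, T_2) = \min_i \mathrm{cost}(s_i)$; unit cost: $\mathrm{EDist}(T_1, T_2) = \min(k)$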



Computational Complexity:

O (| T1 | × | T2 | × min(depth(T1 ), leaves(T1 )) × min(depth(T2 ), leaves(T2 )))



7/31/2010 21









Edit Operation Mapping



Edit operations mapping

One-to-one

Preserve sibling order

Preserve ancestor order

a a





d b e b c d





c d c d b

M(T1,T2) c d



e

T1 T2





7/31/2010 22









166

Mining and Searching Complex Chapter 4 Structures Similarity Search on Trees









Observation



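Edit operations do not change many sibling

relationships

[Figure: deleting node c (c->λ) from a tree rooted at a with children b, c, d, e; the children f, g, h, i of c take its place between b and d]

Sibling relations changed:

(b,c)->(b,f)

(c,d)->(i,d)

Node: varying number of children vs. at most 2 siblings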



7/31/2010 23









Binary Tree Representation

a

Binary Tree Representation

b e

Left-child, right sibling b



Normalized Binary Tree c d c d





a(1,8)

a b b b b c d d d e

b … c … c … c … e … ε … ε …ε … ε … ε

b(2,3)

ε b c e ε d b e ε ε

ε

b(5,6) T1

c(3,1)

c(6,4) e(8,7) 1 …1 …0 … 1 … 0 … 2 …0 …0 … 2 … 1

ε d(4,2)

T2

ε ε ε d(7,5) ε ε

1 …0 …1 … 0 … 1 … 2 …0 …1 … 0 … 1

1

ε ε

$$BBDist(T_1, T_2) = \sum_{i=1}^{|\Gamma|} |b_{1i} - b_{2i}| = 8$$

BBDist satisfies the triangle inequality
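The binary branch vectors can be computed without building the binary tree explicitly: in the left-child, right-sibling view, each node contributes the triple (its label, its first child's label, its right sibling's label), with ε for a missing neighbour. The sketch below uses T1 as labelled on the slide; the second tree is a hypothetical edit of T1 (one node deleted), chosen only to illustrate the distance. All names are illustrative.

```python
from collections import Counter

EPS = "ε"


def binary_branches(tree, right_sibling=EPS, out=None):
    """tree = (label, [children]). Each node contributes the triple
    (label, first child label, right sibling label), with ε for 'none'."""
    if out is None:
        out = Counter()
    label, children = tree
    first_child = children[0][0] if children else EPS
    out[(label, first_child, right_sibling)] += 1
    for k, child in enumerate(children):
        sibling = children[k + 1][0] if k + 1 < len(children) else EPS
        binary_branches(child, sibling, out)
    return out


def bbdist(t1, t2):
    """L1 distance between the binary branch count vectors of two trees."""
    b1, b2 = binary_branches(t1), binary_branches(t2)
    return sum(abs(b1[k] - b2[k]) for k in set(b1) | set(b2))


def leaf(label):
    return (label, [])


T1 = ("a", [("b", [leaf("c"), leaf("d")]),
            ("b", [leaf("c"), leaf("d")]),
            leaf("e")])
# hypothetical edit: delete the second b, so its children move up one level
T1_DELETED = ("a", [("b", [leaf("c"), leaf("d")]),
                    leaf("c"), leaf("d"), leaf("e")])
print(bbdist(T1, T1_DELETED))
```

Since a single edit operation changes only a handful of branches, BBDist divided by a small constant gives a cheap lower bound on the tree edit distance.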



7/31/2010 24









167

Mining and Searching Complex Chapter 4 Structures Similarity Search on Trees









One Edit Operation Effect

v’ v’







... ... ... ... ... ...

w1 w2 wl w l+m w l+m+1 w1 w2 wl v w l+m+1

... ... ... ...

... Each node appears in

w l+1 w l+m at most two binary

v’ v’ branches



... ...

w1 w1



w2 ...

w2

... ... ...

wl wl v ...

w l+1 ...

...

w l+m+1

w l+m w l+1

... ...

w l+m+1 w l+2

...

...

w l+m

...



ε



7/31/2010 25









Theorem



1 insertion/deletion incurs a difference of at most 5 on BBDist

1 relabeling incurs a difference of at most 4 on BBDist

T, T’, EDIST(T, T’) = k = ki + kd + kr ,

BDist(T,T’) …

•if … > d, Gi can be safely filtered;

•if τ(Q, Gi) ≤ d, Gi can be reported as a result directly;

•if ρ(Q, Gi) ≤ d, Gi can be reported as a result directly;

•otherwise, λ(Q, Gi) must be computed.









Subgraph exact Search



• Lemma

•Given two graphs G1 and G2 , if no vertex relabelling is

allowed in the edit operations, μ’(G1, G2) ≤ 4 · λ’(G1, G2),

where μ’ and λ’ are computed without vertex relabelling.

•(This lemma can be used in subgraph search, because if a

graph is subgraph-isomorphic to another graph, no vertex

relabelling happens.)

• AppSUB algorithm:

•Filtering based on the lower bound.









207

Mining and Searching Complex Structures Chapter 5 Graph Similarity Search









Experimental Results





• Compare with the exact algorithm









1,000 graphs were generated, D = 1k,T = 10,V = 4.

Randomly select 10 seed graphs to form D; a seed has 10 vertices.

6 query groups. Each group has 10 graphs. Graphs in the same

group have the same number of vertices.









Experimental Results



• Compare with the BLP method









208

Mining and Searching Complex Structures Chapter 5 Graph Similarity Search









Experimental Results



• Scalability over real datasets









Experimental Results



• Scalability over synthetic datasets









209

Mining and Searching Complex Structures Chapter 5 Graph Similarity Search









Experimental Results



• Performance of AppFULL









Experimental Results



• Performance of AppSUB









210

Mining and Searching Complex Structures Chapter 5 Graph Similarity Search









Outline



• Introduction

• Foundation

• State of the Art on Graph Matching

•Exact Graph Matching

•Error-Tolerant Graph Matching

• Search Graph Databases

•Graph Indexing Methods

• Our Works

•Star Decomposition

•Sorted Index For Graph Similarity Search









SEGOS: SEarch similar

Graphs On Star index



Xiaoli Wang

Xiaofeng Ding

Anthony K.H. Tung

Shanshan Ying

Hai Jin









211

Mining and Searching Complex Structures Chapter 5 Graph Similarity Search









Our Solutions



• Work 1: Scalability issue

•A full database scan

•An index mechanism is needed

• Existing indexing methods: Filtering power

•Rough bounds with poor filtering power



• Work 2: Sorted index for graph similarity search

•Propose a novel indexing and query processing framework

•Deploy a filtering strategy based on TA and CA methods

•All existing lower and upper GED bounds can be directly

integrated into our filtering framework









TA Method on the Top-k Query



• The database model used in TA





[Table dimensions: N objects (rows), M attributes (columns)]

Object ID   A1     A2
a           0.9    0.85
b           0.8    0.7
c           0.72   0.2
d           0.6    0.9
...         ...    ...

Sorted L1    Sorted L2
(a, 0.9)     (d, 0.9)
(b, 0.8)     (a, 0.85)
(c, 0.72)    (b, 0.7)
...          ...
(d, 0.6)     (c, 0.2)









212

Mining and Searching Complex Structures Chapter 5 Graph Similarity Search









TA method on the top-k query



• A simple query

•Find the top-2 objects on the ‘query’ of ‘A1&A2 ’

•This query results in the TA method combining the scores of

A1 and A2 by an aggregation function like



sum(A1,A2)







Aggregation function:

function that gives objects an overall score based on attribute

scores

examples: sum, min functions

Monotonicity!









Monotonicity in TA (Halting Condition)

• Main idea

•How do we know that the scores of seen objects are higher than

the scores of unseen objects?

•Predict the maximum possible score of unseen objects:

L1          L2
a: 0.9      d: 0.9      (seen)
b: 0.8      a: 0.85     (seen)
c: 0.72     b: 0.7      (seen)
...         ...
f: 0.65     f: 0.6      (possibly unseen)
...         ...
d: 0.6      c: 0.2      (possibly unseen)

Threshold value: ω = sum(0.72, 0.7) = 1.42
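The halting condition can be sketched in code. The version below assumes sum as the monotone aggregation function and uses only the four objects whose scores are fully listed on the slides (a, b, c, d); the function name and data layout are illustrative, not taken from the original TA paper.

```python
def threshold_algorithm(lists, scores, k):
    """lists: one list per attribute of (object id, score), sorted descending.
    scores: object id -> tuple of all attribute scores (random access).
    Returns the top-k (id, sum) pairs and the depth at which TA halted."""
    seen = {}
    top = []
    for depth in range(len(lists[0])):
        for lst in lists:  # sorted access, one entry per list
            obj, _ = lst[depth]
            if obj not in seen:  # random access for the remaining scores
                seen[obj] = sum(scores[obj])
        # best total any unseen object could still reach
        threshold = sum(lst[depth][1] for lst in lists)
        top = sorted(seen.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth + 1
    return top, len(lists[0])


scores = {"a": (0.9, 0.85), "b": (0.8, 0.7), "c": (0.72, 0.2), "d": (0.6, 0.9)}
L1 = sorted(((o, s[0]) for o, s in scores.items()), key=lambda x: -x[1])
L2 = sorted(((o, s[1]) for o, s in scores.items()), key=lambda x: -x[1])
top2, depth = threshold_algorithm([L1, L2], scores, k=2)
print(top2, depth)
```

On this data the scan stops after three rows, when the threshold has dropped to sum(0.72, 0.7) = 1.42 and the two best seen objects already score at least that much. Monotonicity of the aggregation function is what guarantees no unseen object can beat the threshold.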









213

Mining and Searching Complex Structures Chapter 5 Graph Similarity Search









A Top-2 Query Example

• Given 2 sorted lists for attributes A1 and A2:

L1           L2
(a, 0.9)     (d, 0.9)
(b, 0.8)     (a, 0.85)
(c, 0.72)    (b, 0.7)
...          ...
(d, 0.6)     (c, 0.2)

• Step 1
•Sorted access in parallel to every sorted list
•For each object seen:
• get all scores by random access
• determine sum(A1,A2)
• amongst 2 highest seen? keep in buffer

Depth 1 reads (a, 0.9) from L1 and (d, 0.9) from L2. Buffer after random access:

ID   A1    A2     sum(A1,A2)
a    0.9   0.85   1.75
d    0.6   0.9    1.5

• Step 2
•Determine threshold value based on objects currently seen under sorted access. ω = sum(L1, L2)
•2 objects with overall score ≥ threshold value ω? Stop
•else go to next entry position in sorted list and go to step 1

At depth 1, ω = sum(0.9, 0.9) = 1.8. Neither a (1.75) nor d (1.5) reaches ω, so continue.

• Step 1 (Again)
Depth 2 reads (b, 0.8) from L1 and (a, 0.85) from L2. The new object b scores sum(0.8, 0.7) = 1.5, which does not beat the buffered scores, so the buffer stays:

ID   A1    A2     sum(A1,A2)
a    0.9   0.85   1.75
d    0.6   0.9    1.5

• Step 2 (Again)
At depth 2, ω = sum(0.8, 0.85) = 1.65. Only a (1.75) reaches ω, so go to the next entry position and repeat step 1.


















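A Top-2 Query Example

Situation at stopping:

Depth 3 reads (c, 0.72) from L1 and (b, 0.7) from L2, so ω = sum(0.72, 0.7) = 1.42 (threshold value).
Both buffered objects, a (1.75) and d (1.5), score at least ω, so TA halts. Every possibly unseen object scores at most 1.42.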
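The two steps of the top-2 example can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function name `threshold_algorithm`, the in-memory dictionaries standing in for random access, and the assumption that every object occurs in every (equal-length) list are all choices made for this example. The data are the four objects a, b, c, d from the slides, with the elided middle entries left out.

```python
import heapq

def threshold_algorithm(sorted_lists, k, aggregate=sum):
    """Top-k by the threshold algorithm (TA).

    sorted_lists: one list per attribute of (object_id, score) pairs in
    descending score order; every object is assumed to occur in every list.
    Returns (top_k_pairs, depth_at_which_TA_halted).
    """
    # Random-access index per attribute: object_id -> score.
    lookup = [dict(lst) for lst in sorted_lists]
    overall = {}  # overall score of every object seen so far
    depth = len(sorted_lists[0])
    for pos in range(depth):
        last_seen = []
        for lst in sorted_lists:
            obj, score = lst[pos]  # Step 1: sorted access
            last_seen.append(score)
            if obj not in overall:
                # Random access to the object's score in every list.
                overall[obj] = aggregate([table[obj] for table in lookup])
        omega = aggregate(last_seen)  # Step 2: threshold value
        top = heapq.nlargest(k, overall.items(), key=lambda item: item[1])
        if len(top) == k and all(score >= omega for _, score in top):
            return top, pos + 1
    return heapq.nlargest(k, overall.items(), key=lambda item: item[1]), depth

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
top2, halted_at = threshold_algorithm([L1, L2], k=2)
```

The halting test relies on the aggregation function being monotone: with sum or min, no unseen object can score above ω, so the buffer is final as soon as its k entries all reach ω.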









TA-based Filtering Strategy for Graph

Search Problem

• A graph database with a query example







Sorted list L1 Sorted list L2 Sorted list L3


















Requirement

• An index structure
•Convenient for score-sorted lists construction
• Efficient star search algorithm
•Quickly return similar stars to a query star
• Sorted properties for the halting condition of TA
•The mapping distance of any unseen graph gi satisfies
• μ(q, gi) ≥ ω = sum(λ1,…, λm) > d (=τ*δ’)
•q is the query graph, τ is the distance threshold, and
•δ’ = max{4, [min{δ(q), δ(D’)} + 1]}
•where D’ is the set of all unseen graphs.

Recall the mapping distance in our previous work

•The mapping distance μ and the edit distance λ satisfy:
μ(q, gi) ≤ max{4, [min{δ(q), δ(gi)} + 1]} · λ(q, gi)
•We denote δ(q, gi) = max{4, [min{δ(q), δ(gi)} + 1]}, then δ(q, gi) ≤ δ’.
•If μ(q, gi) > τ*δ’, then λ(q, gi) ≥ μ(q, gi)/δ(q, gi) > τ*δ’/δ(q, gi) ≥ τ, and this graph can be safely filtered out.









Build Inverted Index Structures based

on the Star Decomposition

• The upper-level index

•Build an inverted index between stars and graphs

•Used to quickly return graph lists

• The lower-level index

•Build an inverted index between labels and stars

•Used to construct the sorted lists for top-k star search based on the TA filtering strategy


















Build Inverted Index Structures based

on the Star Decomposition









Top-k Star Search Algorithm



• Construct sorted lists


















Graph Score-sorted Lists



• Construct lists based on the top-k results









TA-based Graph Range Query



• Definition

•Given a graph database D and a query q, find all gi ∈ D that

are similar to q with λ(q, gi) ≤ τ. τ is the distance

threshold.

• Steps: given m sorted lists for a query graph q

•Perform sorted retrieval in a round-robin schedule to each

sorted list. For a retrieved graph gi, if Lm(q, gi) > τ, filter out

the graph; if Um(q, gi) ≤ τ, report the graph to the answer

set.

•For each sorted list SLj, let χj be the corresponding distance

last seen under sorted access. If ω = sum(χ1,…, χm) >

τ∗δ’, then halt. Otherwise, go to step 1.
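The range-query steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the SEGOS implementation: `mapping_distance` and `delta` are hypothetical callables standing in for μ(q, gi) and max{4, [min{δ(q), δ(gi)} + 1]}, the lower bound is taken as μ divided by that factor, the upper-bound test with Um is omitted, and all lists are assumed to have equal length. The toy distances are invented for the example.

```python
def ta_graph_range_query(sorted_lists, mapping_distance, delta, tau, delta_prime):
    """TA-style range filter over m score-sorted lists.

    sorted_lists: m lists of (graph_id, star_distance), ascending by distance.
    mapping_distance: hypothetical callable, graph_id -> mu(q, g).
    delta: hypothetical callable, graph_id -> normalisation factor for g.
    Returns (candidates, pruned); graphs never retrieved are pruned implicitly.
    """
    candidates, pruned, seen = [], [], set()
    depth = len(sorted_lists[0])
    for pos in range(depth):
        last_seen = []
        for lst in sorted_lists:  # round-robin sorted retrieval
            graph_id, star_distance = lst[pos]
            last_seen.append(star_distance)
            if graph_id in seen:
                continue
            seen.add(graph_id)
            # Lower bound on the edit distance derived from the mapping distance.
            lower_bound = mapping_distance(graph_id) / delta(graph_id)
            if lower_bound > tau:
                pruned.append(graph_id)
            else:
                candidates.append(graph_id)
        omega = sum(last_seen)  # sum of the distances last seen in each list
        if omega > tau * delta_prime:
            break  # halting condition: no unseen graph can qualify
    return candidates, pruned

lists = [
    [("g1", 0), ("g2", 1), ("g3", 5), ("g4", 6)],
    [("g1", 1), ("g3", 2), ("g2", 6), ("g4", 7)],
]
mu = {"g1": 1, "g2": 7, "g3": 7, "g4": 13}
candidates, pruned = ta_graph_range_query(
    lists, mu.get, lambda g: 4, tau=1, delta_prime=4
)
```

In this toy run the scan halts at depth 3, so g4 is never retrieved and its mapping distance is never computed.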


















CA-based Filtering Strategy



• The difference between TA and CA

•TA computes the mapping distance between two graphs when retrieving a new graph through sorted accesses

•CA processes graphs only at every h-th depth of the sorted scan: for seen but unprocessed graphs, it first filters with estimated mapping distance bounds, then uses the Incremental Hungarian algorithm to compute partial mapping distances for further filtering









CA-based Filtering Strategy



• Suppose l(g) = {l1,…,ly} ⊆ {1,2,…,m} is a set of known

lists of g seen below q. Let χ(g) be the multiset of

distances of the distinct stars of g last seen in known lists.

•Lower bound denoted by Lμ(q, g) is obtained by substituting

the missing lists j ∈ {1,2,…,m}\l(g) with χj (the distance

last seen under the jth list) in ζ(q, g)

•Upper bound denoted by Uμ(q, g) is computed as

Uμ(q, g) = t′(χ(g)) + χ ∗ (|g| − |χ(g)|)

• Theorem: Let g1 and g2 be two graphs; the bounds

obtained as above satisfy

ζ(g1, g2) ≤ Lμ(g1, g2) ≤ μ(g1, g2) ≤ Uμ(g1, g2)


















CA-based Filtering Strategy



• Dynamic Hungarian for partial mapping distance

•Given m sorted lists for q, suppose S′(g) ⊆ S(g) is a

multiset of stars in g seen below lists. Then we have μ(S(q),

S′(g)) ≤ μ(q, g)









CA-based Graph Range Query



• Steps: given m sorted lists for a query graph q

•Perform sorted retrieval in a round-robin schedule to each

sorted list. At each depth h of lists:

• Maintain the lowest values χ1, . . . , χm encountered in the

lists. Maintain a distance accumulator ζ(q, gi) and a multiset

of retrieved stars S′(gi) ⊆ S(gi) for each gi seen under lists.

• For each gi that is retrieved but unprocessed, if ζ(q, gi) > τ∗δgi, filter it out; if Lμ(q, gi) > τ∗δgi, filter it out; if Uμ(q, gi) ≤ τ∗δgi, add the graph to the candidate set. Otherwise, if μ(S(q), S′(gi)) > τ∗δgi, filter out the graph. Finally, run the Dynamic Hungarian to obtain Lm(q, gi) and Um(q, gi) for filtering.

•When a new distance is updated, compute a new ω. If ω =

t′(χ) > τ∗δ′, then halt. Otherwise, go to step 1.


















Experimental Results: Sensitivity test









Experimental Results: Index construction


















Experimental Results: compare with other

works varying distance thresholds









Experimental Results: compare with other

works varying dataset sizes


















References

• D. Conte, Pasquale Foggia, Carlo Sansone, and Mario Vento.

Thirty Years of Graph Matching in Pattern Recognition.

• P. Foggia, C. Sansone and M. Vento. A performance

comparison of five algorithms for graph isomorphism. In 3rd

IAPR-TC15 workshop on graph-based representations in

pattern recognition, 2001.

• K. Riesen, M. Neuhaus, and H. Bunke. Bipartite graph

matching for computing the edit distance of graphs. In GBRPR,

2007.

• P. Hart, N. Nilsson, and B. Raphael. A formal basis for the

heuristic determination of minimum cost paths. IEEE Trans.

SSC, 1966.









References

• D. Justice. A binary linear programming formulation of the

graph edit distance. IEEE TPAMI, 2006.

• R. Giugno and D. Shasha. Graphgrep: A fast and universal

method for querying graphs. In ICPR, 2002.

• R. D. Natale, A. Ferro, R. Giugno, M. Mongiovì, A. Pulvirenti,

and D. Shasha. SING: subgraph search in non-homogeneous

graphs. BMC Bioinformatics, 2010.

• X. Yan, P.S. Yu, and J. Han. Graph indexing: a frequent

structure-based approach. In SIGMOD, 2005.

• J. Cheng, Y. Ke, W. Ng, and A. Lu. Fg-index: towards

verification-free query processing on graph databases. In

SIGMOD, 2007.


















References

• D.W. Williams, J. Huan, and W. Wang. Graph database

indexing using structured graph decomposition. In ICDE, 2007.

• S. Zhang, M. Hu, and J. Yang. Treepi: a novel graph indexing

method. In ICDE, 2007.

• P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: tree + delta

>= graph. In VLDB, 2007.

• G. Wang, B. Wang, X. Yang, and G. Yu. Efficiently indexing

large sparse graphs for similarity search. IEEE TKDE, 2010.









Mining and Searching Complex Structures Chapter 6 Massive Graph Mining









Searching and Mining Complex

Structures

Massive Graph Mining

Anthony K. H. Tung(鄧锦浩)

School of Computing

National University of Singapore

www.comp.nus.edu.sg/~atung









Research Group Link: http://nusdm.comp.nus.edu.sg/index.html

Social Network Link: http://www.renren.com/profile.do?id=313870900









Graph applications: everywhere



And often, they are huge and messy.







social network



Bio Pathway









Co-authorship

network


















Knowledge: NOWHERE





Unless we manage to find where they hide.

Too many clues is like no clue.









Roadmap





Part I (1.5 hrs)

•Graph Mining Primer

•Recent advances in Massive Graph Mining

Part II (1.5 hrs)

•CSV: cohesive subgraph mining

•DNgraph mining: a triangle based approach


















Roadmap

• Graph Mining Primer

• Data mining vs. Graph mining

• Massive graph mining domain

• Types of graph patterns

• Properties of large graph structure

• Recent advances in Massive Graph Mining

• CSV: cohesive subgraph mining

• DNgraph mining: a triangle based approach









From Data Mining to Graph Mining



Data Mining
• Classification
• Clustering
• Association rule learning

Graph Mining
• Captures more complicated entity relationships.
• Output: patterns, which are smaller subgraphs with interpretable meanings.


















Massive graph mining domains

• Financial data analysis

• Bioinformatics network

• User profiling for customized search

• Identify financial crime









Financial data analysis

In the stock market, correlations among stocks help in profit making.
Mining stock correlation graphs supports predicting stocks' price changes for estimating future return, allocating portfolios, controlling risks, etc.

Figures: stocks correlation in tabular form; stocks correlation patterns (highly correlated stock sets)









Bioinformatics network



•Protein-protein interaction
• Underlies the fundamental activities of living cells.
• A dense graph pattern indicates these proteins have similar functionalities.

Figure: one representation of an assembled NEDD9 network


















User profiling for customized search

The Internet Movie Database (IMDB)

Registered users can comment on movies of their interest.

Mining the comment-sharing network provides insight into users' interests and thus facilitates customized search.

Figure: movie-centric view of the IMDB review network









Identify financial crime



Large classes of financial crimes such as money laundering,

follow certain transactional patterns.









Figures: geospatial information of suspects; a money laundering pattern


















Dense Graph Patterns

Clique/Quasi-Clique

A clique represents the highest level of internal interactions.

Quasi-clique is an "almost" clique with few missing edges.

High Degree Patterns

Concern the average vertex degree, where the degree of a vertex is the number of

edges incident to the vertex.
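The dense-pattern notions above can be made concrete with a small sketch. The slides do not fix a formal quasi-clique definition, so the edge-density threshold `gamma` used here is an assumption made for illustration, as are the function names and the toy graph.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """True if every pair of the given nodes is connected."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def edge_density(adj, nodes):
    """Fraction of possible edges among `nodes` that are present."""
    nodes = list(nodes)
    possible = len(nodes) * (len(nodes) - 1) // 2
    present = sum(1 for u, v in combinations(nodes, 2) if v in adj[u])
    return present / possible if possible else 0.0

def is_quasi_clique(adj, nodes, gamma):
    """Assumed definition: edge density of at least gamma (0 < gamma <= 1)."""
    return edge_density(adj, nodes) >= gamma

def average_degree(adj):
    """Average number of incident edges per vertex."""
    return sum(len(neighbours) for neighbours in adj.values()) / len(adj)

# Undirected toy graph: a 4-clique on {1, 2, 3, 4} with the edge 2-4 missing.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
```

Here {1, 2, 3} is a clique, while {1, 2, 3, 4} has 5 of its 6 possible edges and so only qualifies as a quasi-clique for thresholds up to 5/6.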









Dense graph patterns (cont.)

Dense Bipartite Patterns
Figure: bipartite graph of pathways and genes for the AML/ALL dataset.

Heavy Patterns
Figure: weighted, directed graph of online citation network, by Rosvall & Bergstrom


















Properties of large graph structure

Static

•Power law degree distributions.

•Small world phenomenon.

•Communities and clusters.

Dynamic

•Shrinking diameters of enlarging graphs

•Densification over time









Power law


















Large graph: properties and laws (cont.)

Dynamic

•Shrinking diameters of enlarging graphs.

•Densification over time









Roadmap

• Graph Mining Primer

• Large graph: properties and laws

• Approaches in Graph mining

• Pattern based Mining algorithms

• Practical techniques in Massive Graph Mining

• Graph summarization with randomized sampling

•Connectivity based traversal

•MapReduce based

• CSV: cohesive subgraph Mining

• DNgraph mining: a triangle based approach


















Pattern based Mining algorithms

Greedy methods

SUBDUE (PWKDD04), GBI(JAI94)

Apriori-based approaches (detail in next few slides)

AGM , FSG, gSpan

Inductive logic programming (ILP) oriented solutions

WARMR, FARMER

Kernel based solutions

Kernels for graph classification









Apriori Paradigm Recall





Search in breadth-first

manner

Use a lattice structure

to count candidate

subgraph sets

efficiently.





A search lattice for item set mining
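As a reminder of the paradigm, here is a minimal breadth-first Apriori sketch for itemsets, the setting shown in the search lattice. It is an illustration only: the function name, the absolute `min_sup` count and the toy transactions are assumptions for this example, and support counting is done by a plain scan rather than an optimised lattice structure.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support} for all itemsets with support >= min_sup."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
            and support(c) >= min_sup
        }
    return frequent

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = apriori(transactions, min_sup=3)
```

Each pass of the loop corresponds to one level of the lattice, which is what makes the search breadth-first.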


















Apriori-based Graph Mining

Performance bottleneck: candidate subgraph generation.

Solution:

1. Build a lexicographic order among graphs.

2. Search using depth-first strategy.

Very effective in mining large collections of small to medium

size graphs.









Graph summarization with randomized

sampling

• Efficient Aggregation for Graph Summarization –

SIGMOD 2008

• Graph Summarization with Bounded Error-SIGMOD

2008

• Mining graph patterns efficiently via randomized

summaries - VLDB 2009


















Efficient Aggregation for Graph

Summarization

As graph size increases, graph summarization becomes

crucial for visualizing the whole graph.

Criteria for an efficient summarization solution

Able to produce meaningful summarization for real

application.

Scalable to large graphs.

The choice: graph aggregation









Graph Aggregation

1. Summarization based on user-selected node attributes and

relationships.

2. Produce summaries with controllable resolutions.

“drill-down” and “roll-up” abilities to navigate

Propose two aggregation operations

SNAP: addresses 1

k-SNAP: addresses 2


















Operation SNAP

Group nodes by user-selected node attributes & relationships

Nodes in each group are homogeneous (in terms of attributes

and relationships).

Goal: minimum # of groups









How does SNAP work?



Top down approach

Initial Step: Use user selected attributes to group nodes.

Iterative Step:

If a group is not homogeneous w.r.t. relationships, split the

group based on its relationships with other groups.
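The top-down procedure can be sketched as a partition refinement. This is a simplified illustration with a single attribute and a single relationship type, not the published SNAP algorithm: it refines all groups in each round instead of picking one group to split at a time, and the function name and toy data are assumptions for this example.

```python
from collections import defaultdict

def snap(attrs, adj):
    """Group nodes so each group is homogeneous in attribute and relationships.

    attrs: node -> attribute value; adj: node -> set of neighbouring nodes.
    """
    # Initial step: group nodes by the user-selected attribute.
    group_of = dict(attrs)
    while True:
        # A node's signature is its current group plus the set of groups
        # it has relationships with.
        signature = {
            v: (group_of[v], frozenset(group_of[u] for u in adj[v]))
            for v in attrs
        }
        # Refinement only ever splits groups, so an unchanged group count
        # means every group is already homogeneous.
        if len(set(signature.values())) == len(set(group_of.values())):
            break
        group_of = signature  # iterative step: split non-homogeneous groups
    groups = defaultdict(set)
    for v, g in group_of.items():
        groups[g].add(v)
    return list(groups.values())

attrs = {1: "CS", 2: "CS", 3: "CS", 4: "EE", 5: "EE"}
adj = {1: {4}, 2: {5}, 3: set(), 4: {1}, 5: {2}}
groups = snap(attrs, adj)
```

In the toy data, node 3 shares the attribute CS with nodes 1 and 2 but has no relationship with the EE group, so it is split into a group of its own.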


















SNAP limitation

Homogeneity requirement for relationships

Noise and uncertainty









Users have no control over the resolutions of summaries

SNAP operation can result in a large number of small groups









Operation k-SNAP

The entities inside a group are not necessarily

homogeneous in terms of relationships with other

groups.

Users can control resolution by specifying k (#

groups).

Varying k provides “drill-down” and “roll-up”

abilities.


















Assess quality of summarization

Determined by sum of noisy relations.

When the relationship between two groups is strong

(>50%), count missing participants.

When the relationship between two groups is weak (…

Figure: pipeline stages …Mining -> Verification, operating on Raw DB, Summarized DB, Raw DB









Reduce false positive

Summarization may cause false positives: false embeddings in the summarized graph lead to false positives.
• Technique 1: merge vertices that are far away from each other.
•The length of the shortest path
•The probability of random walk
• Technique 2: merge vertices whose neighborhoods overlap.
•Cosine, Chi^2, Lift, Coherence
• Technique 3: Go back to raw database to do verification
It is guaranteed that there are no false positives.


















Summarization: Reduce false negative

Figure: summarization may miss embeddings, which causes false negatives



Technique 1: For raw database with frequency threshold min_sup,

we adopt a lower frequency threshold pseudo min_sup for

summarized database.

Technique 2: Iterate the mining steps for T times and combine the

results generated in each time.

It is NOT guaranteed that there are no false negatives, but the

probability is bounded by









Connectivity based traversal

CSV: Cohesive Subgraph Mining –SIGMOD 2008

(Discussed in detail in Part II)

Progressive Clustering of Networks using

Structure-Connected Order of Traversal –ICDE 2010


















Progressive clustering of networks using structure-

connected order of traversal

SCAN Algorithm

•Similar to DBSCAN: connectivity-based

•Average O(n) time

•Uses structural similarity measure, minimum cluster size mu, and

minimum similarity epsilon

•Finds outliers and hubs

Problems

•No automated way to find good epsilon

•Must rerun algorithm for each possible epsilon

•Epsilon is global threshold

• No hierarchical clusters

• No variation in cluster subtlety
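For reference, the structural similarity that SCAN thresholds with epsilon can be sketched as below. The formula follows the usual SCAN definition (shared closed neighbourhood size, normalised by the geometric mean of the two neighbourhood sizes), which the slide does not spell out, so treat the details as an assumption; the function names and toy graph are illustrative.

```python
import math

def structural_similarity(adj, u, v):
    """|N[u] & N[v]| / sqrt(|N[u]| * |N[v]|), where N[x] includes x itself."""
    closed_u = adj[u] | {u}
    closed_v = adj[v] | {v}
    return len(closed_u & closed_v) / math.sqrt(len(closed_u) * len(closed_v))

def is_core(adj, v, eps, mu):
    """A vertex is a core if at least mu vertices of N[v] are eps-similar to it."""
    similar = [w for w in adj[v] | {v} if structural_similarity(adj, v, w) >= eps]
    return len(similar) >= mu

# Triangle {1, 2, 3} with a pendant vertex 4 attached to vertex 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```

Because epsilon enters only through the comparison in `is_core`, every change of epsilon changes which vertices are cores, which is why SCAN has to be rerun for each candidate value.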









Solution



• Structure-Connected Order of Traversal (SCOT)

•Contains all possible epsilon-clusterings

• Efficient method to find global epsilon

• New Contiguous Subinterval Heap structure

(ContigHeap)

• New Progressive Mean Heap Clustering (ProClust)

•Epsilon-free

•Hierarchical

• Refinement by Gap Constraint (GapMerge)


















Original Network:







SCOT plot:









Optimal Global Epsilon



SCAN paper only contains supervised

sampling method.

Sample points, find k-NN similarities, sort,

plot, find knee visually

O(nd log n) time

In addition to clustering time



Our solution:

Knee hypothesis implies approx concave

plot

Optimal epsilon minimizes obtuse angle

between segments

Modified histogram and binary search: O(n)

time

Uses already done SCOT result


















ContigHeap



BuildContigHeap produces heap containing

all contiguous subintervals from SCOT

output in O(n) time, and integrates with

SCOT

Example:









GapMerge: Gap Constraint Refinement

Merges chained clusters, heap branches with single children

Does not merge across pruned heap nodes (local maxima boundary)

Gap constraint prevents clusters whose left or right boundaries differ by more

than mu from being merged

Such clusters are not redundant relative to the minimum interesting cluster size

Steps

1.Identify chains that meet gap constraint

2.When a node has more than one child or violates gap constraint, begin new chain.

3.Within each chain, calculate significance of each cluster in both up and down

directions

4.Begin with most redundant node, merge nodes in direction of least significance

5.After each merge, recalculate significances

6.Continue until chain contains one node, or no merging possible under gap constraint.


















MapReduce based approach

PEGASUS: A Peta-Scale Graph Mining System –ICDM

2009

Pregel: A System for Large-Scale Graph Processing – SIGMOD

2010









PEGASUS: A Peta-Scale

Graph Mining System

Handles real graphs such as the Yahoo! Web graph with up to 6.7 billion edges.

A Hadoop based graph mining package.

Targets primitive matrix operations such as matrix-vector multiplication (GIM-V).


















Motivation

Many Graph mining tasks require matrix

multiplication

PageRank,

Random Walk with Restart(RWR),

Diameter estimation, and

Connected components …

MapReduce provides a simplified programming

concept for large data processing

Details of the data distribution, replication, load balancing are

taken care of.

Provides a similar programming structure. i.e. functional

programming









GIM-V: Generalized Iterative Matrix-

Vector multiplication

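Intuition: Matrix Multiplication

$M \times v = v'$, where $v_i' = \sum_{j=1}^{n} m_{i,j} v_j$

combine2: the product $m_{i,j} v_j$
combineAll: the sum $\sum_{j=1}^{n}$ over the combine2 results
assign: writing the result to $v_i'$

Operator ×G is matrix multiplication expressed by the above 3 steps

×G is iteratively carried out until convergence.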
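The three steps can be sketched as one in-memory iteration. This is a single-machine illustration of the operator, not PEGASUS's Hadoop implementation; the function name `gim_v`, the edge-triple representation of the matrix and the tiny example are assumptions for this sketch.

```python
def gim_v(edges, vector, combine2, combine_all, assign):
    """One iteration of v' = M x_G v.

    edges: non-zero matrix entries as (i, j, m_ij) triples.
    vector: dict mapping vertex id -> current value v_j.
    """
    partial = {i: [] for i in vector}
    for i, j, m_ij in edges:
        partial[i].append(combine2(m_ij, vector[j]))  # combine2
    # combineAll over the partial results of row i, then assign to v_i.
    return {i: assign(vector[i], combine_all(partial[i])) for i in vector}

# Plain matrix-vector multiplication with M = [[1, 2], [0, 3]] and v = [1, 1].
edges = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
v = {0: 1.0, 1: 1.0}
v_next = gim_v(
    edges,
    v,
    combine2=lambda m, x: m * x,
    combine_all=sum,
    assign=lambda old, new: new,
)
```

Swapping the three callables is all that is needed to move from ordinary multiplication to the other graph mining operations.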


















×G and SQL

The matrix multiplication operation can be expressed by an SQL query.

If the graph is viewed as two tables:
an edge table E(sid, did, val) and
a vector table V(id, val),
then ×G becomes

SELECT E.sid, combineAll_{E.sid}(combine2(E.val, V.val))

FROM E, V

WHERE E.did = V.id

GROUP BY E.sid
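For ordinary multiplication the query can be run directly, with combine2 as a product and combineAll as SUM. The sketch below uses Python's built-in sqlite3 module and an in-memory database; the table contents and the added ORDER BY are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE E (sid INTEGER, did INTEGER, val REAL)")
conn.execute("CREATE TABLE V (id INTEGER, val REAL)")
# M = [[1, 2], [0, 3]] stored as non-zero entries, and v = [1, 1].
conn.executemany("INSERT INTO E VALUES (?, ?, ?)",
                 [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)])
conn.executemany("INSERT INTO V VALUES (?, ?)", [(0, 1.0), (1, 1.0)])

rows = conn.execute(
    """
    SELECT E.sid, SUM(E.val * V.val)   -- combineAll = SUM, combine2 = product
    FROM E, V
    WHERE E.did = V.id
    GROUP BY E.sid
    ORDER BY E.sid
    """
).fetchall()
conn.close()
```

The join on E.did = V.id pairs each matrix entry with the matching vector element, and the GROUP BY collects the products belonging to one output row.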









Generalize ×G

Vary the definition of the three steps to generalize ×G

PageRank:
$p = (cE^{T} + (1 - c)U)p$
E: row-normalized adjacency matrix
U: matrix with all elements = 1/n
c: damping factor = 0.85

combine2 = $c \times m_{i,j} \times v_j$
combineAll = $\frac{1 - c}{n} + \sum_{j=1}^{n} x_j$
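With these definitions, PageRank becomes a repeated application of the operator. The sketch below is a single-machine illustration under the assumptions that every vertex has at least one out-link (no dangling vertices), that assign simply keeps the new value, and that a fixed number of iterations stands in for a convergence test; the function name and toy graphs are made up for the example.

```python
def pagerank_gim_v(out_links, c=0.85, iterations=50):
    """PageRank via GIM-V style steps on a graph without dangling vertices."""
    nodes = sorted(out_links)
    n = len(nodes)
    # Entries of E^T: m_ij = 1 / outdeg(j) for every edge j -> i.
    edges = [(i, j, 1.0 / len(out_links[j])) for j in nodes for i in out_links[j]]
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        partial = {v: [] for v in nodes}
        for i, j, m_ij in edges:
            partial[i].append(c * m_ij * p[j])  # combine2
        # combineAll adds the (1 - c) / n term; assign keeps the new value.
        p = {i: (1 - c) / n + sum(partial[i]) for i in nodes}
    return p

cycle = pagerank_gim_v({"a": ["b"], "b": ["c"], "c": ["a"]})
ranks = pagerank_gim_v({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

On the three-vertex cycle every vertex keeps rank 1/3; in the second graph c receives rank from both a and b and therefore ends up ahead of b.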









Generalize (cont.)

By altering three functions, GIM-V adapts to

• Random Walk with Restart

• Diameter Estimation

• Connected Components
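As one instance, connected components can be obtained by propagating minimum vertex ids. The choice of functions below (combine2 passes on the neighbour's label, combineAll takes the minimum, assign keeps the smaller of the old and new label) follows the commonly described GIM-V formulation and should be read as an assumption rather than as the slide's content; the function name and toy graph are illustrative.

```python
def connected_components_gim_v(nodes, edges):
    """Label every vertex with the smallest vertex id in its component."""
    label = {v: v for v in nodes}
    # Treat the graph as undirected: every edge is usable in both directions.
    symmetric = list(edges) + [(v, u) for u, v in edges]
    while True:
        new_label = dict(label)
        for i, j in symmetric:
            # combine2 yields the neighbour's label; combineAll and assign
            # both reduce with min.
            new_label[i] = min(new_label[i], label[j])
        if new_label == label:  # converged
            return label
        label = new_label

components = connected_components_gim_v(
    nodes=[1, 2, 3, 4, 5],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

The number of iterations needed grows with the diameter of the largest component, since a label travels one hop per iteration.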


















GIM-V: How to

Stage 1

Combine2

V: Key = id, v: vval, E: Key = idsrc

Stage 2

Combineall & assign









Bottleneck: shuffling and disk I/O









GIM-V Block Multiplication (BL)









Advantage

Save on sorting

Data compression

Clustered Edge


















Block advantage (cont.)

Clustered edge:









GIM-V DI: Diagonal Block Iteration

Intuition
Increase multiplication inside an iteration to reduce # of iterations.

How
Reach local convergence within a block first before iterating.

Figure: compare GIM-V BL and DI


















Main Results

Scalability

GIM-V BL DI is ~5 times faster than GIM-V Base









Main Results (cont.)



Evolution of LinkedIn

Distribution of connected

components is stable after a

‘gelling’ point in 2003.


















Main Results (cont.)

Bimodal structure of Radius









Pregel: A System for Large-Scale

Graph Processing

A scalable and fault-tolerant platform with an API that is

sufficiently flexible to express arbitrary graph algorithms.

Model of computing:

Vertex centric, synchronized iterative model


















Graph Algorithms Implementation in

Pregel

Graph data stay on their respective machines; only messages are passed, NOT

graph state.









Pregel C++ API

• Compute() - executed at each active vertex in every

superstep.

•Query information about the current vertex and its edges.

•Send messages to other vertices.

•Inspect or modify the value associated with its vertex/out-

edges.

•State updates are visible immediately. No data races on

concurrent value access from different vertices

• Limiting the graph state managed by the framework to

single value per vertex or edge simplifies the main

computation cycle, graph distribution, and failure

recovery.
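The vertex-centric model can be imitated in a few lines of single-process Python. This sketch is not the Pregel C++ API: `run_pregel`, the signature of the compute callback and the vote-to-halt return flag are all hypothetical stand-ins chosen for illustration. The example propagates the maximum value through a small chain of vertices.

```python
def run_pregel(values, out_edges, compute, max_supersteps=30):
    """Run `compute` at every active vertex, superstep by superstep."""
    values = dict(values)
    inbox = {v: [] for v in values}
    active = set(values)
    superstep = 0
    while (active or any(inbox.values())) and superstep < max_supersteps:
        outbox = {v: [] for v in values}

        def send(target, message):
            # Messages become visible to the target in the next superstep.
            outbox[target].append(message)

        still_active = set()
        for v in values:
            if v not in active and not inbox[v]:
                continue  # halted, and no message to reactivate it
            new_value, vote_to_halt = compute(
                superstep, values[v], out_edges[v], inbox[v], send
            )
            values[v] = new_value  # one value per vertex
            if not vote_to_halt:
                still_active.add(v)
        inbox, active = outbox, still_active
        superstep += 1
    return values

def propagate_max(superstep, value, neighbours, messages, send):
    """Keep the largest value seen; notify neighbours whenever it grows."""
    new_value = max([value] + messages)
    if superstep == 0 or new_value > value:
        for target in neighbours:
            send(target, new_value)
    return new_value, True  # vote to halt; an incoming message reactivates

result = run_pregel(
    values={"A": 3, "B": 6, "C": 2, "D": 1},
    out_edges={"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]},
    compute=propagate_max,
)
```

Each vertex only reads its own value and its inbox and only writes its own value, which is the property that rules out data races between vertices.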


















Pregel C++ API (cont.)

• Message Passing

•Delivery order is not guaranteed, but each message is delivered and

never duplicated.

• Combiners

•Combine several messages to reduce overhead

• Aggregators

•Mechanism for global communication, monitoring, and data.

•A number of predefined aggregators, such as min, max, or

sum operations

• Topology mutation

•Change graph topology; resolve conflicts when individual

vertices send conflicting messages.









Pregel C++ API (cont.)

• Input and output

• Readers and writers


















Pregel implementation

• Designed for the Google cluster architecture

•Each cluster consists of thousands of commodity PCs

• Persistent data

•Stored in files on distributed file systems such as GFS or

BigTable

• Temporary data

•Stored as buffered messages on local disk.









Pregel: Assign load

• Divide graph vertices into partitions and assign to

different machines

•controllable by users, default method: hash

• In absence of fault:

•One master, many other workers on a cluster of machines.

• master assign load jobs, i/o and instruct on super steps

• Fault tolerent:

•Use checkpoint: master ping workers

•Confined recovery (undergoing): master log outgoing

message
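The default hash partitioning can be sketched as follows. This assumes non-negative integer vertex IDs and uses Python's built-in `hash`; the function names are hypothetical.

```python
from collections import defaultdict

def assign_partition(vertex_id: int, num_partitions: int) -> int:
    """Default-style assignment: hash(ID) mod N."""
    return hash(vertex_id) % num_partitions

def partition_vertices(vertex_ids, num_partitions):
    """Group vertex IDs by the partition they are assigned to."""
    parts = defaultdict(list)
    for vid in vertex_ids:
        parts[assign_partition(vid, num_partitions)].append(vid)
    return dict(parts)

parts = partition_vertices(range(10), 3)
```

Because the assignment depends only on the vertex ID, any machine can work out which partition owns a destination vertex without a lookup table.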


















Graph Application

PageRank

Shortest Path

Bipartite Matching

Semi Cluster









Pregel: Main Result


















Reference (partial)

J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. KDD, 2005.
L. B. Holder, D. J. Cook, and S. Djoko. Substructure Discovery in the SUBDUE System. In PWKDD, 1994.
Y. Tian, R. A. Hankins, and J. M. Patel. Efficient Aggregation for Graph Summarization. SIGMOD, 2008.
S. Navlakha, R. Rastogi, and N. Shrivastava. Graph Summarization with Bounded Error. SIGMOD, 2008.
C. Chen, C. X. Lin, M. Fredrikson, M. Christodorescu, X. Yan, and J. Han. Mining graph patterns efficiently via randomized summaries. VLDB, 2009.
D. Bortner and J. Han. Progressive Clustering of Networks using Structure-Connected Order of Traversal. ICDE, 2010.
U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A Peta-Scale Graph Mining System. ICDM.
K. Yoshida, H. Motoda, and N. Indurkhya. Graph based induction as a unified learning framework. Applied Intelligence, volume 4, 1994.
A. Inokuchi, T. Washio, and H. Motoda. Complete mining of frequent patterns from graphs: Mining graph data. Mach. Learn., 50(3):321–354, 2003.









Reference (cont.)





M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM, pages 313–320, 2001.
X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. ICDM, 2002.
L. Dehaspe and H. Toivonen. WARMR: Discovery of frequent datalog patterns. Data Mining and Knowledge Discovery, 3(1):7–36, 1999.
S. Nijssen and J. Kok. FARMAR: Fast association rules for multiple relations. Data Mining and Knowledge Discovery, 3(7–36), 1999.


















Roadmap

Part I (1.5 hrs)
Graph Mining Primer
Recent advances in Massive Graph Mining
Part II (1.5 hrs)
CSV: cohesive subgraph mining
DNgraph mining: a triangle based approach









CSV



1. Cohesive sub-graph mining, with visualization

2. Existing approaches

3. CSV provides effective visual solution

– Algorithm principle

– Connectivity Estimation

4. Experimental Study


















Existing solutions



1. Current state of the art for abstracting information from huge graphs (information: yes; structure: no)
   1. Graph partition algorithms
      Spectral clustering [Ng01]: high computational cost
      METIS [Karypis96]: favors balanced patterns
   2. Graph pattern mining algorithms
      CODENSE [Hu05], CLAN [Zeng06]: exponential running time
2. Graph layout tools (information: no; structure: yes)
      Osprey [Breitkreutz03], VisANT [Mellor04]: no mining capability

We want structured information.









CSV: General Approach



• Separate the vertices in the graph into VISITED and UNVISITED
• Start: pick a vertex and add it to VISITED
• Repeat until UNVISITED is empty
–Among all vertices in UNVISITED, pick the vertex V most highly connected to VISITED
–Plot V’s connectivity
–Add V to VISITED
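The loop above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the "connectivity to VISITED" is assumed to be the best pairwise score against any visited neighbor, and the pairwise score used here (common neighbors plus two) is only a crude stand-in for the clique-based measure, since a clique containing the edge (u, v) can contain at most u, v, and their common neighbors.

```python
def csv_order(adj, start, connectivity):
    """Return [(vertex, plotted connectivity), ...] in visiting order."""
    visited = [start]
    unvisited = set(adj) - {start}
    plot = []
    while unvisited:
        def score(v):
            # Assumed: connectivity to VISITED = best score against a visited neighbor.
            return max((connectivity(u, v) for u in visited if u in adj[v]), default=0)
        best = max(sorted(unvisited), key=score)   # sorted() makes ties deterministic
        plot.append((best, score(best)))
        visited.append(best)
        unvisited.remove(best)
    return plot

# Hypothetical example: a triangle a-b-c with a pendant vertex d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

def crude_connectivity(u, v):
    """Stand-in score: upper bound on the size of a clique containing edge (u, v)."""
    return len(adj[u] & adj[v]) + 2

plot = csv_order(adj, "a", crude_connectivity)
```

In the resulting plot the triangle's vertices come first with the higher value, and the pendant vertex follows with a lower one, which is the peak-then-drop shape the CSV plot relies on.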









But how do we measure connectivity?


















Connectivity measurement



Connectivity measurement is closely related to clique (fully connected sub-graph) size.

The connectivity between two vertices in a graph (ηmax) is defined to be the size of the biggest clique in the graph such that both are members of the clique.

The “connectivity” of a vertex (ζmax) is similarly defined as the size of the biggest clique it can participate in.

[Figure: two example graphs on vertices a, b, c, d, e, annotated with ηmax(a, d) = 0, ηmax(a, c) = 4, ζmax(a) = 5]
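For small graphs both measures can be computed by brute force, which makes the definitions concrete. The sketch below is exponential in the worst case and is meant only as an illustration; the function names and the two example graphs are assumptions (the slide's figures are not recoverable), chosen so that they reproduce the values quoted above.

```python
from itertools import combinations

def is_clique(adj, nodes):
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def max_clique_containing(adj, required):
    """Size of the largest clique containing all `required` vertices (0 if none)."""
    required = set(required)
    if not is_clique(adj, required):
        return 0
    # Only vertices adjacent to every required vertex can extend the clique.
    candidates = set(adj) - required
    for r in required:
        candidates &= adj[r]
    for size in range(len(candidates), 0, -1):
        for extra in combinations(sorted(candidates), size):
            if is_clique(adj, extra):
                return len(required) + size
    return len(required)

def eta_max(adj, u, v):
    return max_clique_containing(adj, [u, v])

def zeta_max(adj, v):
    return max_clique_containing(adj, [v])

# Hypothetical graph 1: a, b, c, e form a 4-clique; d is attached to c and e only.
g1 = {"a": {"b", "c", "e"}, "b": {"a", "c", "e"}, "c": {"a", "b", "d", "e"},
      "d": {"c", "e"}, "e": {"a", "b", "c", "d"}}
# Hypothetical graph 2: the complete graph on a..e.
g2 = {v: set("abcde") - {v} for v in "abcde"}
```

With these graphs, `eta_max(g1, "a", "d")` is 0 because a and d are not adjacent, `eta_max(g1, "a", "c")` is 4, and `zeta_max(g2, "a")` is 5.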









CSV: Step by Step



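From graph to plot

[Figure: example graph on vertices A–J (left) and the CSV plot of connectivity (1–4) against visited vertices (right). The heap holds B; A has been plotted. Vertices are color-coded as unvisited, neighbors, visiting, or visited.]

Start from A, explore A’s neighbor B.
Calculate ζmax(A)=2 and output it.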


















CSV algorithm on a synthetic graph



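From graph to plot

[Figure: example graph on vertices A–J (left) and the CSV plot of connectivity (1–4) against visited vertices (right). The heap holds C, F, and H; A and B have been plotted. Vertices are color-coded as unvisited, neighbors, visiting, or visited.]

Mark A visited; from B, explore B’s immediate neighbors C, F, and H.
Calculate ηmax(AB)=2 and output it.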









CSV algorithm on a synthetic graph



From graph to plot

[Figure: example graph on vertices A–J (left), the heap contents, and the CSV plot of connectivity (1–4) against visited vertices (right); A, B, and C have been plotted. Vertices are color-coded as unvisited, neighbors, visiting, or visited.]

Mark B visited; choose the closely connected C as the next visiting vertex. From C, explore C’s immediate neighbors D, F, G, and H, and update ηmax when necessary.
Calculate ηmax(BC)=4 and output it.


















CSV algorithm on a synthetic graph





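From graph to plot

[Figure: example graph on vertices A–J (left) and the completed CSV plot of connectivity (1–4) against vertices in visiting order A, B, C, H, F, G, D, E, I, J (right), with the cohesive sub-graph marked at the peak. Vertices are color-coded as unvisited, neighbors, visiting, or visited.]

Visit every vertex accordingly to produce a plot.
Peaks represent cohesive sub-graphs.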









Important Theorem


















Connectivity computation is a hard

problem



For very large graphs, exact computation of connectivity is prohibitively expensive.

Direct computation is costly.









Connectivity computation is

prohibitive



•The exact algorithm relies on clique detection (NP-hard).
•Even approximation is hard.
•Solution Part 1: Spatial Mapping
•Pick k pivots
•Map the graph into k-dimensional space based on each vertex’s shortest distance to the pivots
•A clique will map into the same grid cell.

[Figure: the example graph on vertices A–J, and its mapping onto a 2-dimensional grid with axes P0 and P1, each running from 0 to 3]
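The spatial mapping step can be sketched with one breadth-first search per pivot. The function names and the example graph are assumptions; how the resulting coordinates are bucketed into grid cells is not specified on the slide, so the sketch stops at the coordinates. Because the members of a clique are mutually adjacent, their distances to any pivot differ by at most 1, which is what keeps a clique together in the mapped space.

```python
from collections import deque

def bfs_distances(adj, source):
    """Unweighted shortest-path distances from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def map_to_pivot_space(adj, pivots):
    """Map each vertex to the tuple of its distances to the k pivots."""
    tables = [bfs_distances(adj, p) for p in pivots]
    return {v: tuple(t.get(v, float("inf")) for t in tables) for v in adj}

# Hypothetical example: a path a - b plus a triangle b, c, d.
adj = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b", "d"}, "d": {"b", "c"}}
coords = map_to_pivot_space(adj, pivots=["a", "d"])
```

Here the triangle's vertices b, c, and d land on coordinates that differ by at most 1 in each dimension.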


















Connectivity computation





•Solution Part 2: Approximate Upper Bound for ζmax(v) and ηmax(v, v’)
•Each vertex in a clique of size k must have
•degree ≥ k-1
•at least k-1 neighbors with degree ≥ k-1
•For each vertex v, find its immediate neighbors in the same grid cell and construct a sub-graph and its degree array
•Iteratively readjust the estimate of the clique size

Example: let us estimate ηmax(a, f). Locate the immediate neighborhood of a and f, {a, b, c, d, e, f, g}. After sorting the degree array in descending order, we have 6(a), 6(f), 5(d), 4(b), 4(c), 4(e), 3(g). Candidate clique sizes are then tested in turn: =5? =6? =7?
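The degree test can be written down directly: a clique of size k needs k vertices of degree at least k-1, so the largest k that passes this test is an upper bound on the clique size. The sketch below is one way to implement that check, starting from the largest candidate and lowering it; the function name is hypothetical, and the neighbor-degree condition is left out for brevity.

```python
def clique_size_upper_bound(degrees):
    """Largest k such that at least k vertices have degree >= k - 1."""
    ordered = sorted(degrees, reverse=True)
    k = len(ordered)
    # Lower the estimate until the k-th largest degree is at least k - 1.
    while k > 0 and ordered[k - 1] < k - 1:
        k -= 1
    return k

# Degree array from the example: 6(a), 6(f), 5(d), 4(b), 4(c), 4(e), 3(g).
bound = clique_size_upper_bound([6, 6, 5, 4, 4, 4, 3])
```

For the example array, size 7 fails (the seventh degree is 3), size 6 fails (the sixth degree is 4), and size 5 passes, so the estimate is 5.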









Experimental study on real datasets

DBLP: co-authorship graph.
DBLP: 2819 vertices, 54990 edges

[Figure: DBLP CSV plot, with two peaks labeled as two groups of German researchers]

Peaks in the DBLP CSV plot represent different research groups


















SMD: Stock Market Data

[Figure: SMD CSV plot, with a bridging vertex and two partial cliques labeled]

Peaks in the SMD CSV plot represent highly cohesive stocks









DIP: Database of interacting proteins

[Figure: DIP CSV plot. Labeled vertices, by peak:]
Plotted value 8: SMD3, PRP4, PRP8, PRP6, LUC7, SMX2, SNP1, STO1, NAM8, SNU71, PRP31, YHC1, PRP40, MUD1, SNU56
Plotted value 9: LSM8, LSM2, DCP1, LSM6, LSM3, LSM4, PAT1, LSM7, LSM5
Plotted value 10 (PFS2: 9): PFS2, RNA14, FIP1, REF2, CFT1, CFT2, MPE1, GLC7, PAP1, PTA1, YSH1, YTH1, PTI1

Structure of a nucleotide-bound Clp1-Pcf11 polyadenylation factor. Christian G. Noble, Barbara Beuth, and Ian A. Taylor. Nucleic Acids Res. 2007 January; 35(1): 87–99.

“CPF is also required in both the cleavage and polyadenylation reactions. It contains a core of eight subunits Cft1, Cft2, Ysh1, Pta1 Mpe1, Pfs2, Fip1 and Yth1”


















Experimental Study



CSV as a pre-selection step
How?
•Apply CSV first to identify potential cohesive sub-graphs.
•Run the exact algorithm CLAN on these candidates.
Result
•Produces the same exact cohesive sub-graphs as running CLAN alone.
•Saves 28-84% of the time compared to running CLAN alone.

[Figure: CSV as a pre-selection method]









DNgraph mining: A triangle based approach



• Mining dense patterns out of an extremely large graph
•When the graph is extremely large, even mining dense patterns is difficult.
• An iterative improvement mining approach is more desirable
•Users are able to obtain the most up-to-date results on demand.
• Dense patterns have a strong connection with the triangles inside a graph.
• This has already been observed, and is explained by the preferential attachment property of large-scale graphs.


















DNgraph mining: A triangle based approach



• What makes a pattern dense? Intuitively:
•A collection of vertices with high relevance.
•They share a large number of common neighbors.
• With that we propose the definition of DNgraph
•A DNgraph is the largest sub-graph sharing the most neighbors.
•Require each connected vertex pair to share at least λ neighbors.

[Figure: example graph G on vertices A, B, C, D, E, F and an extra vertex A’, with λ(G) = 3 and λ(GA’) = 0]
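The λ of a sub-graph can be sketched as the minimum number of common neighbors over its connected vertex pairs. This is an illustration under assumptions: common neighbors are counted inside the sub-graph only, the function name is hypothetical, and the example (a complete graph on five vertices, then the same graph with a pendant vertex) is not the slide's figure.

```python
def graph_lambda(adj, vertices):
    """Minimum number of common neighbors over all edges inside `vertices`."""
    vs = set(vertices)
    counts = [len(adj[u] & adj[v] & vs)
              for u in vs for v in adj[u] & vs if u < v]   # each edge once
    return min(counts) if counts else 0

# Hypothetical example: K5 on A..E, plus a pendant vertex A' attached to A.
adj = {v: set("ABCDE") - {v} for v in "ABCDE"}
adj["A"].add("A'")
adj["A'"] = {"A"}

lam_core = graph_lambda(adj, "ABCDE")                  # every pair shares 3 neighbors
lam_with_pendant = graph_lambda(adj, list("ABCDE") + ["A'"])
```

Adding the pendant vertex drops λ from 3 to 0, because the edge between A and A’ has no common neighbor; a single weakly attached vertex is enough to destroy the density of the pattern.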









Compare DNgraph with other dense pattern definitions
• Two interesting patterns
• A 4-clique and a Turan graph T(14, 4) [14 vertices, 4 groups, fully connected between groups]
• If mining quasi-cliques, we may end up discovering only 1 pattern, as in (d)
• If searching for closed cliques, we may only find (e)


















DNgraph mining: challenge



• Finding the common neighbors of every connected vertex pair is expensive
•Requires O(E) join operations.
•Needs random disk access.
•In fact, finding a DN-graph is an NP-problem.
• Solution
•Use the triangles that two connected vertices participate in to approximate the number of common neighbors.
•Iteratively refine the approximation, following the locality of the graph’s edges.









DNgraph mining: How



1. Initially: count the # of triangles each edge participates in.
•Sort the vertices and their neighbors in descending order of their degrees
•Scan the graph to get the # of triangles for every vertex.
•The # of triangles sets the initial value of λ.
2. Next, iteratively refine λ for every vertex
•Using streams of triangles.
•Iteratively refine λcur.


















Triangle Counting: how?

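1. Sort vertices and their neighbors in descending order of their degrees

[Figure: example graph on vertices a–h]

Adjacency lists (vertex: neighbors)
Before sorting    After sorting
a: bde            e: dbacgf
b: acde           d: eacgh
c: bde            b: edac
d: acegh          a: edb
e: abcdgf         c: edb
f: eg             g: edf
g: def            f: eg
h: d              h: d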









Triangle counting (cont.)

1. Sort vertices and their neighbors in descending order of their degrees
2. Join the neighborhoods to get the triangle count for every edge
• The two vertices exhibit locality, due to reordering and the preferential attachment property of large graphs

[Figure: example graph on vertices a–h, with the value 3 marked on two of the edges, next to the sorted adjacency lists]
e: dbacgf
d: eacgh
b: edac
a: edb
c: edb
g: edf
f: eg
h: d


















Triangle counting (cont.)

1. Sort vertices and their neighbors in descending order of their degrees
2. Join the neighborhoods to get the triangle count for every edge
3. Use that as the initial λ value for every edge/vertex
• A vertex’s λ value is the maximal λ value among the edges it participates in
•λcur(e) = 3

[Figure: example graph on vertices a–h]

vertex   λcur
e        3
d        3
…        …
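The initialization can be sketched in memory as follows. This leaves out the degree-based sorting, which matters for disk locality rather than for the result, and the function names are hypothetical. The edge list is an assumption loosely based on the slide's example graph; it omits a b–d edge so that the counts agree with the λcur values shown above.

```python
def build_adj(edges):
    """Undirected adjacency sets from a list of two-letter edge strings."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def edge_triangle_counts(adj):
    """# of triangles per edge = # of common neighbors of its endpoints."""
    return {(u, v): len(adj[u] & adj[v])
            for u in adj for v in adj[u] if u < v}

def initial_vertex_lambda(adj):
    """Initial λcur of a vertex = maximal triangle count among its edges."""
    lam = {v: 0 for v in adj}
    for (u, v), count in edge_triangle_counts(adj).items():
        lam[u] = max(lam[u], count)
        lam[v] = max(lam[v], count)
    return lam

# Assumed edge list for the example graph on vertices a..h.
adj = build_adj("ab ad ae bc be cd ce de dg dh ef eg fg".split())
lam = initial_vertex_lambda(adj)
```

With this edge list the edge (d, e) lies in three triangles (through a, c, and g), so both d and e start with λcur = 3, while h, which is in no triangle, starts at 0.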









DNgraph mining: How (cont.)



• Initially: count the # of triangles each edge participates in.
• Next, iteratively refine λ for every vertex
•Using streams of triangles.
•Iteratively refine λcur.


















Triangle stream



•Follow the same graph visiting order as in triangle counting
•Triangles are not materialized, saving storage

[Figure: an edge (a, b) and its common neighbors n1, n2, …, nx, streamed one triangle at a time; the vertices are annotated with lambda=k]









Iteratively refine λ



•Follow the same graph visiting order as in triangle counting
•Triangles are not materialized, saving storage
•For every vertex v, when its triangles arrive, bound λcur(v) using the λcur of the two other vertices


















Iteratively refine λ (cont.)

• Initially: count the # of triangles each edge participates in.
• Next, iteratively refine λ for every vertex
•Using streams of triangles.
•Iteratively refine λcur.
• Until the λcur of all vertices has converged

[Figure: example graph on vertices a–h, with the value 3 marked on two of the edges]

vertex   λcur
e        3
b        3
…        …









DNgraph: Experiment

•Large-scale graph
•Flickr dataset with 1,715,255 vertices and 22,613,982 edges.
•One iteration takes 1 hour on a workstation with a Quad-Core AMD Opteron(tm) processor 8356, 128GB RAM, and a 700GB hard disk.
•Converges in 66 iterations; almost stable after 35 iterations


















Advantage



• Abstraction

The abstraction lies within the triangulation algorithm and ensures our approach’s extensibility to different input settings.
• Iteratively refine results
• The estimate of the common neighborhood improves with every iteration, so users are able to obtain the most up-to-date results on demand.
• Pre-collection of statistics to support effective buffer management
• The process can be easily mapped to key->value pairs for further distributed processing.









Reference (partial)

[Hu05] H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou. Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics, 21(1):213–221, 2005.
[Ng01] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, volume 14, 2001.
[Karypis96] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (CDROM), page 35, Washington, DC, USA, 1996. IEEE Computer Society.
[Breitkreutz03] B. J. Breitkreutz, C. Stark, and M. Tyers. Osprey: a network visualization system. Genome Biology, 4, 2003.
[Mellor04] Z. Hu, J. Mellor, J. Wu, and C. DeLisi. VisANT: an online visualization and analysis tool for biological interaction data. BMC Bioinformatics, 5:17–24, 2004.
[Zeng06] J. Wang, Z. Zeng, and L. Zhou. CLAN: An algorithm for mining closed cliques from large dense graph databases. Proceedings of the International Conference on Data Engineering, page 73, 2006.
[Turan41] P. Turan. On an extremal problem in graph theory. Mat. Fiz. Lapok, 48:436–452, 1941.
[Ankerst99] M. Ankerst, M. Breunig, H. P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 49–60, Philadelphia, PA, June 1999.
[DNgraph10] On Triangle based DNgraph Mining. NUS technical report TRB4/10.








