Multi-dimensional Sequential Pattern Mining
Helen Pinto, Jiawei Han, Jian Pei, Ke Wang, Qiming Chen, Umeshwar Dayal

Outline


   

Why multidimensional sequential pattern mining?
Problem definition
Algorithms
Experimental results
Conclusions

Why Sequential Pattern Mining?




Sequential pattern mining: finding time-related frequent patterns (frequent subsequences)
Many data and applications are time-related
  Customer shopping patterns, telephone calling patterns
    E.g., first buy a computer, then CD-ROMs and software, within 3 months
  Natural disasters (e.g., earthquake, hurricane)
  Disease and treatment
  Stock market fluctuation
  Weblog click stream analysis
  DNA sequence analysis

Motivating Example


Sequential patterns are useful
  "free internet access → buy package 1 → upgrade to package 2"
  Marketing, product design & development
Problems: lack of focus
  Various groups of customers may have different patterns
MD-sequential pattern mining: integrate multidimensional analysis and sequential pattern mining

Sequences and Patterns


Given a set of sequences, find the complete set of frequent subsequences

A sequence database

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

A sequence: <(ef)(ab)(df)cb>
Elements: items within an element are listed alphabetically
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
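
The subsequence relation used on this slide can be checked mechanically. The sketch below is one possible illustration, not code from the talk: `parse` and `contains` are hypothetical helper names, and items are assumed to be single lowercase letters as in the example database.

```python
import re

def parse(text):
    """Turn '<a(abc)(ac)d(cf)>' into a list of item sets, one per element.
    Assumes every item is a single lowercase letter."""
    return [frozenset(group or single)
            for group, single in re.findall(r"\(([a-z]+)\)|([a-z])", text)]

def contains(sequence, pattern):
    """True if every pattern element is a subset of a distinct sequence
    element, in order (greedy left-to-right matching)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

seq10 = parse("<a(abc)(ac)d(cf)>")
seq20 = parse("<(ad)c(bc)(ae)>")
is_sub = contains(seq10, parse("<a(bc)dc>"))      # the slide's example
not_sub = contains(seq20, parse("<(ab)c>"))       # no element of 20 holds both a and b
print(is_sub, not_sub)
```

Greedy matching is sufficient here because taking the earliest matching element never rules out a later match.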

Sequential Pattern: Basics
A sequence database

Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>

A sequence: <(bd)cb(ac)>, with elements (bd), c, b, (ac)
<ad(ae)> is a subsequence of <a(bd)bcb(ade)>
Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern
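
Support is the number of database sequences that contain a pattern. As a minimal sketch under the same assumptions as the slide (single-letter items; `parse`, `contains`, and `support` are illustrative names, not from the talk), the count for <(bd)cb> can be reproduced like this:

```python
import re

def parse(text):
    """'<(bd)cb(ac)>' -> list of item sets; items are single lowercase letters."""
    return [frozenset(group or single)
            for group, single in re.findall(r"\(([a-z]+)\)|([a-z])", text)]

def contains(sequence, pattern):
    """Ordered subset matching of pattern elements against sequence elements."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

DB = {
    10: "<(bd)cb(ac)>",
    20: "<(bf)(ce)b(fg)>",
    30: "<(ah)(bf)abf>",
    40: "<(be)(ce)d>",
    50: "<a(bd)bcb(ade)>",
}

def support(db, pattern):
    """Number of sequences in db that contain pattern."""
    target = parse(pattern)
    return sum(contains(parse(seq), target) for seq in db.values())

sup = support(DB, "<(bd)cb>")
print(sup, sup >= 2)   # sequences 10 and 50 contain the pattern
```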

MD Sequence Database




cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>

P = (*,Chicago,*,<bf>) matches tuples 20 and 30
If min_sup = 2, P is an MD sequential pattern
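
Matching an MD sequential pattern against a tuple combines a wildcard test on the dimension values with a subsequence test on the sequence. The following sketch is illustrative only; the tuple layout and the names `matches`, `parse`, and `contains` are assumptions made for this example.

```python
import re

def parse(text):
    """'<(bf)(ce)(fg)>' -> list of item sets; items are single lowercase letters."""
    return [frozenset(group or single)
            for group, single in re.findall(r"\(([a-z]+)\)|([a-z])", text)]

def contains(sequence, pattern):
    """Ordered subset matching of pattern elements against sequence elements."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

# (cid, (Cust_grp, City, Age_grp), sequence)
ROWS = [
    (10, ("Business", "Boston", "Middle"), "<(bd)cba>"),
    (20, ("Professional", "Chicago", "Young"), "<(bf)(ce)(fg)>"),
    (30, ("Business", "Chicago", "Middle"), "<(ah)abf>"),
    (40, ("Education", "New York", "Retired"), "<(be)(ce)>"),
]

def matches(md_pattern, dims, seq):
    """'*' matches any dimension value; the last component is a sequence pattern."""
    *dim_pattern, seq_pattern = md_pattern
    dims_ok = all(p == "*" or p == v for p, v in zip(dim_pattern, dims))
    return dims_ok and contains(parse(seq), parse(seq_pattern))

P = ("*", "Chicago", "*", "<bf>")
matched = [cid for cid, dims, seq in ROWS if matches(P, dims, seq)]
print(matched)   # tuples 20 and 30, so the support of P is 2
```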

Mining of MD Seq. Pat.


Embedding MD information into sequences
  Using a uniform seq. pat. mining method
Integration of seq. pat. mining and MD analysis method

UNISEQ

Embed MD information into sequences

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>

Mine the extended sequence database using sequential pattern mining methods

cid  MD-extension of sequences
10   <(Business,Boston,Middle)(bd)cba>
20   <(Professional,Chicago,Young)(bf)(ce)(fg)>
30   <(Business,Chicago,Middle)(ah)abf>
40   <(Education,New York,Retired)(be)(ce)>
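
The embedding step can be sketched as prepending one extra element that holds the dimension values, after which any ordinary subsequence test applies. This is an illustration under assumptions, not the authors' implementation: the `name=value` tagging of dimension items is added here to avoid clashes between dimensions, and `md_extend`, `parse`, and `contains` are hypothetical names.

```python
import re

DIM_NAMES = ("Cust_grp", "City", "Age_grp")

def parse(text):
    """'<(bd)cba>' -> list of item sets; items are single lowercase letters."""
    return [frozenset(group or single)
            for group, single in re.findall(r"\(([a-z]+)\)|([a-z])", text)]

def contains(sequence, pattern):
    """Ordered subset matching of pattern elements against sequence elements."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def md_extend(dims, seq):
    """Prepend the dimension values as one element, tagged 'name=value'."""
    tagged = frozenset(f"{n}={v}" for n, v in zip(DIM_NAMES, dims))
    return [tagged] + parse(seq)

ROWS = {
    10: (("Business", "Boston", "Middle"), "<(bd)cba>"),
    20: (("Professional", "Chicago", "Young"), "<(bf)(ce)(fg)>"),
    30: (("Business", "Chicago", "Middle"), "<(ah)abf>"),
    40: (("Education", "New York", "Retired"), "<(be)(ce)>"),
}
extended = {cid: md_extend(dims, seq) for cid, (dims, seq) in ROWS.items()}

# (*,Chicago,*,<bf>) becomes the ordinary pattern <(City=Chicago) b f>.
pattern = [frozenset({"City=Chicago"})] + parse("<bf>")
sup = sum(contains(seq, pattern) for seq in extended.values())
print(sup)
```

Because the dimension element is just another element, a single sequential pattern miner can find MD patterns without any special handling.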

Mine Sequential Patterns by Prefix Projections


Step 1: find length-1 sequential patterns
  <a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
  The ones having prefix <a>;
  The ones having prefix <b>;
  …
  The ones having prefix <f>

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
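
Step 1 amounts to counting, for each item, the number of sequences it occurs in. A minimal sketch, assuming min_sup = 2 as on the earlier slide and single-letter items (`parse` is a hypothetical helper):

```python
import re
from collections import Counter

def parse(text):
    """'<eg(af)cbc>' -> list of item sets; items are single lowercase letters."""
    return [frozenset(group or single)
            for group, single in re.findall(r"\(([a-z]+)\)|([a-z])", text)]

SDB = ["<a(abc)(ac)d(cf)>", "<(ad)c(bc)(ae)>", "<(ef)(ab)(df)cb>", "<eg(af)cbc>"]
MIN_SUP = 2

# Count each item once per sequence, however often it repeats inside it.
counts = Counter(item for seq in SDB for item in set().union(*parse(seq)))
length1 = sorted(item for item, n in counts.items() if n >= MIN_SUP)
print(length1)   # g occurs only in sequence 40, so it is not frequent
```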

Find Seq. Patterns with Prefix <a>




Only need to consider projections w.r.t. <a>
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  Further partition into 6 subsets
    Having prefix <aa>;
    …
    Having prefix <af>

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
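
The <a>-projected database can be reproduced with a few lines. The sketch below handles only a single-item prefix and uses illustrative helper names (`parse`, `project`, `show`); the leftover items of the element that contained the prefix are printed with the slide's `_` notation.

```python
import re

def parse(text):
    """'<(ad)c(bc)(ae)>' -> list of sorted item tuples; single-letter items."""
    return [tuple(sorted(group or single))
            for group, single in re.findall(r"\(([a-z]+)\)|([a-z])", text)]

def project(sequence, item):
    """Suffix after the first occurrence of item, or None if item is absent.
    Returns (items left over in the matching element, following elements)."""
    for i, element in enumerate(sequence):
        if item in element:
            leftover = tuple(x for x in element if x > item)
            return leftover, sequence[i + 1:]
    return None

def show(projected):
    """Render a projected suffix in the slide's notation, e.g. '<(_d)c(bc)(ae)>'."""
    leftover, rest = projected
    parts = ["(_" + "".join(leftover) + ")"] if leftover else []
    parts += [e[0] if len(e) == 1 else "(" + "".join(e) + ")" for e in rest]
    return "<" + "".join(parts) + ">"

SDB = ["<a(abc)(ac)d(cf)>", "<(ad)c(bc)(ae)>", "<(ef)(ab)(df)cb>", "<eg(af)cbc>"]
a_projected = [show(p) for p in (project(parse(s), "a") for s in SDB) if p is not None]
print(a_projected)
```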

Completeness of PrefixSpan
SDB

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  Having prefix <a>
    <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
    Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      Having prefix <aa>: <aa>-proj. db, …
      …
      Having prefix <af>: <af>-proj. db, …
  Having prefix <b>
    <b>-projected database
    …
  Having prefix <c>, …, <f>
    …
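
The divide-and-conquer structure above can be illustrated with a small pattern-growth miner. This is a simplified sketch, not PrefixSpan itself: for brevity it recounts support against the full database instead of building projected databases, and the names `mine`, `grow`, `parse`, `show`, and `contains` are hypothetical. It follows the same partitioning by prefix, growing each frequent prefix either by a new element or by adding an item to its last element.

```python
import re

def parse(text):
    """'<a(abc)d>' -> list of sorted item tuples; items are single letters."""
    return [tuple(sorted(group or single))
            for group, single in re.findall(r"\(([a-z]+)\)|([a-z])", text)]

def show(pattern):
    return "<" + "".join(e[0] if len(e) == 1 else "(" + "".join(e) + ")"
                         for e in pattern) + ">"

def contains(sequence, pattern):
    """Ordered subset matching of pattern elements against sequence elements."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def mine(db, min_sup):
    """Return {pattern string: support} for all frequent sequential patterns."""
    sequences = [parse(s) for s in db]
    items = sorted({x for seq in sequences for e in seq for x in e})
    found = {}

    def grow(prefix):
        for x in items:
            candidates = [prefix + [(x,)]]           # x starts a new element
            if prefix and x > prefix[-1][-1]:        # x joins the last element
                candidates.append(prefix[:-1] + [prefix[-1] + (x,)])
            for cand in candidates:
                sup = sum(contains(s, cand) for s in sequences)
                if sup >= min_sup:                   # only frequent prefixes are grown
                    found[show(cand)] = sup
                    grow(cand)

    grow([])
    return found

SDB = ["<a(abc)(ac)d(cf)>", "<(ad)c(bc)(ae)>", "<(ef)(ab)(df)cb>", "<eg(af)cbc>"]
patterns = mine(SDB, min_sup=2)
length2_a = {p for p in patterns
             if sum(c.isalpha() for c in p) == 2 and p.lstrip("<(")[0] == "a"}
print(sorted(length2_a))
```

Every frequent pattern is reached exactly once because each pattern has a unique prefix one item shorter, which is the completeness argument the slide illustrates.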

Efficiency of PrefixSpan
  

No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected databases
  Can be improved by bi-level projections

Mining MD-Patterns
MD pattern: (*,Chicago,*)

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>

BUC processing over the cuboid lattice:
  All
  (cust-grp,*,*)   (*,city,*)   (*,*,age-grp)
  (cust-grp,city,*)   (cust-grp,*,age-grp)
  (cust-grp,city,age-grp)
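
To make the notion of a frequent MD-pattern concrete, the sketch below enumerates every generalization of every tuple and counts them. This is deliberately not BUC: BUC partitions the data recursively and prunes infrequent partitions, whereas this brute-force version is only practical for a handful of dimensions. The function names and min_sup = 2 are assumptions for the example.

```python
from collections import Counter
from itertools import product

# (Cust_grp, City, Age_grp) for cids 10, 20, 30, 40
ROWS = [
    ("Business", "Boston", "Middle"),
    ("Professional", "Chicago", "Young"),
    ("Business", "Chicago", "Middle"),
    ("Education", "New York", "Retired"),
]

def generalizations(row):
    """All patterns obtained by replacing any subset of values with '*'."""
    return product(*[(value, "*") for value in row])

def frequent_md_patterns(rows, min_sup):
    """Count each generalization once per row and keep the frequent ones."""
    counts = Counter(p for row in rows for p in generalizations(row))
    return {p: n for p, n in counts.items() if n >= min_sup}

md_patterns = frequent_md_patterns(ROWS, min_sup=2)
for pattern, sup in sorted(md_patterns.items()):
    print(pattern, sup)
```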

Dim-Seq


First find MD-patterns
  E.g. (*,Chicago,*)
Form projected sequence database
  <(bf)(ce)(fg)> and <(ah)abf> for (*,Chicago,*)
Find seq. pat. in projected database
  E.g. (*,Chicago,*,<bf>)

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>
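
The Dim-Seq order of operations can be traced on the example for the single MD-pattern (*,Chicago,*). The sketch below is illustrative only: it takes the MD-pattern as given, forms the projected sequence database, and then checks the support of <bf> there rather than running a full sequential pattern miner. All helper names are hypothetical.

```python
import re

def parse(text):
    """'<(ah)abf>' -> list of item sets; items are single lowercase letters."""
    return [frozenset(group or single)
            for group, single in re.findall(r"\(([a-z]+)\)|([a-z])", text)]

def contains(sequence, pattern):
    """Ordered subset matching of pattern elements against sequence elements."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

ROWS = [
    (("Business", "Boston", "Middle"), "<(bd)cba>"),
    (("Professional", "Chicago", "Young"), "<(bf)(ce)(fg)>"),
    (("Business", "Chicago", "Middle"), "<(ah)abf>"),
    (("Education", "New York", "Retired"), "<(be)(ce)>"),
]

def dim_match(md_pattern, dims):
    return all(p == "*" or p == v for p, v in zip(md_pattern, dims))

# Step 1 (assumed already done): (*,Chicago,*) is a frequent MD-pattern.
md_pattern = ("*", "Chicago", "*")
# Step 2: keep only the sequences of tuples that match the MD-pattern.
projected_seqs = [seq for dims, seq in ROWS if dim_match(md_pattern, dims)]
# Step 3: look for sequential patterns inside that projected database.
sup_bf = sum(contains(parse(s), parse("<bf>")) for s in projected_seqs)
print(projected_seqs, sup_bf)
```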

Seq-Dim


Find sequential patterns
  E.g. <bf>
Form projected MD-database
  E.g. (Professional,Chicago,Young) and (Business,Chicago,Middle) for <bf>
Mine MD-patterns
  E.g. (*,Chicago,*,<bf>)

cid  Cust_grp      City      Age_grp  sequence
10   Business      Boston    Middle   <(bd)cba>
20   Professional  Chicago   Young    <(bf)(ce)(fg)>
30   Business      Chicago   Middle   <(ah)abf>
40   Education     New York  Retired  <(be)(ce)>
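
Seq-Dim reverses the order: start from a sequential pattern, collect the dimension values of the tuples that contain it, and mine MD-patterns from those. The sketch below traces this for <bf>, taking the sequential pattern as given and using brute-force enumeration of generalizations in place of a real MD-pattern miner. Names and min_sup = 2 are assumptions for the example.

```python
import re
from collections import Counter
from itertools import product

def parse(text):
    """'<(bf)(ce)(fg)>' -> list of item sets; items are single lowercase letters."""
    return [frozenset(group or single)
            for group, single in re.findall(r"\(([a-z]+)\)|([a-z])", text)]

def contains(sequence, pattern):
    """Ordered subset matching of pattern elements against sequence elements."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

ROWS = [
    (("Business", "Boston", "Middle"), "<(bd)cba>"),
    (("Professional", "Chicago", "Young"), "<(bf)(ce)(fg)>"),
    (("Business", "Chicago", "Middle"), "<(ah)abf>"),
    (("Education", "New York", "Retired"), "<(be)(ce)>"),
]

# Step 1 (assumed already done): <bf> is a frequent sequential pattern.
seq_pattern = parse("<bf>")
# Step 2: projected MD-database = dimension values of tuples containing <bf>.
projected_md = [dims for dims, seq in ROWS if contains(parse(seq), seq_pattern)]
# Step 3: mine MD-patterns there by counting all generalizations of each tuple.
counts = Counter(p for dims in projected_md
                 for p in product(*[(v, "*") for v in dims]))
md_for_bf = {p: n for p, n in counts.items() if n >= 2}
print(projected_md, md_for_bf)
```

Only (*,Chicago,*) and the all-wildcard pattern survive, which gives the MD sequential pattern (*,Chicago,*,<bf>) from the slide.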

Scalability Over Dimensionality

Scalability Over Cardinality

Scalability Over Support Threshold

Scalability Over Database Size

Pros & Cons of Algorithms


Seq-Dim is efficient and scalable
  Fastest in most cases
UniSeq is also efficient and scalable
  Fastest with low dimensionality
Dim-Seq has poor scalability

Conclusions




MD seq. pat. mining is interesting and useful
Mining MD seq. pat. efficiently
  UniSeq, Dim-Seq, and Seq-Dim
Future work
  Applications of sequential pattern mining

References (1)
 









R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, pages 487-499.
R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, pages 3-14.
C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32-38, 1998.
M. Garofalakis, R. Rastogi, and K. Shim. Spirit: Sequential pattern mining with regular expression constraints. VLDB'99, pages 223-234.
J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, pages 106-115.
J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.

References (2)
 









J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, pages 1-12.
H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional intertransaction association rules. DMKD'98, pages 12:1-12:7.
H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, pages 412-421.
J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, pages 215-224.
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.


				