# Multi-dimensional Sequential Pattern Mining

```
Multi-dimensional Sequential Pattern Mining

Helen Pinto, Jiawei Han, Jian Pei,
Ke Wang, Qiming Chen,
Umeshwar Dayal

1
Outline
   Why multidimensional sequential
pattern mining?
   Problem definition
   Algorithms
   Experimental results
   Conclusions

2
Why Sequential Pattern Mining?
   Sequential pattern mining: Finding time-related
frequent patterns (frequent subsequences)
   Many data and applications are time-related
   Customer shopping patterns, telephone calling patterns
   E.g., first buy a computer, then CD-ROMs and software, within 3 months
   Natural disasters (e.g., earthquake, hurricane)
   Disease and treatment
   Stock market fluctuation
   Weblog click stream analysis
   DNA sequence analysis
3
Motivating Example
   Sequential patterns are useful
   “free internet access → buy package 1 → upgrade to package 2”
   Marketing, product design & development
   Problems: lack of focus
   Various groups of customers may have different patterns
   MD-sequential pattern mining: integrate multi-dimensional
analysis and sequential pattern mining
4
Sequences and Patterns
   Given a set of sequences, find the complete set of frequent subsequences
   A sequence: <(ef)(ab)(df)cb>
   Items within an element are listed alphabetically
   <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
   Given support threshold min_sup = 2, <(ab)c> is a sequential pattern

A sequence database
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
5
Sequential Pattern: Basics
   A sequence: <(bd)cb(ac)>
   <ad(ae)> is a subsequence of <a(bd)bcb(ade)>
   Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern

A sequence database
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

6
MD Sequence Database
   P = (*,Chicago,*,<bf>) matches tuples 20 and 30
   If min_sup = 2, P is an MD sequential pattern

cid   Cust_grp       City       Age_grp   sequence
10    Business       Boston     Middle    <(bd)cba>
20    Professional   Chicago    Young     <(bf)(ce)(fg)>
30    Business       Chicago    Middle    <(ah)abf>
40    Education      New York   Retired   <(be)(ce)>
7
Mining of MD Seq. Pat.
   Embedding MD information into
sequences
   Using a uniform seq. pat. mining method
   Integration of seq. pat. mining and MD
analysis method

8
UNISEQ
   Embed MD information into sequences
   Mine the extended sequence database using sequential pattern mining methods

cid   Cust_grp       City       Age_grp   sequence
10    Business       Boston     Middle    <(bd)cba>
20    Professional   Chicago    Young     <(bf)(ce)(fg)>
30    Business       Chicago    Middle    <(ah)abf>
40    Education      New York   Retired   <(be)(ce)>

cid   MD-extension of sequences
20    <(Professional,Chicago,Young)(bf)(ce)(fg)>
40    <(Education,New York,Retired)(be)(ce)>
9
Mine Sequential Patterns by Prefix Projections
   Step 1: find length-1 sequential patterns
   <a>, <b>, <c>, <d>, <e>, <f>
   Step 2: divide search space. The complete set of
seq. pat. can be partitioned into 6 subsets:
   The ones having prefix <a>;
   The ones having prefix <b>;
   …
   The ones having prefix <f>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
10
Find Seq. Patterns with Prefix <a>
   Only need to consider projections w.r.t. <a>
   <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
   Find all the length-2 seq. pat. having prefix <a>:
<aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
   Further partition into 6 subsets
   Having prefix <aa>;
   …
   Having prefix <af>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>


11
Completeness of PrefixSpan
SDB
SID   sequence
10    <a(abc)(ac)d(cf)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
   Having prefix <a>: <a>-projected database
      <(abc)(ac)d(cf)>
      <(_d)c(bc)(ae)>
      <(_b)(df)cb>
      <(_f)cbc>
      Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
         Having prefix <aa>: <aa>-proj. db
         …
         Having prefix <af>: <af>-proj. db
   Having prefix <b>: <b>-projected database
      …
   Having prefix <c>, …, <f>
      ……


12
Efficiency of PrefixSpan

   No candidate sequence needs to be generated

   Projected databases keep shrinking

   Major cost of PrefixSpan: constructing projected
databases
   Can be improved by bi-level projections

13
Mining MD-Patterns
cid   Cust_grp       City       Age_grp   sequence
10    Business       Boston     Middle    <(bd)cba>
20    Professional   Chicago    Young     <(bf)(ce)(fg)>
30    Business       Chicago    Middle    <(ah)abf>
40    Education      New York   Retired   <(be)(ce)>

MD pattern: (*,Chicago,*)

BUC processing:
   (cust-grp,city,age-grp)
   (cust-grp,city)   (cust-grp,*,age-grp)
   (cust-grp,*,*)   (*,city,*)   (*,*,age-grp)
   All
14
Dim-Seq
   First find MD-patterns
   E.g. (*,Chicago,*)
   Form projected sequence database
   <(bf)(ce)(fg)> and <(ah)abf> for (*,Chicago,*)
   Find seq. pat. in projected database
   E.g. (*,Chicago,*,<bf>)
cid   Cust_grp       City       Age_grp   sequence
10    Business       Boston     Middle    <(bd)cba>
20    Professional   Chicago    Young     <(bf)(ce)(fg)>
30    Business       Chicago    Middle    <(ah)abf>
40    Education      New York   Retired   <(be)(ce)>
15
Seq-Dim
   Find sequential patterns
   E.g. <bf>
   Form projected MD-database
   E.g. (Professional,Chicago,Young) and (Business,Chicago,Middle) for <bf>
   Mine MD-patterns
   E.g. (*,Chicago,*,<bf>)
cid   Cust_grp       City       Age_grp   sequence
10    Business       Boston     Middle    <(bd)cba>
20    Professional   Chicago    Young     <(bf)(ce)(fg)>
30    Business       Chicago    Middle    <(ah)abf>
40    Education      New York   Retired   <(be)(ce)>
16
Scalability Over Dimensionality

17
Scalability Over Cardinality

18
Scalability Over Support Threshold

19
Scalability Over Database Size

20
Pros & Cons of Algorithms
   Seq-Dim is efficient and scalable
   Fastest in most cases
   UniSeq is also efficient and scalable
   Fastest with low dimensionality
   Dim-Seq has poor scalability

21
Conclusions
   MD seq. pat. mining is interesting and useful
   Mining MD seq. pat. efficiently
   UniSeq, Dim-Seq, and Seq-Dim
   Future work
   Applications of sequential pattern mining

22
References (1)
   R. Agrawal and R. Srikant. Fast algorithms for mining association rules.
VLDB'94, pages 487-499.
   R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, pages
3-14.
   C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with
multiple granularities in time sequences. Data Engineering Bulletin,
21:32-38, 1998.
   M. Garofalakis, R. Rastogi, and K. Shim. Spirit: Sequential pattern
mining with regular expression constraints. VLDB'99, pages 223-234.
   J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns
in time series database. ICDE'99, pages 106-115.
   J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu.
FreeSpan: Frequent pattern-projected sequential pattern mining.
KDD'00, pages 355-359.
23
References (2)
   J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate
generation. SIGMOD'00, pages 1-12.
   H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional
intertransaction association rules. DMKD'98, pages 12:1-12:7.
   H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent
episodes in event sequences. Data Mining and Knowledge Discovery,
1:259-289, 1997.
   B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules.
ICDE'98, pages 412-421.
   J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C.
Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-
projected pattern growth. ICDE'01, pages 215-224.
   R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations
and performance improvements. EDBT'96, pages 3-17.
24

```
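The subsequence and support definitions on the "Sequences and Patterns" slide can be made concrete with a short sketch. It assumes single-letter items and the angle-bracket notation used in the slides. The names `parse`, `is_subsequence`, and `support` are hypothetical helpers for illustration, not part of the authors' implementation.

```python
def parse(text):
    """Parse '<a(abc)(ac)d(cf)>' into a list of elements (one frozenset per element).

    Assumption: every item is a single character, as in the slides.
    """
    body = text.strip()[1:-1]  # drop the surrounding angle brackets
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            j = body.index(")", i)
            elements.append(frozenset(body[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """True if the pattern's elements are subsets of distinct sequence elements, in order."""
    pos = 0
    for elem in pattern:
        # Greedily advance to the earliest element that contains this pattern element.
        while pos < len(sequence) and not elem <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


# The sequence database from the slide.
SDB = {
    10: parse("<a(abc)(ac)d(cf)>"),
    20: parse("<(ad)c(bc)(ae)>"),
    30: parse("<(ef)(ab)(df)cb>"),
    40: parse("<eg(af)cbc>"),
}

contained = is_subsequence(parse("<a(bc)dc>"), SDB[10])
sup_ab_c = support(parse("<(ab)c>"), SDB)
```

With `min_sup = 2`, `<(ab)c>` qualifies as a sequential pattern because it is contained in sequences 10 and 30.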
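The prefix-projection step from the "Find Seq. Patterns with Prefix <a>" slide can be sketched as below. This is a minimal illustration for a single-item prefix only, not the full PrefixSpan algorithm; `project`, `render`, and `length2_patterns` are hypothetical names, and items are assumed to be single letters.

```python
from collections import Counter

SDB = {
    10: "<a(abc)(ac)d(cf)>",
    20: "<(ad)c(bc)(ae)>",
    30: "<(ef)(ab)(df)cb>",
    40: "<eg(af)cbc>",
}


def parse(text):
    """Parse '<a(abc)d>' into a list of elements, each a sorted tuple of items."""
    body = text.strip()[1:-1]
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            j = body.index(")", i)
            elements.append(tuple(sorted(body[i + 1:j])))
            i = j + 1
        else:
            elements.append((body[i],))
            i += 1
    return elements


def project(sequence, item):
    """Suffix after the first occurrence of `item`: (rest of that element, later elements)."""
    for idx, elem in enumerate(sequence):
        if item in elem:
            partial = tuple(x for x in elem if x > item)  # items after `item` in the same element
            return partial, sequence[idx + 1:]
    return None  # the sequence does not contain the prefix


def render(projected):
    """Write a projected suffix in the slide notation, e.g. '<(_d)c(bc)(ae)>'."""
    partial, rest = projected
    parts = ["(_" + "".join(partial) + ")"] if partial else []
    for elem in rest:
        parts.append(elem[0] if len(elem) == 1 else "(" + "".join(elem) + ")")
    return "<" + "".join(parts) + ">"


def length2_patterns(database, item, min_sup):
    """Frequent length-2 patterns with prefix <item>, counted in the projected database."""
    seq_ext, set_ext = Counter(), Counter()
    for text in database.values():
        projected = project(parse(text), item)
        if projected is None:
            continue
        partial, rest = projected
        after, same = set(), set(partial)
        for elem in rest:
            after.update(elem)              # x occurs in a later element: <item x>
            if item in elem:                # x shares a later element with item: <(item x)>
                same.update(x for x in elem if x > item)
        seq_ext.update(after)
        set_ext.update(same)
    found = [f"<{item}{x}>" for x, n in seq_ext.items() if n >= min_sup]
    found += [f"<({item}{x})>" for x, n in set_ext.items() if n >= min_sup]
    return sorted(found)


projected_db = [render(p) for s in SDB.values() if (p := project(parse(s), "a")) is not None]
patterns = length2_patterns(SDB, "a", min_sup=2)
```

Each sequence is counted at most once per extension, which is why the sketch collects items into sets before updating the counters.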
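The Seq-Dim idea (fix a sequential pattern, form the projected MD-database, then mine MD-patterns in it) can be sketched as follows. The slides use BUC for the MD-pattern step; this sketch swaps in a brute-force enumeration of generalisations, which is only reasonable for a handful of dimensions. All function names are hypothetical.

```python
from collections import Counter
from itertools import product

# The MD sequence database from the slides: cid -> ((Cust_grp, City, Age_grp), sequence).
MD_DB = {
    10: (("Business", "Boston", "Middle"), "<(bd)cba>"),
    20: (("Professional", "Chicago", "Young"), "<(bf)(ce)(fg)>"),
    30: (("Business", "Chicago", "Middle"), "<(ah)abf>"),
    40: (("Education", "New York", "Retired"), "<(be)(ce)>"),
}


def parse(text):
    """Parse '<(bf)(ce)(fg)>' into a list of frozensets (single-letter items assumed)."""
    body = text.strip()[1:-1]
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":
            j = body.index(")", i)
            elements.append(frozenset(body[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """True if the pattern's elements are subsets of distinct sequence elements, in order."""
    pos = 0
    for elem in pattern:
        while pos < len(sequence) and not elem <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def md_patterns(tuples, min_sup):
    """Brute-force stand-in for BUC: count every generalisation of every tuple."""
    counts = Counter()
    for dims in tuples:
        # Each dimension is either kept as its value or generalised to '*'.
        for pattern in product(*[(value, "*") for value in dims]):
            counts[pattern] += 1
    return {p: n for p, n in counts.items() if n >= min_sup}


def seq_dim(pattern_text, min_sup):
    """MD sequential patterns that extend one given sequential pattern."""
    pattern = parse(pattern_text)
    # Projected MD-database: dimension values of tuples whose sequence contains the pattern.
    projected = [dims for dims, seq in MD_DB.values() if is_subsequence(pattern, parse(seq))]
    if len(projected) < min_sup:
        return {}
    return {dims + (pattern_text,): n for dims, n in md_patterns(projected, min_sup).items()}


result = seq_dim("<bf>", min_sup=2)
```

For `<bf>` the projected MD-database holds tuples 20 and 30, so the result contains `(*,Chicago,*,<bf>)` from the slides, plus the fully generalised `(*,*,*,<bf>)`.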