Docstoc

Multi-dimensional Sequential Pattern Mining

Document Sample
Multi-dimensional Sequential Pattern Mining Powered By Docstoc
					Multi-dimensional Sequential
Pattern Mining

  Helen Pinto, Jiawei Han, Jian Pei,
      Ke Wang, Qiming Chen,
          Umeshwar Dayal


                                       1
Outline
   Why multidimensional sequential
    pattern mining?
   Problem definition
   Algorithms
   Experimental results
   Conclusions


                                      2
         Why Sequential Pattern Mining?
   Sequential pattern mining: Finding time-related
    frequent patterns (frequent subsequences)
   Many data and applications are time-related
       Customer shopping patterns, telephone calling patterns
            E.g., first buy computer, then CD-ROMS, software, within 3 mos.
       Natural disasters (e.g., earthquake, hurricane)
       Disease and treatment
       Stock market fluctuation
       Weblog click stream analysis
       DNA sequence analysis
                                                                          3
        Motivating Example
   Sequential patterns are useful
       “free internet access  buy package 1  upgrade to
        package 2”
       Marketing, product design & development
   Problems: lack of focus
       Various groups of customers may have different patterns
   MD-sequential pattern mining: integrate multi-
    dimensional analysis and sequential pattern mining
                                                             4
      Sequences and Patterns
     Given a set of sequences, find the
      complete set of frequent subsequences
A sequence database        A   sequence : < (ef) (ab)   (df) c b >


SID        sequence                Elements items within an
10     <a(abc)(ac)d(cf)>        element are listed alphabetically
20      <(ad)c(bc)(ae)>        <a(bc)dc> is a subsequence
30      <(ef)(ab)(df)cb>       of <a(abc)(ac)d(cf)>
40        <eg(af)cbc>
                               Given support threshold
                               min_sup =2, <(ab)c> is a
                               sequential pattern                    5
  Sequential Pattern: Basics
                             A   sequence : <(bd) c   b (ac)>
 A sequence database
Seq. ID      Sequence                          Elements
  10       <(bd)cb(ac)>     <ad(ae)> is a subsequence
  20      <(bf)(ce)b(fg)>   of <a(bd)bcb(ade)>
  30       <(ah)(bf)abf>
  40        <(be)(ce)d>     Given support threshold
  50      <a(bd)bcb(ade)>   min_sup =2, <(bd)cb> is a
                            sequential pattern


                                                                6
    MD Sequence Database
       P=(*,Chicago,*,<bf>) matches tuple 20
        and 30
       If support =2, P is a MD sequential
        pattern
cid Cust_grp   City      Age_grp sequence
10 Business    Boston    Middle  <(bd)cba>
20 Professional Chicago  Young     <(bf)(ce)(fg)>
30 Business     Chicago  Middle    <(ah)abf>
40 Education    New York Retired   <(be)(ce)>
                                                    7
Mining of MD Seq. Pat.
   Embedding MD information into
    sequences
       Using a uniform seq. pat. mining method
   Integration of seq. pat. mining and MD
    analysis method




                                                  8
            UNISEQ
                 Embed MD information into sequences
cid   Cust_grp       City       Age_grp   sequence
                                                               Mine the extended
10    Business       Boston     Middle    <(bd)cba>
20    Professional   Chicago    Young     <(bf)(ce)(fg)>
                                                              sequence database
30    Business       Chicago    Middle    <(ah)abf>         using sequential pattern
40    Education      New York   Retired   <(be)(ce)>            mining methods

                                    cid                MD-extension of sequences
                                     10         <(Business,Boston,Middle)(bd)cba>
                                     20    <(Professional,Chicago,Young)(bf)(ce)(fg)>
                                     30        <(Business,Chicago,Middle)(ah)abf>
                                     40      <(Education,New York,Retired)(be)(ce)>
                                                                                        9
        Mine Sequential Patterns by
        Prefix Projections
   Step 1: find length-1 sequential patterns
       <a>, <b>, <c>, <d>, <e>, <f>
   Step 2: divide search space. The complete set of
    seq. pat. can be partitioned into 6 subsets:
       The ones having prefix <a>;
                                       SID       sequence
       The ones having prefix <b>;
                                       10    <a(abc)(ac)d(cf)>
       …                              20    <(ad)c(bc)(ae)>
       The ones having prefix <f>     30    <(ef)(ab)(df)cb>
                                       40      <eg(af)cbc>
                                                            10
        Find Seq. Patterns with Prefix <a>
   Only need to consider projections w.r.t. <a>
      <a>-projected database: <(abc)(ac)d(cf)>,

       <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
   Find all the length-2 seq. pat. Having prefix <a>:
    <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      Further partition into 6 subsets
                                        SID     sequence
         Having prefix <aa>;            10 <a(abc)(ac)d(cf)>
                                         20  <(ad)c(bc)(ae)>
         …
                                         30  <(ef)(ab)(df)cb>
         Having prefix <af>             40    <eg(af)cbc>

                                                          11
            Completeness of PrefixSpan
                                      SDB
                              SID        sequence
                                                           Length-1 sequential patterns
                               10    <a(abc)(ac)d(cf)>
                                                           <a>, <b>, <c>, <d>, <e>, <f>
                               20    <(ad)c(bc)(ae)>
                               30    <(ef)(ab)(df)cb>
                               40       <eg(af)cbc>
    Having prefix <a>                                          Having prefix <c>, …, <f>
                               Having prefix <b>
<a>-projected database                                   <b>-projected database
<(abc)(ac)d(cf)>            Length-2 sequential
                                                                                     …
<(_d)c(bc)(ae)>             patterns
<(_b)(df)cb>                <aa>, <ab>, <(ab)>,
<(_f)cbc>                   <ac>, <ad>, <af>
                                                               ……
Having prefix <aa> Having prefix <af>

<aa>-proj. db   …    <af>-proj. db

                                                                                      12
        Efficiency of PrefixSpan

   No candidate sequence needs to be generated

   Projected databases keep shrinking

   Major cost of PrefixSpan: constructing projected
    databases
       Can be improved by bi-level projections



                                                  13
       Mining MD-Patterns
      MD pattern            cid    Cust_grp       City       Age_grp   sequence
                            10     Business       Boston     Middle    <(bd)cba>
     (*,Chicago,*)          20     Professional   Chicago    Young     <(bf)(ce)(fg)>
                            30     Business       Chicago    Middle    <(ah)abf>
(cust-grp,city,age-grp)
                            40     Education      New York   Retired   <(be)(ce)>



(cust-grp,city)   Cust-grp,*,age-grp)


(cust-grp,*,*)            (*,city,*)              (*,*,age-grp)


                             All                         BUC processing
                                                                               14
        Dim-Seq
   First find MD-patterns
       E.g. (*,Chicago,*)
   Form projected sequence database
       <(bf)(ce)(fg)> and <(ah)abf> for (*,Chicago,*)
   Find seq. pat in projected database
       E.g. (*,Chicago,*,<bf>)
                       cid   Cust_grp       City       Age_grp   sequence
                       10    Business       Boston     Middle    <(bd)cba>
                       20    Professional   Chicago    Young     <(bf)(ce)(fg)>
                       30    Business       Chicago    Middle    <(ah)abf>
                       40    Education      New York   Retired   <(be)(ce)>
                                                                         15
    Seq-Dim
   Find sequential patterns
       E.g. <bf>
   Form projected MD-database
       E.g. (Professional,Chicago,Young) and
        (Business,Chicago,Middle) for <bf>
   Mine MD-patterns
       E.g. (*,Chicago,*,<bf>)
                    cid   Cust_grp       City       Age_grp   sequence
                    10    Business       Boston     Middle    <(bd)cba>
                    20    Professional   Chicago    Young     <(bf)(ce)(fg)>
                    30    Business       Chicago    Middle    <(ah)abf>
                    40    Education      New York   Retired   <(be)(ce)>
                                                                       16
Scalability Over Dimensionality




                             17
Scalability Over Cardinality




                               18
Scalability Over Support Threshold




                                     19
Scalability Over Database Size




                            20
Pros & Cons of Algorithms
   Seq-Dim is efficient and scalable
       Fastest in most cases
   UniSeq is also efficient and scalable
       Fastest with low dimensionality
   Dim-Seq has poor scalability




                                            21
Conclusions
   MD seq. pat. mining are interesting and
    useful
   Mining MD seq. pat. efficiently
       Uniseq, Dim-Seq, and Seq-Dim
   Future work
       Applications of sequential pattern mining



                                                22
         References (1)
   R. Agrawal and R. Srikant. Fast algorithms for mining association rules.
    VLDB'94, pages 487-499.
   R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, pages
    3-14.
   C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with
    multiple granularities in time sequences. Data Engineering Bulletin,
    21:32-38, 1998.
   M. Garofalakis, R. Rastogi, and K. Shim. Spirit: Sequential pattern
    mining with regular expression constraints. VLDB'99, pages 223-234.
   J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns
    in time series database. ICDE'99, pages 106-115.
   J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu.
    FreeSpan: Frequent pattern-projected sequential pattern mining.
    KDD'00, pages 355-359.
                                                                          23
         References (2)
   J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate
    generation. SIGMOD'00, pages 1-12.
   H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional
    intertransaction association rules. DMKD'98, pages 12:1-12:7.
   H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent
    episodes in event sequences. Data Mining and Knowledge Discovery,
    1:259-289, 1997.
   B. "Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules.
    ICDE'98, pages 412-421.
   J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C.
    Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-
    projected pattern growth. ICDE'01, pages 215-224.
   R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations
    and performance improvements. EDBT'96, pages 3-17.
                                                                         24

				
DOCUMENT INFO