Mining Sequential Patterns Generalizations and Performance

					Mining Sequential Patterns:
Generalizations and Performance

      R. Srikant, R. Agrawal
   IBM Almaden Research Center

           Advisor: Dr. Hsu
        Presented by: M.H. Lin
   Motivation
   Objective
   Introduction
   Problem Statement
   The New Algorithm: GSP
   Performance Evaluation
   Conclusion
   Personal Opinion
   The problem of mining sequential patterns was
    recently introduced.
   Limitations of the AprioriAll algorithm [Agrawal, 1995]
        Absence of time constraints
        Rigid definition of a transaction
        Absence of taxonomies
   We present GSP, a new algorithm that
    discovers these generalized sequential patterns
   We empirically compare the performance of GSP
    with that of the AprioriAll algorithm.
   Instance
        A database of sequences, called data-sequences
        Each sequence is a list of transactions ordered by transaction-time
        Each transaction is a set of items
   Definitions:
        A sequential pattern consists of a list of itemsets
        Support: the number of data-sequences that contain the pattern
   Problem:
        To discover all the sequential patterns with a user-specified
         minimum support
    Example Of A Sequential Pattern
   Database of a book club: each data-sequence
    corresponds to all the book selections of one
    customer, and each transaction contains the books
    selected by that customer in one order

   A sequential pattern:
    5% of customers bought ‘Foundation’, then
    ‘Foundation and Empire’ and ‘Ringworld’,
    then ‘Second Foundation’
    Features of A Sequential Pattern
   E.g.: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and
    ‘Ringworld’, then ‘Second Foundation’
   Maximum and/or minimum time gaps between adjacent elements
        E.g.: the time between buying ‘Foundation’ and then buying ‘Foundation and
         Empire’ and ‘Ringworld’ should be within 3 months
   A sliding time window over the sequence-pattern elements
        E.g.: one week
        Mo: BK-a Sa: BK-b Next Su: BK-c ;
        This data-sequence supports the pattern “BK-a” and “BK-b”, then “BK-c”
   User-defined Taxonomies
        Example
          coming soon….
      A User-defined Taxonomy

A customer who bought Foundation, then Perfect Spy, would support the
following patterns:
       •Foundation, then Perfect Spy
       •Asimov, then Perfect Spy
       •Science Fiction, then Le Carre

        The Old Algorithm--AprioriAll
   A 3-phase algorithm
       Phase 1: finds all frequent itemsets with min. support
       Phase 2: transforms the database so that each transaction
        contains only the frequent itemsets
       Phase 3: finds sequential patterns
   Pros.
       Can discover all frequent sequential patterns
   Cons.
       Computationally expensive: space, time
       Not feasible to incorporate sliding windows
    Problem Statement
   Definitions:
        Let I = {i1,i2,…,im} be a set of literals, called items
        Let T be a directed acyclic graph on the literals.
        An itemset is a non-empty set of items
        A sequence is an ordered list of itemsets
        We denote a sequence s by <s1s2…sn>, where sj is an itemset.
        We denote an element of sequence by (x1,x2,…,xm), where xj is
         an item.
        A sequence <a1a2…an> is a subsequence of another sequence
         <b1b2…bm> if there exist integers i1<i2<…<in such that a1 ⊆ bi1,
         a2 ⊆ bi2, …, an ⊆ bin.

             E.g:<(3)(4,5)(8)> is a subsequence of <(7)(3,8)(9)(4,5,6)(8)>
             E.g:<(3)(5)> is not a subsequence of <(3,5)>
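
The subsequence definition above can be checked mechanically. The following is a minimal sketch, assuming a sequence is written as a list of tuples of items; the function name is_subsequence is hypothetical and not part of the original slides. It matches each element of the shorter sequence against the earliest possible element of the longer one, which is sufficient when no time constraints apply.

```python
def is_subsequence(a, b):
    """Return True if sequence a is a subsequence of sequence b.

    Each sequence is a list of itemsets. Every element of a must be a
    subset of a distinct element of b, with the order preserved.
    """
    pos = 0
    for element in a:
        # Advance to the earliest element of b that contains this itemset.
        while pos < len(b) and not set(element) <= set(b[pos]):
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next element of a must match strictly later in b
    return True


# The two examples from the slide.
print(is_subsequence([(3,), (4, 5), (8,)], [(7,), (3, 8), (9,), (4, 5, 6), (8,)]))
print(is_subsequence([(3,), (5,)], [(3, 5)]))
```

The second call fails because the items 3 and 5 sit in the same element of the longer sequence, so they cannot be matched to two different positions.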
    Problem Statement(contd.)
   A data-sequence contains a sequence s if s is a
    subsequence of the data-sequence.
   Plus taxonomies:
        a transaction T contains an item x ∈ I if x is in T or x is an
         ancestor of some item in T.
   Plus sliding windows:
        A data-sequence d = <d1…dm> contains a sequence s = <s1…sn>
         if there exist integers l1 ≤ u1 < l2 ≤ u2 < … < ln ≤ un such that
        1. si is contained in dli ∪ … ∪ dui (the union of transactions li through ui), 1 ≤ i ≤ n, and
        2. transaction-time(dui) – transaction-time(dli) ≤ window-size, 1 ≤ i ≤ n.
   Plus time constraints:
        3. transaction-time(dli) - transaction-time(dui-1) > min-gap, 2 ≤ i ≤ n,
        4. transaction-time(dui) - transaction-time(dli-1) ≤ max-gap, 2 ≤ i ≤ n.
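
Conditions 1 to 4 can be turned into a direct, exhaustive check. The sketch below is only an illustration of the definition, not the counting procedure used by GSP: the function name contains, its parameter names, and the day numbers given to the book-club example (Monday = 1, Saturday = 6, next Sunday = 14) are all assumptions made for this example.

```python
def contains(data, pattern, window=0, min_gap=0, max_gap=float("inf")):
    """Brute-force test of whether a data-sequence contains a pattern.

    data:    list of (transaction_time, set_of_items), sorted by time
    pattern: list of sets of items
    """
    def search(i, prev_l, prev_u):
        if i == len(pattern):
            return True
        first = 0 if prev_u is None else prev_u + 1
        for l in range(first, len(data)):
            # Condition 3: gap after the previous element must exceed min_gap.
            if prev_u is not None and data[l][0] - data[prev_u][0] <= min_gap:
                continue
            covered = set()
            for u in range(l, len(data)):
                # Condition 2: transactions l..u must fit in the sliding window.
                if data[u][0] - data[l][0] > window:
                    break
                # Condition 4: end of this element within max_gap of the
                # start of the previous element.
                if prev_l is not None and data[u][0] - data[prev_l][0] > max_gap:
                    break
                covered |= data[u][1]
                # Condition 1: element i lies in the union of d_l .. d_u.
                if pattern[i] <= covered and search(i + 1, l, u):
                    return True
        return False

    return search(0, None, None)


# Book-club example with assumed day numbers.
book_club = [(1, {"BK-a"}), (6, {"BK-b"}), (14, {"BK-c"})]
pattern = [{"BK-a", "BK-b"}, {"BK-c"}]
print(contains(book_club, pattern, window=7))
print(contains(book_club, pattern, window=0))
```

With a window of 7 days the first two purchases merge into one element and the pattern is supported; with no window it is not. The search is exponential in the worst case, which is acceptable for a definition check but not for mining.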
    Problem Definition
   Input:
        Database D : data sequences
        Taxonomy T : a DAG, not a tree
        User-specified min-gap and max-gap time
        A user-specified sliding window size
        A user-specified minimum support
   Goal:
        To find all sequences whose support is at least
         the user-specified minimum support

   minimum support: 2 data-sequences
   With the AprioriAll
        <(Ringworld)(Ringworld Engineers)>
   Sliding-window of 7 days adds the pattern
        <(Foundation, Ringworld)(Ringworld Engineers)>
   Max-gap of 30 days
        both patterns dropped
   Adding the taxonomy (no sliding window or time constraints), one of the added patterns is
        <(Foundation)(Asimov)>
    GSP:Basic Structure
   Phase 1: makes the first pass over database
        To yield all the 1-element frequent sequences
   Phase 2: the kth pass:
        Starts with the seed set found in the (k-1)th pass and uses it to generate
         candidate sequences, each of which has one more item than a seed sequence
        Makes a new pass over D to find the support of these candidate sequences
        The frequent candidates become the seed set for the next pass
   Phase 3: terminates when
        no more frequent sequences are found, or
        no candidate sequences are generated
    GSP: implementation
   Generating Candidates:
        To generate as few candidates as possible while
         maintaining completeness
   Counting Candidates:
        To determine the candidate sequence’s support
   Implementing Taxonomies
    Candidate Generation
   Definition:
        k-sequence : a sequence with k items,
        Lk : the set of frequent k-sequences,
        Ck : the set of candidate k-sequences
   Goal: given the set of all frequent (k-1)-sequences,
    generate a candidate set that is a superset of all frequent k-sequences
   Algorithm:
        Join Phase: join Lk-1 with Lk-1. s1 can join with s2 if the subsequence
         obtained by dropping the first item of s1 is the same as the subsequence
         obtained by dropping the last item of s2
        Prune Phase: delete candidate sequences that have a
         contiguous (k-1)-subsequence whose support count is less
         than the minimum support
    Candidate Generation: Example

   Join phase:
        <(1,2)(3)> joins with <(2)(3,4)> => <(1,2)(3,4)>
        <(1,2)(3)> joins with <(2)(3)(5)> => <(1,2)(3)(5)>
   Prune phase:
        <(1,2)(3)(5)> is dropped because its contiguous subsequence <(1)(3)(5)> is not in L3
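
The join and prune steps of this example can be reproduced with a short sketch. Everything here is an assumption made for illustration: sequences are tuples of sorted item tuples, the function names are hypothetical, and the set L3 is an assumed input chosen to contain the sequences named on the slide (the slide's own table of L3 is not reproduced in the text). The special case of joining L1 with itself, where the new item must be added both inside the last element and as a new element, is left out.

```python
def drop_first(seq):
    """Remove the first item of the first element (drop the element if emptied)."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last item of the last element (drop the element if emptied)."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def join(frequent):
    """Join phase: extend s1 with the last item of every s2 it can join with."""
    candidates = set()
    for s1 in frequent:
        for s2 in frequent:
            if drop_first(s1) != drop_last(s2):
                continue
            item = s2[-1][-1]
            if len(s2[-1]) == 1:
                # The item was a separate element in s2: append a new element.
                candidates.add(s1 + ((item,),))
            else:
                # Otherwise the item joins the last element of s1.
                merged = tuple(sorted(set(s1[-1]) | {item}))
                candidates.add(s1[:-1] + (merged,))
    return candidates


def contiguous_subsequences(seq):
    """Yield sequences obtained by dropping one item from the first element,
    the last element, or any element that has at least two items."""
    last = len(seq) - 1
    for i, element in enumerate(seq):
        if i in (0, last) or len(element) >= 2:
            for item in element:
                rest = tuple(x for x in element if x != item)
                yield seq[:i] + ((rest,) if rest else ()) + seq[i + 1:]


def prune(candidates, frequent):
    """Prune phase: keep candidates whose contiguous subsequences are all frequent."""
    return {c for c in candidates
            if all(sub in frequent for sub in contiguous_subsequences(c))}


# Assumed frequent 3-sequences for the example.
L3 = {
    ((1, 2), (3,)), ((1, 2), (4,)), ((1,), (3, 4)),
    ((1, 3), (5,)), ((2,), (3, 4)), ((2,), (3,), (5,)),
}
C4 = join(L3)
print(sorted(C4))
print(sorted(prune(C4, L3)))
```

Under these assumptions the join yields <(1,2)(3,4)> and <(1,2)(3)(5)>, and the prune step keeps only <(1,2)(3,4)>, matching the slide.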
    Counting Candidates
   Problem: given a set of candidate sequences C
    and a data sequence d, find all sequences in C
    that are contained in d.
   Two techniques are used
        Hash-tree data structure: to reduce the number of
         candidates in C that need to be checked.
        Transforming the representation of the data-sequence d:
         to efficiently find whether a specific candidate
         is a subsequence of d.
    Hash-Tree Structure
   Purpose: reducing the number of candidates
   Leaf node: a list of sequences
   Interior node: a hash table
   Operations:
        Adding candidate sequences to the hash-tree
        Finding the candidates contained in a data-sequence
             Min-gap
             Max-gap
             Sliding window size
    Representation Transformation
   Purpose: to efficiently find the first occurrence of an element
   Transform each data-sequence into transaction-links;
    each link is identified by one item
        E.g.:max-gap=30,min-gap=5,window-size=0,<(1,2)(3)(4)>
        E.g.:window-size:7,find(2,6) after time=20
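
One way to read the transformation is as an index from each item to the sorted transaction-times at which it occurs, which makes "find this element after time t" a series of binary searches. The sketch below follows that reading; the function names and the sample data-sequence are assumptions made for illustration, since the slide's figure is not reproduced in the text.

```python
from bisect import bisect_left, bisect_right


def to_item_times(data_sequence):
    """Map each item to the sorted list of transaction-times containing it.

    data_sequence: list of (transaction_time, items), sorted by time.
    """
    index = {}
    for time, items in data_sequence:
        for item in items:
            index.setdefault(item, []).append(time)
    return index


def find_element(item_times, element, after, window):
    """Find the first occurrence of an element strictly after a given time.

    Returns (start_time, end_time) of the transactions used, or None.
    """
    bound, inclusive = after, False
    while True:
        hits = []
        for item in element:
            times = item_times.get(item, [])
            pos = bisect_left(times, bound) if inclusive else bisect_right(times, bound)
            if pos == len(times):
                return None  # some item never occurs late enough
            hits.append(times[pos])
        start, end = min(hits), max(hits)
        if end - start <= window:
            return start, end
        # Items are spread too far apart: any valid match must lie
        # no earlier than (end - window), so retry from there.
        bound, inclusive = end - window, True


# Assumed sample data-sequence: (transaction-time, items).
d = [(10, {1, 2}), (25, {4, 6}), (45, {3}), (50, {1, 2}),
     (65, {3}), (90, {2, 4}), (95, {6})]
index = to_item_times(d)
print(find_element(index, (2, 6), after=20, window=7))
```

On this assumed data the search for (2,6) after time 20 first tries times 50 and 25, then 50 and 95, and finally settles on 90 and 95, which fit inside the 7-unit window.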
    Implementing Taxonomies
   Basic Idea:
        to replace each data-sequence d with an “extended sequence”
         d’, where each transaction di ’ contains all the items in the
         corresponding transaction di ,as well as all their ancestors.
        E.g.: <(Foundation, Ringworld)(Second Foundation)> =>
         <(Foundation, Ringworld, Asimov, Niven, Science Fiction)(Second
         Foundation, Asimov, Science Fiction)>
   Optimizations
        Pre-compute the ancestors of each item, drop infrequent
         ancestors before a new pass
        Do not count patterns with an element that contains both an item x
         and its ancestor y
   Problem: redundancy
        E.g.
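
The extended-sequence idea from the Basic Idea bullet can be sketched in a few lines. The taxonomy edges below are inferred from the slide's own example (Foundation and Second Foundation under Asimov, Ringworld under Niven, both authors under Science Fiction); the names PARENTS, ancestors and extend are hypothetical. Parents are stored as sets because the taxonomy is a DAG, so an item may have more than one parent.

```python
# Assumed taxonomy: item -> set of direct parents.
PARENTS = {
    "Foundation": {"Asimov"},
    "Second Foundation": {"Asimov"},
    "Ringworld": {"Niven"},
    "Asimov": {"Science Fiction"},
    "Niven": {"Science Fiction"},
}


def ancestors(item, parents=PARENTS):
    """Return all ancestors of an item by walking the taxonomy upward."""
    found = set()
    stack = [item]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def extend(data_sequence, parents=PARENTS):
    """Replace each transaction with its items plus all of their ancestors."""
    return [set(t).union(*(ancestors(x, parents) for x in t))
            for t in data_sequence]


extended = extend([{"Foundation", "Ringworld"}, {"Second Foundation"}])
for transaction in extended:
    print(sorted(transaction))
```

In practice the ancestor sets would be computed once per item rather than once per transaction, in line with the pre-computation optimization listed above.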
    Performance Evaluation
   Comparison of GSP and AprioriAll
        Result: GSP is 2 to 20 times faster than AprioriAll
        Contributing factors:
             Fewer candidates
             Directly finding the candidates
   Scale-up:
        scales linearly with the number of data-sequences
   Effects of Time Constraints and Sliding Windows:
        there was no performance degradation
Experiment Result
   GSP is a Generalized Sequence Mining Algorithm
   Discovering all the sequential patterns
   Good Customizability
   Has been incorporated into IBM’s data mining product
    Personal Opinion
   Hash-tree Structure: main memory limitation
   Multi-pass over the database
   Apply GSP to CIS data
