Mining Sequential Patterns: Generalizations and Performance Improvements
R. Srikant, R. Agrawal
IBM Almaden Research Center
Advisor: Dr. Hsu
Presented by: M.H. Lin

Outline
  • Motivation
  • Objective
  • Introduction
  • Problem Statement
  • The New Algorithm: GSP
  • Performance Evaluation
  • Conclusion
  • Personal Opinion

Motivation
  • The problem of mining sequential patterns was recently introduced.
  • Limitations of AprioriAll [Agrawal, 1995]:
      • Absence of time constraints
      • Rigid definition of a transaction
      • Absence of taxonomies

Objective
  • We present GSP, a new algorithm that discovers these generalized sequential patterns.
  • We empirically compare the performance of GSP with that of the AprioriAll algorithm.

Introduction
  • Instance:
      • A database of sequences, called data-sequences
      • Each sequence is a list of transactions ordered by transaction-time
      • Each transaction is a set of items
  • Definitions:
      • A sequential pattern consists of a list of itemsets
      • Support: the number of data-sequences that contain the pattern
  • Problem: to discover all sequential patterns with a user-specified minimum support

Example of a Sequential Pattern
  • Database of a book club: each data-sequence corresponds to all book selections of a given customer, and each transaction contains the books selected by that customer in one order.
  • A sequential pattern: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’.

Features of a Sequential Pattern
  • E.g.: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’.
  • Maximum and/or minimum time gaps between adjacent elements
      • E.g.: the time between buying ‘Foundation’ and then buying ‘Foundation and Empire’ and ‘Ringworld’ should be within 3 months.
  • A sliding time window over the sequence-pattern elements
      • E.g.: with a window of one week, the data-sequence Mo: BK-a, Sa: BK-b, next Su: BK-c supports the pattern “BK-a” and “BK-b”, then “BK-c”.
  • User-defined taxonomies
      • Example coming soon….
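
The basic notions from the Introduction (data-sequence, pattern, support) can be made concrete with a small sketch. It uses plain containment only, with no time gaps, sliding window, or taxonomy. The names is_subsequence and support, and the three-customer database, are hypothetical and chosen only for illustration; they are not from the slides.

```python
def is_subsequence(pattern, data_sequence):
    """Return True if each pattern element is a subset of a distinct
    transaction of data_sequence, with the order preserved.

    Both arguments are lists of sets of items. Matching each element to
    the earliest possible transaction is enough for this plain check.
    """
    position = 0
    for element in pattern:
        # Skip transactions that do not contain the whole element.
        while position < len(data_sequence) and not element <= data_sequence[position]:
            position += 1
        if position == len(data_sequence):
            return False
        position += 1  # the next element must match a later transaction
    return True


def support(pattern, database):
    """Number of data-sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


# Hypothetical book-club database: one data-sequence per customer.
database = [
    [{"Foundation"}, {"Foundation and Empire", "Ringworld"}, {"Second Foundation"}],
    [{"Foundation"}, {"Ringworld"}, {"Second Foundation"}],
    [{"Ringworld"}, {"Foundation", "Foundation and Empire"}],
]
pattern = [{"Foundation"}, {"Foundation and Empire", "Ringworld"}, {"Second Foundation"}]
print(support(pattern, database))  # 1
```

With a minimum support of 2 data-sequences, this pattern would not be reported, while the shorter pattern ‘Foundation’, then ‘Second Foundation’ (contained in the first two data-sequences) would.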
A User-defined Taxonomy
  A customer who bought Foundation, then Perfect Spy, would support the following patterns:
  • Foundation, then Perfect Spy
  • Asimov, then Perfect Spy
  • Science Fiction, then Le Carre
  • …

The Old Algorithm: AprioriAll
  • A 3-phase algorithm
      • Phase 1: finds all frequent itemsets with minimum support
      • Phase 2: transforms the database so that each transaction contains only the frequent itemsets
      • Phase 3: finds sequential patterns
  • Pros:
      • Can discover all frequent sequential patterns
  • Cons:
      • Computationally expensive in both space and time
      • Not feasible to incorporate sliding windows

Problem Statement
  Definitions:
  • Let I = {i1, i2, …, im} be a set of literals, called items.
  • Let T be a directed acyclic graph on the literals.
  • An itemset is a non-empty set of items.
  • A sequence is an ordered list of itemsets.
  • We denote a sequence s by <s1 s2 … sn>, where sj is an itemset.
  • We denote an element of a sequence by (x1, x2, …, xm), where xj is an item.
  • A sequence <a1 a2 … an> is a subsequence of another sequence <b1 b2 … bm> if there exist integers i_1 < i_2 < … < i_n such that a1 ⊆ b_{i_1}, a2 ⊆ b_{i_2}, …, an ⊆ b_{i_n}.
      • E.g.: <(3)(4,5)(8)> is a subsequence of <(7)(3,8)(9)(4,5,6)(8)>
      • E.g.: <(3)(5)> is not a subsequence of <(3,5)>

Problem Statement (contd.)
  • A data-sequence contains a sequence s if s is a subsequence of the data-sequence.
  • Plus taxonomies: a transaction T contains an item x ∈ I if x is in T or x is an ancestor of some item in T.
  • Plus sliding windows: a data-sequence d = <d1 … dm> contains a sequence s = <s1 … sn> if there exist integers l_1 ≤ u_1 < l_2 ≤ u_2 < … < l_n ≤ u_n such that
      1. s_i is contained in d_{l_i} ∪ … ∪ d_{u_i}, 1 ≤ i ≤ n, and
      2. transaction-time(d_{u_i}) - transaction-time(d_{l_i}) ≤ window-size, 1 ≤ i ≤ n
  • Plus time constraints:
      3. transaction-time(d_{l_i}) - transaction-time(d_{u_{i-1}}) > min-gap, 2 ≤ i ≤ n, and
      4. transaction-time(d_{u_i}) - transaction-time(d_{l_{i-1}}) ≤ max-gap, 2 ≤ i ≤ n.
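
The four containment conditions can be checked directly by trying every admissible choice of l_i and u_i. The following is a brute-force sketch of the definition only, not the counting procedure GSP itself uses. The function name contains, the (time, items) representation of a transaction, the defaults (window-size 0, min-gap 0, no max-gap), and the day numbers in the usage lines are assumptions made for illustration. Taxonomies are not handled here.

```python
from math import inf


def contains(data_sequence, pattern, window_size=0, min_gap=0, max_gap=inf):
    """Brute-force test of containment under conditions 1 to 4.

    data_sequence: list of (transaction_time, set_of_items) pairs with
                   strictly increasing transaction times.
    pattern:       list of item sets (the elements s_1 ... s_n).
    """
    times = [time for time, _ in data_sequence]
    m = len(data_sequence)

    def search(i, prev_l, prev_u):
        if i == len(pattern):
            return True  # every element has been matched
        first = 0 if prev_u is None else prev_u + 1  # enforces u_{i-1} < l_i
        for l in range(first, m):
            # Condition 3: start of this element vs. end of the previous one.
            if prev_u is not None and times[l] - times[prev_u] <= min_gap:
                continue
            merged = set()
            for u in range(l, m):
                # Condition 2: merged transactions must fit in the window.
                if times[u] - times[l] > window_size:
                    break
                # Condition 4: end of this element vs. start of the previous one.
                if prev_l is not None and times[u] - times[prev_l] > max_gap:
                    break
                merged |= data_sequence[u][1]
                # Condition 1: the element is covered by d_l ... d_u.
                if pattern[i] <= merged and search(i + 1, l, u):
                    return True
        return False

    return search(0, None, None)


# Sliding-window example with assumed day numbers: Monday = 1,
# Saturday = 6, and a Sunday in the following week = 14.
d = [(1, {"BK-a"}), (6, {"BK-b"}), (14, {"BK-c"})]
s = [{"BK-a", "BK-b"}, {"BK-c"}]
print(contains(d, s, window_size=7))  # True
print(contains(d, s, window_size=0))  # False
```

With the defaults the function reduces to the plain subsequence test, so it also reproduces the two subsequence examples given in the definitions.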
Problem Definition
  • Input:
      • Database D: data-sequences
      • Taxonomy T: a DAG, not a tree
      • User-specified min-gap and max-gap time constraints
      • A user-specified sliding window size
      • A user-specified minimum support
  • Goal: to find all sequences whose support is greater than the user-specified minimum support

Example
  • Minimum support: 2 data-sequences
  • AprioriAll finds <(Ringworld)(Ringworld Engineers)>
  • A sliding window of 7 days adds the pattern <(Foundation, Ringworld)(Ringworld Engineers)>
  • A max-gap of 30 days drops both patterns
  • Adding the taxonomy, with no sliding window or time constraints, adds one pattern: <(Foundation)(Asimov)>

GSP: Basic Structure
  • Phase 1: makes the first pass over the database
      • Yields all the 1-element frequent sequences
  • Phase 2: the k-th pass
      • Starts with the seed set found in the (k-1)-th pass and generates candidate sequences, each of which has one more item than a seed sequence
      • Makes a new pass over D to find the support of these candidate sequences
      • The frequent candidates become the seed for the next pass
  • Phase 3: terminates when
      • no more frequent sequences are found, or
      • no candidate sequences are generated

GSP: Implementation
  • Generating candidates: generate as few candidates as possible while maintaining completeness
  • Counting candidates: determine the support of each candidate sequence
  • Implementing taxonomies

Candidate Generation
  • Definitions:
      • k-sequence: a sequence with k items
      • L_k: the set of frequent k-sequences
      • C_k: the set of candidate k-sequences
  • Goal: given the set of all frequent (k-1)-sequences, generate a candidate set that contains all frequent k-sequences
  • Algorithm:
      • Join Phase: join L_{k-1} with L_{k-1}.
      • s1 can join with s2 if the sequence obtained by dropping the first item of s1 is the same as the sequence obtained by dropping the last item of s2.
      • Prune Phase: delete candidate sequences that have a contiguous (k-1)-subsequence whose support count is less than the minimum support.

Candidate Generation: Example
  • Join phase:
      • <(1,2)(3)> joins with <(2)(3,4)> => <(1,2)(3,4)>
      • <(1,2)(3)> joins with <(2)(3)(5)> => <(1,2)(3)(5)>
  • Prune phase:
      • <(1,2)(3)(5)> is dropped because its contiguous subsequence <(1)(3)(5)> is not in L_3

Counting Candidates
  • Problem: given a set of candidate sequences C and a data-sequence d, find all sequences in C that are contained in d.
  • Two techniques are used:
      • A hash-tree data structure, to reduce the number of candidates in C that need to be checked
      • A transformed representation of the data-sequence d, to efficiently test whether a specific candidate is a subsequence of d

Hash-Tree Structure
  • Purpose: reducing the number of candidates that need to be checked
  • Leaf node: a list of sequences
  • Interior node: a hash table
  • Operations:
      • Adding candidate sequences to the hash-tree
      • Finding the candidates contained in a data-sequence
          • Min-gap
          • Max-gap
          • Sliding window size

Representation Transformation
  • Purpose: to efficiently find the first occurrence of an element
  • Transform the data-sequences into transaction-links, where each link is identified by one item
  • E.g.: max-gap = 30, min-gap = 5, window-size = 0, <(1,2)(3)(4)>
  • E.g.: window-size = 7, find (2,6) after time = 20

Implementing Taxonomies
  • Basic idea: replace each data-sequence d with an “extended sequence” d’, where each transaction di’ contains all the items in the corresponding transaction di, as well as all their ancestors.
      • E.g.: <(Foundation, Ringworld)(Second Foundation)> => <(Foundation, Ringworld, Asimov, Niven, Science Fiction)(Second Foundation, Asimov, Science Fiction)>
  • Optimizations:
      • Pre-compute the ancestors of each item, and drop infrequent ancestors before a new pass
      • Do not count patterns with an element that contains both an item x and its ancestor y
  • Problem: redundancy
      • E.g.
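
The join and prune phases from the Candidate Generation slides can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. A sequence is assumed to be a tuple of elements, each element a sorted tuple of items. The slides do not say where the new item is placed in the joined candidate; the sketch assumes it becomes a separate element if it was a separate element in s2, and otherwise is appended to the last element of s1. The set L3 below is an assumed input chosen to be consistent with the example's outcome, and all function names are hypothetical. The special handling needed when joining 1-sequences (k = 2) is left out.

```python
from itertools import product


def drop_first(seq):
    """Remove the first item of the first element."""
    head, *rest = seq
    return tuple(rest) if len(head) == 1 else (head[1:], *rest)


def drop_last(seq):
    """Remove the last item of the last element."""
    *rest, tail = seq
    return tuple(rest) if len(tail) == 1 else (*rest, tail[:-1])


def join(s1, s2):
    """Return the candidate built from s1 and s2, or None if they do not join."""
    if drop_first(s1) != drop_last(s2):
        return None
    last_item = s2[-1][-1]
    if len(s2[-1]) == 1:
        return (*s1, (last_item,))              # new item forms its own element
    return (*s1[:-1], s1[-1] + (last_item,))    # new item extends the last element


def contiguous_subsequences(seq):
    """Yield sequences with one item dropped from the first element, the last
    element, or any element that holds at least two items."""
    n = len(seq)
    for i, element in enumerate(seq):
        if i in (0, n - 1) or len(element) > 1:
            for j in range(len(element)):
                reduced = element[:j] + element[j + 1:]
                yield seq[:i] + ((reduced,) if reduced else ()) + seq[i + 1:]


def join_phase(frequent):
    """Join the set of frequent (k-1)-sequences with itself."""
    joined = (join(s1, s2) for s1, s2 in product(frequent, repeat=2))
    return {candidate for candidate in joined if candidate is not None}


def prune_phase(candidates, frequent):
    """Keep candidates whose contiguous (k-1)-subsequences are all frequent."""
    return {c for c in candidates
            if all(sub in frequent for sub in contiguous_subsequences(c))}


# Assumed set of frequent 3-sequences, consistent with the example above.
L3 = {
    ((1, 2), (3,)), ((1, 2), (4,)), ((1,), (3, 4)),
    ((1, 3), (5,)), ((2,), (3, 4)), ((2,), (3,), (5,)),
}
C4 = join_phase(L3)        # {((1, 2), (3, 4)), ((1, 2), (3,), (5,))}
C4 = prune_phase(C4, L3)   # {((1, 2), (3, 4))}
```

Storing L_{k-1} as a set of tuples keeps the membership tests in the prune phase cheap; the quadratic pairing in join_phase is acceptable for a sketch but would be replaced by an indexed lookup in practice.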
Performance Evaluation
  • Comparison of GSP and AprioriAll
      • Result: GSP is 2 to 20 times faster
      • Contributing factors:
          • Fewer candidates
          • Directly finding the candidates
  • Scale-up: GSP scales linearly with the number of data-sequences
  • Effects of time constraints and sliding windows: there was no performance degradation

Experiment Result

Experiment Result (contd.)

Conclusion
  • GSP is a generalized sequence mining algorithm
      • Discovers all the sequential patterns
      • Good customizability
  • Has been incorporated into IBM’s data mining product

Personal Opinion
  • Hash-tree structure: main-memory limitation
  • Multiple passes over the database
  • Apply GSP to CIS data

