
Mining Frequent Patterns, Associations
Data Mining Techniques

Outline
• What is association rule mining and frequent pattern mining?
• Methods for frequent-pattern mining
• Constraint-based frequent-pattern mining
• Frequent-pattern mining: achievements, promises and research problems

Market Basket Analysis
• A basket contains an assortment of products
• What did one customer purchase at one time?
• What merchandise are customers buying, and when?
• Market basket analysis is a process that analyzes customer buying habits

How Can Market Basket Analysis Help?
• Customer: who are they? Why do they make certain purchases?
• Merchandise: which products tend to be purchased together? Which are most amenable to promotion? Does the brand of a product make a difference?
• Usage:
  – Store layout
  – Product layout
  – Coupon issuing

Association Rules from Market Basket Analysis
• Method: count how often each pair of items occurs in the same transaction
  Transaction 1: Frozen pizza, cola, milk
  Transaction 2: Milk, potato chips
  Transaction 3: Cola, frozen pizza
  Transaction 4: Milk, pretzels
  Transaction 5: Cola, pretzels

                  Frozen Pizza   Milk   Cola   Potato Chips   Pretzels
  Frozen Pizza         2           1      2         0            0
  Milk                 1           3      1         1            1
  Cola                 2           1      3         0            1
  Potato Chips         0           1      0         1            0
  Pretzels             0           1      1         0            2

• The table hints that frozen pizza and cola may sell well together and should be placed side by side in the convenience store.
• Results: we could derive the association rules:
  – If a customer purchases frozen pizza, then they will probably purchase cola.
  – If a customer purchases cola, then they will probably purchase frozen pizza.

Use of Rule Associations
• Coupons, discounts
  – Don't give discounts on 2 items that are frequently bought together. Use the discount on 1 to "pull" the other
• Product placement
  – Offer correlated products to the customer at the same time.
    Increases sales
• Timing of cross-marketing
  – Send a camcorder offer to VCR purchasers 2-3 months after the VCR purchase
• Discovery of patterns
  – People who bought X, Y and Z (but not any pair) bought W over half the time

What Are Frequent Patterns?
• Frequent patterns: patterns (itemsets, subsequences, substructures, etc.) that occur frequently in a database [AIS93]
• For example:
  – A set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset
  – A subsequence, such as buying first a PC, then a digital camera, and then a memory card, is a frequent sequential pattern if it occurs frequently in a shopping history database
  – A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices

Motivation
• Frequent pattern mining: finding regularities in data
  – What products were often purchased together? Beer and diapers?!
  – What are the subsequent purchases after buying a PC?
  – What kinds of DNA are sensitive to a new drug?
  – Can we automatically classify web documents based on frequent keyword combinations?

Why Is Frequent Pattern Mining Important?
• It forms the foundation for many essential data mining tasks
  – Association, correlation, and causality analysis
  – Sequential, structural (e.g., sub-graph) patterns
  – Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  – Classification: associative classification
  – Cluster analysis: frequent pattern-based clustering
  – Data warehousing: iceberg cube and cube-gradient
  – Semantic data compression: fascicles
  – Broad applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, …

A Motivating Example
• Market basket analysis (analysis of customers' shopping behavior)
  – Which groups or sets of items are customers likely to purchase on a given trip to the store?
  – Results can be used to plan marketing or advertising strategies, or in the design of a new catalog.
  – These patterns can be presented in the form of association rules, for example:
    • computer ⇒ antivirus_software [support = 2%, confidence = 60%]

Basic Concepts
• I is the set of items {i1, i2, …, id}
• A transaction T is a set of items: T = {ia, ib, …, it}, T ⊆ I. Each transaction is associated with an identifier, called its TID.
• D, the task-relevant data, is a set of transactions D = {T1, T2, …, Tn}.
• An association rule is of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅

Rule Measures: Support and Confidence
• An itemset X = {x1, …, xk} is a k-itemset
• Support, s, of a rule X ⇒ Y: the probability that a transaction in D contains X ∪ Y, P(X ∪ Y). This is the relative support; the number of transactions in D that contain the itemset is the absolute support
• Confidence, c: the conditional probability that a transaction in D containing X also contains Y, P(Y | X):
    c = sup(X ∪ Y) / sup(X)
• Frequent itemset: if the support of an itemset X satisfies a predefined minimum support threshold, then X is a frequent itemset

An Example
• Let sup_min = 50%, conf_min = 50%

  TID   Items bought
  10    A, B, D
  20    A, C, D
  30    A, D, E
  40    B, E, F
  50    B, C, D, E, F

• Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
• Association rules:
  – A ⇒ D (60%, 100%)
  – D ⇒ A (60%, 75%)
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Problem Definition
• Given I = {i1, i2, …, im}, D = {t1, t2, …, tn}, and the minimum support and confidence thresholds:
  – The frequent pattern mining problem is to find all frequent patterns in D
  – The association rule mining problem is to identify all strong association rules X ⇒ Y, i.e., rules that satisfy both minimum support and minimum confidence

Frequent Pattern Mining: A Road Map (Ⅰ)
• Based on the types of values in the rule
  – Boolean associations: involve associations between the presence and absence of items
    • buys(x, "SQLServer") ∧
      buys(x, "DMBook") ⇒ buys(x, "DM Software") [0.2%, 60%]
  – Quantitative associations: describe associations between quantitative items or attributes
    • age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC")

Frequent Pattern Mining: A Road Map (Ⅱ)
• Based on the number of data dimensions involved in the rule
  – Single-dimensional associations: the items or attributes in an association rule reference only one dimension
    • buys(x, "computer") ⇒ buys(x, "printer")
  – Multidimensional associations: reference two or more dimensions, such as age, income, and buys
    • age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC")

Frequent Pattern Mining: A Road Map (Ⅲ)
• Based on the levels of abstraction involved in the rule set
  – Single level
    • buys(x, "computer") ⇒ buys(x, "printer")
  – Multiple-level analysis
    • What brands of computers are associated with what brands of digital cameras?
    • buys(x, "laptop_computer") ⇒ buys(x, "HP_printer")

Multiple-Level Association Rules
• Items often form hierarchies

  TID   Items Purchased
  1     IBM-ThinkPad-R40/P4M, Symantec-Norton-Antivirus-2003
  2     Microsoft-Office-Professional-2003, Microsoft-
  3     logiTech-Mouse, Fellows-Wrist-Rest
  …     …

  Level 0: all
  Level 1: Computer, Software, Printer & Camera, Accessory
  Level 2: laptop, desktop, office, antivirus, printer, camera, mouse, pad
  Level 3: IBM, Dell, Microsoft

Frequent Pattern Mining: A Road Map (Ⅳ)
• Based on the completeness of patterns to be mined
  – Complete set of frequent itemsets
  – Closed frequent itemsets
  – Maximal frequent itemsets
  – Constrained frequent itemsets
  – Approximate frequent itemsets
  – …
[Figure: nested sets, with maximal frequent itemsets inside closed frequent itemsets inside frequent itemsets]

Outline
• What is association rule mining and frequent pattern mining?
• Methods for frequent-pattern mining
• Constraint-based frequent-pattern mining
• Frequent-pattern mining: achievements, promises and research problems

Frequent Pattern Mining Methods
• Apriori and its variations/improvements
• Mining frequent patterns without candidate generation
• Mining max-patterns and closed itemsets
• Mining multi-dimensional, multi-level frequent patterns with flexible support constraints
• Interestingness: correlation and causality

Data Representation
• Transactional vs. binary

  Transactional            Binary
  TID   Items              TID   a  b  c  d  e
  10    a, c, d            10    1  0  1  1  0
  20    b, c, e            20    0  1  1  0  1
  30    a, b, c, e         30    1  1  1  0  1
  40    b, e               40    0  1  0  0  1

• Horizontal vs. vertical: the transactional table above is the horizontal layout; the vertical layout lists the TIDs for each item

  Item   TIDs
  a      10, 30
  b      20, 30, 40
  c      10, 20, 30
  d      10
  e      20, 30, 40

Apriori: A Candidate Generation-and-Test Approach
• Apriori is a seminal algorithm proposed by R. Agrawal & R. Srikant [VLDB'94]
• Apriori consists of two phases:
  – Generate length-(k+1) candidate itemsets from length-k frequent itemsets
    • Join step
    • Prune step
  – Test the candidates against the DB

Apriori-Based Mining
• Method:
  – Initially, scan the DB once to get the frequent 1-itemsets
  – Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  – Test the candidates against the DB
  – Terminate when no frequent or candidate set can be generated

Apriori Property
• Apriori pruning property: if an itemset is infrequent, none of its supersets can be frequent, so no superset of an infrequent itemset should be generated or tested
  – Many item combinations can be pruned!
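
The generate-and-test loop and the pruning property can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `apriori` and the variable names are hypothetical, the join step unions any two frequent k-itemsets instead of using the ordered prefix join, and supports are counted by a plain scan with no hash tree. The data is the four-transaction database TDB used in this deck's Apriori example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets (sketch)."""
    transactions = [frozenset(t) for t in transactions]

    # First scan: count single items and keep the frequent ones (L1)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 1
    while current:
        candidates = set()
        # Join step (simplified): union two frequent k-itemsets into a (k+1)-itemset
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune step: every k-subset of a candidate must itself be frequent
            if all(frozenset(s) in current for s in combinations(union, k)):
                candidates.add(union)
        # Test step: one pass over the database to count the candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(tdb, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Note that {A, B, C} and {A, C, E} are never counted: each has an infrequent 2-subset, so the prune step drops them before the database scan.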

Illustrating Apriori Principle
• The whole process of frequent pattern mining can be seen as a search in the itemset lattice
[Figure: itemset lattice over items A to E, from null down to ABCDE; once an itemset is found to be infrequent, all of its supersets are pruned]

Apriori Algorithm: An Example
• sup_min = 2

  Database TDB
  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E

  1st scan, C1:   {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
            L1:   {A}:2, {B}:3, {C}:3, {E}:3
  C2 (from L1):   {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
  2nd scan, C2:   {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
            L2:   {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2
  C3 (from L2):   {B, C, E}
  3rd scan, L3:   {B, C, E}:2

The Apriori Algorithm
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk;

Important Details of Apriori
• How to generate candidates?
  – Step 1: self-joining Lk
  – Step 2: pruning
• How to count supports of candidates?

How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck

How to Count Supports of Candidates?
• Why is counting supports of candidates a problem?
  – The total number of candidates can be huge
  – One transaction may contain many candidates
• Method:
  – Candidate itemsets are stored in a hash tree
  – A leaf node of the hash tree contains a list of itemsets and counts
  – An interior node contains a hash table
  – Subset function: finds all the candidates contained in a transaction

Counting Supports of Candidates Using Hash Tree
• Suppose you have 15 candidate itemsets of length 3:
  {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
• You need:
  – A hash function
  – A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)

Generate Hash Tree
• Hash function: items 1, 4, 7 go to the left branch; 2, 5, 8 to the middle branch; 3, 6, 9 to the right branch
• Hash the 15 candidates on their first item, then split nodes with more than 3 candidates using the second item, then split the remaining overfull nodes using the third item
• Resulting tree (9 leaves):

  root (hash on 1st item)
  ├─ 1,4,7 → (hash on 2nd item)
  │    ├─ 1,4,7 → {1 4 5}
  │    ├─ 2,5,8 → (hash on 3rd item)
  │    │    ├─ 1,4,7 → {1 2 4}, {4 5 7}
  │    │    ├─ 2,5,8 → {1 2 5}, {4 5 8}
  │    │    └─ 3,6,9 → {1 5 9}
  │    └─ 3,6,9 → {1 3 6}
  ├─ 2,5,8 → {2 3 4}, {5 6 7}
  └─ 3,6,9 → (hash on 2nd item)
       ├─ 1,4,7 → {3 4 5}
       ├─ 2,5,8 → {3 5 6}, {3 5 7}, {6 8 9}
       └─ 3,6,9 → {3 6 7}, {3 6 8}

Subset Operation
• Given a (lexicographically ordered) transaction t, say {1, 2, 3, 5, 6}, how can we enumerate the possible subsets of size 3?
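
As a small aside that is not part of the original slides, the enumeration asked for here can be reproduced with Python's standard `itertools.combinations`. The variable names are illustrative; grouping by first item is only meant to mirror the level-by-level prefix expansion used in the deck.

```python
from itertools import combinations

transaction = (1, 2, 3, 5, 6)  # lexicographically ordered transaction t

# For sorted input, combinations() emits the subsets in lexicographic order
subsets_of_3 = list(combinations(transaction, 3))

# Group the subsets by their first item (prefixes 1+, 2+ and 3+)
by_first_item = {}
for subset in subsets_of_3:
    by_first_item.setdefault(subset[0], []).append(subset)

for first, group in by_first_item.items():
    print(first, group)
```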
  Transaction t: 1 2 3 5 6
  Level 1 (fix the first item):    1 + {2 3 5 6},  2 + {3 5 6},  3 + {5 6}
  Level 2 (fix the second item):   12 + {3 5 6},  13 + {5 6},  15 + {6},  23 + {5 6},  25 + {6},  35 + {6}
  Level 3 (subsets of 3 items):    123, 125, 126, 135, 136, 156, 235, 236, 256, 356

Subset Operation Using Hash Tree
• At the root, the transaction 1 2 3 5 6 is hashed on each possible first item: 1 + {2 3 5 6}, 2 + {3 5 6}, 3 + {5 6}
• In the subtree reached through item 1, it is hashed again on each possible second item: 12 + {3 5 6}, 13 + {5 6}, 15 + {6}
• The transaction is matched against 11 out of 15 candidates
[Figure: the hash tree from the previous slides, with the branches visited for transaction 1 2 3 5 6 highlighted]

How the Hash Tree Works
• Suppose t = {1, 2, 3, 4, 5}
• First, all 3-itemsets contained in t must begin with 1, 2 or 3
• Therefore, at the root we must hash on 1, 2 and 3 separately
• Once we reach a child of the root, we hash again, and repeat the process until the algorithm reaches the leaves
• At a leaf, check whether each candidate in the leaf is a subset of the transaction and increment its count if it is
• In the example, 6 of the 9 leaf nodes are visited and 11 of the 15 itemsets are matched

Generating Association Rules From Frequent Itemsets
• For each frequent itemset l, generate all nonempty proper subsets of l
• For every nonempty proper subset s of l, output the rule s ⇒ (l − s), i.e., A ⇒ B with A = s and B = l − s, with confidence
    c = sup_count(A ∪ B) / sup_count(A) = sup_count(l) / sup_count(s)

  TID   Item_IDs
  T10   I1, I2, I5
  T20   I2, I4
  T30   I2, I3
  T40   I1, I2, I4
  T50   I1, I3
  T60   I2, I3
  T70   I1, I3
  T80   I1, I2, I3, I5
  T90   I1, I2, I3

  – Example: suppose l = {I1, I2, I5}. The nonempty proper subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.
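
The confidence computation for this example can be sketched in Python as follows. The helper names `sup_count` and `rules_from` are hypothetical and not taken from the slides; the transactions are T10 to T90 from the table above, and no minimum-confidence filter is applied.

```python
from itertools import combinations

# Transactions T10..T90 from the table above
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def sup_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from(l):
    """Map each rule (s, l - s) to its confidence sup_count(l) / sup_count(s)."""
    l = frozenset(l)
    rules = {}
    for size in range(1, len(l)):            # nonempty proper subsets only
        for s in combinations(sorted(l), size):
            s = frozenset(s)
            rules[(s, l - s)] = sup_count(l) / sup_count(s)
    return rules

rules = rules_from({"I1", "I2", "I5"})
for (lhs, rhs), conf in rules.items():
    print(sorted(lhs), "=>", sorted(rhs), f"{conf:.0%}")
```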
    The association rules are:
      I1 ∧ I2 ⇒ I5    c = 2/4 = 50%
      I1 ∧ I5 ⇒ I2    c = 2/2 = 100%
      I2 ∧ I5 ⇒ I1    c = 2/2 = 100%
      I1 ⇒ I2 ∧ I5    c = 2/6 = 33%
      I2 ⇒ I1 ∧ I5    c = 2/7 = 29%
      I5 ⇒ I1 ∧ I2    c = 2/2 = 100%

Efficient Implementation of Apriori in SQL
• It is hard to get good performance out of pure SQL (SQL-92) based approaches alone
• Make use of object-relational extensions such as UDFs, BLOBs, table functions, etc.
  – These give orders-of-magnitude improvements
• S. Sarawagi, S. Thomas, and R. Agrawal, 1998

Challenges of Frequent Itemset Mining
• The core of the Apriori algorithm
  – Use frequent (k–1)-itemsets to generate candidate frequent k-itemsets
  – Use database scans to collect counts for the candidate itemsets
• Challenges
  – Multiple scans of the transaction database are costly
    • Needs (n + 1) scans, where n is the length of the longest pattern
  – Huge number of candidates, especially when the support threshold is set low
    • 10^4 frequent 1-itemsets will generate more than 10^7 candidate 2-itemsets
    • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
  – Tedious workload of support counting for candidates

Outline
• Methods for improving Apriori
• An interesting approach: FP-growth

Methods to Improve Apriori's Efficiency
• Improving Apriori: general ideas
  – Reduce passes of transaction database scans
  – Shrink the number of candidates
  – Facilitate support counting of candidates

DIC: Reduce Number of Scans
• DIC (Dynamic Itemset Counting) tries to reduce the number of passes over the database by dividing the database into intervals of a specific size
• Intuitively, DIC works like a train running over the data with stops at intervals M transactions apart (M is a parameter)
• S. Brin, R. Motwani, J. Ullman, and S. Tsur. "Dynamic itemset counting and implication rules for market basket data".
  In SIGMOD'97

DIC: Reduce Number of Scans
• Candidate 1-itemsets are generated
• Once both A and D are determined frequent, the counting of AD begins
• Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: itemset lattice from {} to ABCD, with a timeline over the transactions comparing Apriori (1-itemsets, then 2-itemsets, … in separate scans) and DIC (counting of 1-, 2- and 3-itemsets overlapping within a scan)]

DIC: An Example
• A transaction database TDB with 40,000 transactions; support threshold = 100; M = 10,000
  – If itemsets a and b get support counts greater than 100 in the first 10,000 transactions, DIC will start counting the 2-itemset ab after the first 10,000 transactions
  – Similarly, if ab, ac and bc are contained in at least 100 transactions among the second 10,000 transactions, DIC will start counting the 3-itemset abc after 20,000 transactions
  – Once DIC gets to the end of the transaction database TDB, it will stop counting the 1-itemsets and go back to the start of the database to continue counting the 2- and 3-itemsets
  – After the first 10,000 transactions of this second pass, DIC will finish counting ab, and after 20,000 transactions, it will finish counting abc
• By overlapping the counting of itemsets of different lengths, DIC can save some database scans

DHP: Reduce the Number of Candidates
• DHP (Direct Hashing and Pruning) reduces the number of candidate itemsets
• J. Park, M. Chen, and P. Yu. "An effective hash-based algorithm for mining association rules".
  In SIGMOD'95

DHP: Reduce the Number of Candidates
• In the k-th scan, DHP counts not only length-k candidates, but also buckets of length-(k+1) potential candidates
• A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  – Candidates: a, b, c, d, e
  – Hash entries: {ab, ad, ae}, {bd, be, de}, …
  – Frequent 1-itemsets: a, b, d, e
  – ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold

Compare Apriori & DHP
[Figure not recoverable]

DHP: Database Trimming
[Figure not recoverable]

Example: DHP
[Figures not recoverable]

Partition: A Two Scan Method
• Partition requires just two database scans to mine the frequent itemsets
• A. Savasere, E. Omiecinski, and S. Navathe. "An efficient algorithm for mining association rules in large databases". In VLDB'95

A Two Scan Method: Partition
• Partition the database into n partitions, such that each partition can be held in main memory
• Itemset X is frequent ⇒ X must be frequent in at least one partition
  – Scan 1: partition the database and find local frequent patterns
  – Scan 2: consolidate global frequent patterns
• Can all local frequent itemsets be held in main memory? This is sometimes too strong an assumption

Partitioning
[Figure not recoverable]

Sampling for Frequent Patterns
• Sampling selects a sample of the original database and mines frequent patterns within the sample using Apriori
• H. Toivonen. "Sampling large databases for association rules". In VLDB'96

Sampling for Frequent Patterns
• Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
  – Example: check abcd instead of ab, ac, …, etc.
• Scan the database again to find missed frequent patterns
• Trades off some degree of accuracy against efficiency

Eclat
• Eclat uses the vertical database layout and an intersection-based approach to compute the support of an itemset
• M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. "New algorithms for fast discovery of association rules". In KDD'97

Eclat: An Example
• Transform the horizontally formatted data to the vertical format

  Horizontal Data Layout
  TID   Item_IDs
  T10   I1, I2, I5
  T20   I2, I4
  T30   I2, I3
  T40   I1, I2, I4
  T50   I1, I3
  T60   I2, I3
  T70   I1, I3
  T80   I1, I2, I3, I5
  T90   I1, I2, I3

  Vertical Data Layout
  Itemset   TID_set
  I1        {T10, T40, T50, T70, T80, T90}
  I2        {T10, T20, T30, T40, T60, T80, T90}
  I3        {T30, T50, T60, T70, T80, T90}
  I4        {T20, T40}
  I5        {T10, T80}

Eclat: An Example
• The frequent k-itemsets can be used to construct the candidate (k+1)-itemsets
• Determine the support of any (k+1)-itemset by intersecting the tid-lists of two of its k-subsets

  2-itemsets
  Itemset     TID_set
  {I1, I2}    {T10, T40, T80, T90}
  {I1, I3}    {T50, T70, T80, T90}
  {I1, I4}    {T40}
  {I1, I5}    {T10, T80}
  {I2, I3}    {T30, T60, T80, T90}
  {I2, I4}    {T20, T40}
  {I2, I5}    {T10, T80}
  {I3, I5}    {T80}

  3-itemsets
  Itemset         TID_set
  {I1, I2, I3}    {T80, T90}
  {I1, I2, I5}    {T10, T80}

• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too large for memory

Apriori-like Advantage
• Uses the large itemset property
• Easily parallelized
• Easy to implement

Apriori-Like Bottleneck
• Multiple database scans are costly
• Mining long patterns needs many passes of scanning and generates lots of candidates
  – To find the frequent itemset i1 i2 … i100
    • # of scans: 100
    • # of candidates: C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30 !
• Bottleneck: candidate generation and test
• Can we avoid candidate generation?
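
The candidate count quoted above can be checked directly. The short Python sketch below is not part of the original slides; it sums the binomial coefficients with the standard-library `math.comb` (Python 3.8 or later) and compares the result with 2^100 − 1.

```python
import math

# Number of nonempty sub-itemsets of a frequent 100-itemset:
# C(100, 1) + C(100, 2) + ... + C(100, 100)
total = sum(math.comb(100, k) for k in range(1, 101))

# The sum of all binomial coefficients is 2^100; dropping the empty set gives 2^100 - 1
assert total == 2**100 - 1

print(f"{total:.3e}")  # prints 1.268e+30
```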

Mining Frequent Patterns Without Candidate Generation
• Grow long patterns from short ones using local frequent items
  – "abc" is a frequent pattern
  – Get all transactions having "abc": DB|abc
  – "d" is a local frequent item in DB|abc ⇒ abcd is a frequent pattern

Compress Database by FP-tree
• min_support = 3

  TID   Items bought                  (Ordered) frequent items
  100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
  200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
  300   {b, f, h, j, o, w}            {f, b}
  400   {b, c, k, s, p}               {c, b, p}
  500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

1. Scan the DB once, find the frequent 1-itemsets (single item patterns)
2. Sort the frequent items in frequency-descending order, giving the F-list: F-list = f-c-a-b-m-p
3. Scan the DB again, construct the FP-tree

  Header Table
  Item   frequency
  f      4
  c      4
  a      3
  b      3
  m      3
  p      3

• The tree is built by inserting the ordered transactions one at a time; each header-table entry also holds the head of the node-links for its item
  – After TID 100: {} → f:1 → c:1 → a:1 → m:1 → p:1
  – After TID 200: {} → f:2 → c:2 → a:2, which branches into m:1 → p:1 and b:1 → m:1
  – After all five transactions:

  {}
  ├─ f:4
  │   ├─ c:3
  │   │   └─ a:3
  │   │       ├─ m:2
  │   │       │   └─ p:2
  │   │       └─ b:1
  │   │           └─ m:1
  │   └─ b:1
  └─ c:1
      └─ b:1
          └─ p:1

Benefits of the FP-tree
• Completeness
  – Preserve complete
    information for frequent pattern mining
  – Never break a long pattern of any transaction
• Compactness
  – Reduce irrelevant information: infrequent items are gone
  – Items are in frequency-descending order: the more frequently an item occurs, the more likely it is to be shared
  – Never larger than the original database (not counting node-links and the count fields)
  – For the Connect-4 DB, the compression ratio could be over 100

Partition Patterns and Databases
• Frequent patterns can be partitioned into subsets according to the F-list f-c-a-b-m-p
  – Patterns containing p
  – Patterns having m but no p
  – …
  – Patterns having c but none of a, b, m, or p
  – Pattern f
• The partitioning is complete and does not have any overlap

Find Patterns Having p From the p-conditional Database
• Start at the frequent item header table in the FP-tree
• Traverse the FP-tree by following the node-links of each frequent item p
• Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base
[Figure: the FP-tree and header table built on the previous slides]

  Conditional pattern bases
  item   cond. pattern base
  c      f:3
  a      fc:3
  b      fca:1, f:1, c:1
  m      fca:2, fcab:1
  p      fcam:2, cb:1

From Conditional Pattern-bases to Conditional FP-trees
• For each pattern base
  – Accumulate the count for each item in the base
  – Construct the FP-tree for the frequent items of the pattern base
• p-conditional pattern base: fcam:2, cb:1
• p-conditional FP-tree: {} → c:3
• All frequent patterns related to p: p, pc

Recursive Mining
• Patterns having m but no p can be mined recursively
• m-conditional pattern base: fca:2, fcab:1
• m-conditional FP-tree: {} → f:3 → c:3 → a:3
• All frequent patterns related to m: m, fm, cm, am, fcm, fam, cam, fcam

Optimization
• Optimization: enumerate patterns from a single-branch FP-tree
  – Enumerate all combinations
  – Support = that of the last item
• For the m-conditional FP-tree {} → f:3 → c:3 → a:3:
  – m, fm, cm, am
  – fcm, fam, cam
  – fcam

A Special Case: Single Prefix Path in FP-tree
• A (projected) FP-tree has a single prefix path P
  – Reduce the single prefix path into one node
  – Join the mining results of the two parts; the prefix part is mined by enumerating all the combinations of the sub-paths of P
[Figure: an FP-tree whose single prefix path a1:n1 → a2:n2 → a3:n3 leads into a branching part (b1:m1, C1:k1, C2:k2, C3:k3); it is split into the prefix path and the branching part rooted at a new node r1]

FP-Growth
• Idea: frequent pattern growth
  – Recursively grow frequent patterns by pattern and database partition
• Method
  – For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  – Repeat the process on each newly created conditional FP-tree
  – Stop when the resulting FP-tree is empty or contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
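
The recursion described in the method above can be sketched in Python. This is a deliberately simplified illustration and not the authors' FP-growth implementation: conditional pattern bases are kept as plain lists of (items, count) pairs instead of a compressed FP-tree with node-links, the single-path shortcut is omitted, and the names `pattern_growth` and `deck_db` are hypothetical. The data is the five-transaction database from the FP-tree slides with min_support = 3.

```python
from collections import Counter

def pattern_growth(pattern_base, min_support, suffix=frozenset()):
    """Return {itemset: support} for all frequent itemsets ending in `suffix`.

    pattern_base is a list of (items, count) pairs, i.e. a conditional
    pattern base stored as a list rather than as an FP-tree.
    """
    # Count local item frequencies and keep the locally frequent items
    counts = Counter()
    for items, count in pattern_base:
        for item in set(items):
            counts[item] += count
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    # Local F-list: frequency-descending order (ties broken by item name)
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    results = {}
    for item in reversed(order):              # least frequent item first
        new_pattern = suffix | {item}
        results[new_pattern] = frequent[item]
        # Conditional pattern base of `item`: for each path containing it,
        # keep only the frequent items that precede it in the F-list
        conditional = []
        for items, count in pattern_base:
            if item in items:
                prefix = tuple(i for i in items if i in rank and rank[i] < rank[item])
                if prefix:
                    conditional.append((prefix, count))
        results.update(pattern_growth(conditional, min_support, new_pattern))
    return results

deck_db = [
    (tuple("facdgimp"), 1), (tuple("abcflmo"), 1), (tuple("bfhjow"), 1),
    (tuple("bcksp"), 1), (tuple("afcelpmn"), 1),
]
patterns = pattern_growth(deck_db, min_support=3)
for itemset, support in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), support)
```

On this data the sketch reproduces the patterns listed for m (m, fm, cm, am, fcm, fam, cam, fcam) and for p (p, pc), each with support 3.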

Scaling Up FP-growth by Database Projection
• What if the FP-tree cannot fit in memory? Use database projection
  – Partition a database into a set of projected databases
  – Construct and mine an FP-tree for each projected database
• Heuristic: the projected database shrinks quickly in many applications
  – Such a process can be recursively applied to any projected database if its FP-tree still cannot fit in main memory

Partition-based Projection
• Parallel projection needs a lot of disk space
• Partition projection saves it

  Tran. DB:     fcamp, fcabm, fb, cbp, fcamp
  p-proj DB:    fcam, cb, fcam
  m-proj DB:    fcab, fca, fca
  b-proj DB:    f, cb, …
  a-proj DB:    fc, …
  c-proj DB:    f, …
  f-proj DB:    …
  am-proj DB:   fc, fc, fc
  cm-proj DB:   f, f, f
  …

FP-Growth vs. Apriori: Scalability With the Support Threshold
• Data set T25I20D10K: the average transaction size and average maximal potentially frequent itemset size are set to 25 and 20, respectively, while the number of transactions in the dataset is set to 10K [AS94]
[Figure: run time (sec.) versus support threshold (%) for D1 FP-growth and D1 Apriori]

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold
• Data set T25I20D100K
[Figure: runtime (sec.) versus support threshold (%) for D2 FP-growth and D2 TreeProjection]

Why Is FP-Growth Efficient?
• Divide-and-conquer:
  – Decompose both the mining task and the DB according to the frequent patterns obtained so far
  – Leads to focused search of smaller databases
• Other factors
  – No candidate generation, no candidate test
  – Compressed database: FP-tree structure
  – No repeated scan of the entire database
  – Basic operations are counting local frequent items and building sub FP-trees; no pattern search and matching

Major Costs in FP-Growth
• Poor locality of FP-trees
  – Low cache hit rate
• Building FP-trees
  – A stack of FP-trees
• Redundant information
  – Transaction abcd appears in the a-, ab-, abc-, ac-, …, c-projected databases and FP-trees
• Can we avoid the redundancy?

Implications of the Methodology
• Mining closed frequent itemsets and max-patterns
  – CLOSET (DMKD'00)
• Constraint-based mining of frequent patterns
  – Convertible constraints (KDD'00, ICDE'01)
• Computing iceberg data cubes with complex measures
  – H-tree and H-cubing algorithm (SIGMOD'01)

Closed Frequent Itemsets
• An itemset X is closed if none of its immediate supersets has the same support as X
• An itemset X is not closed if at least one of its immediate supersets has the same support count as X
  – For example
    • Database: {(1,2,3,4), (1,2,3,4,5,6)}
    • Itemset (1,2) is not a closed itemset
    • Itemset (1,2,3,4) is a closed itemset
• An itemset is a closed frequent itemset if it is closed and its support satisfies the support threshold

Benefits of Closed Frequent Itemsets
• Reduces the number of redundant patterns generated
  – A frequent itemset {a1, a2, …, a100} contains C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30 frequent itemsets!
• It has the same power as frequent itemset mining
• It improves not only the efficiency but also the effectiveness of mining

Mining Closed Frequent Itemsets (Ⅰ)
• Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
  – For example, the projected conditional database for prefix itemset {I5:2} is {{I2, I1}, {I2, I1, I3}}. Itemset {I2, I1} can be merged with {I5} to form the closed itemset {I5, I2, I1 : 2}
• Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all of X's descendants in the set enumeration tree can be pruned
  – For example, suppose a transaction database contains the two transactions ⟨a1, a2, …, a100⟩ and ⟨a1, a2, …, a50⟩, with min_sup = 2. The projection on the item a1 is ⟨a1, a2, …, a50 : 2⟩. Thus the mining of closed frequent itemsets in this data set terminates after mining a1's projected database.

Mining Closed Frequent Itemsets (Ⅱ)
• Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at higher levels
  – For example, take the transaction database ⟨a1, a2, …, a100⟩, ⟨a1, a2, …, a50⟩ with min_sup = 2. Because a2 in a1's projected database has the same support as a2 in the global header table, a2 can be pruned from the global header table.
• Efficient subset checking: closure checking
  – Superset checking: checks whether the new frequent itemset is a superset of some already found closed itemset with the same support
  – Subset checking

Mining Closed Frequent Itemsets
• J. Pei, J. Han & R. Mao. "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets". DMKD'00.

Maximal Frequent Itemsets
• An itemset is maximal frequent if none of its immediate supersets is frequent
• Despite providing a compact representation, maximal frequent itemsets do not contain the support information of their subsets.
  – For example, the supports of the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} do not provide any hint about the supports of their subsets.
• An additional pass over the data set is therefore needed to determine the support counts of the non-maximal frequent itemsets.
• It might be desirable to have a minimal representation of frequent itemsets that preserves the support information.
  – Such a representation is the set of closed frequent itemsets.

Maximal vs Closed Itemsets
All maximal frequent itemsets are closed because none of the maximal frequent itemsets can have the same support count as their immediate supersets.
[Figure: nested sets; Maximal Frequent Itemsets lie inside Closed Frequent Itemsets, which lie inside Frequent Itemsets]

MaxMiner: Mining Max-patterns
Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F
• 1st scan: find frequent items
  – A, B, C, D, E
• 2nd scan: find support for
  – AB, AC, AD, AE, ABCDE
  – BC, BD, BE, BCDE
  – CD, CE, CDE, DE
  – Potential max-patterns: ABCDE, BCDE, CDE
• Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
• R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98

Further Improvements of Mining Methods
• AFOPT (Liu, et al. [KDD'03])
  – A "push-right" method for mining condensed frequent pattern (CFP) tree
• Carpenter (Pan, et al. [KDD'03])
  – Mines data sets with few rows but numerous columns
  – Constructs a row-enumeration tree for efficient mining

Mining Various Kinds of Association Rules
• Mining multilevel association
• Mining multidimensional association
• Mining quantitative association
• Mining interesting correlation patterns

Multiple-Level Association Rules
• Items often form hierarchies
TID   Items Purchased
1     IBM-ThinkPad-R40/P4M, Symantec-Norton-Antivirus-2003
2     Microsoft-Office-Professional-2003, Microsoft-
3     logiTech-Mouse, Fellows-Wrist-Rest
…     …
[Figure: concept hierarchy]
Level 0: all
Level 1: Computer, Software, Printer & Camera, Accessory
Level 2: laptop, desktop, office, antivirus, printer, camera, mouse, pad
Level 3: IBM, Dell, Microsoft

Multiple-Level Association Rules
• Flexible support settings
  – Items at the lower level are expected to have lower support
• Exploration of shared multi-level mining (Agrawal & Srikant [VLDB'95], Han & Fu [VLDB'95])
Level     Items                                              Uniform support   Reduced support
Level 1   Milk [support = 10%]                               min_sup = 5%      min_sup = 5%
Level 2   2% Milk [support = 6%], Skim Milk [support = 4%]   min_sup = 5%      min_sup = 3%

Multi-level Association: Redundancy Filtering
• Some rules may be redundant due to "ancestor" relationships between items.
  – Example
    • laptop computer ⇒ HP printer [support = 8%, confidence = 70%]
    • IBM laptop computer ⇒ HP printer [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.

Multi-Dimensional Association
• Single-dimensional rules: buys(X, "computer") ⇒ buys(X, "printer")
• Multi-dimensional rules: ≥ 2 dimensions or predicates
  – Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  – Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
• Categorical attributes: finite number of possible values, no ordering among values; data cube approach
• Quantitative attributes: numeric, implicit ordering among values; discretization, clustering, and gradient approaches

Multi-Dimensional Association
Techniques can be categorized by how numerical attributes, such as age or salary, are treated
1. Quantitative attributes are discretized using predefined concept hierarchies
  – Static and predetermined
    • A concept hierarchy for income, such as "0…20k", "21k…30k", and so on.
2. Quantitative attributes are discretized or clustered into "bins" based on the distribution of the data
  – Dynamic; referred to as quantitative association rules

Quantitative Association Rules
• Proposed by Lent, Swami and Widom, ICDE'97
• Numeric attributes are dynamically discretized
  – Such that the confidence or compactness of the rules mined is maximized
• 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
• Example

Quantitative Association Rules
• ARCS (association rule clustering system): cluster adjacent association rules to form general rules using a 2-D grid
  – Binning: partition the ranges of quantitative attributes into intervals
    • Equal-width
    • Equal-frequency
    • Clustering-based
  – Finding frequent predicate sets: once the 2-D array containing the count distribution for each category is set up, it can be scanned to find the frequent predicate sets
  – Clustering the association rules: age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")

Correlation Analysis
min_sup: 30%, min_conf: 60%
• play basketball ⇒ eat cereal [40%, 66%] is misleading
  – The overall percentage of students eating cereal is 75% > 66%.
• play basketball ⇒ not eat cereal [20%, 34%] is more accurate, although with lower support and confidence
• Measure of dependent/correlated events: lift
  $\mathrm{lift}(A, B) = \frac{P(A \cap B)}{P(A)\,P(B)}$

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000

  $\mathrm{lift}(B, C) = \frac{2000/5000}{(3000/5000)(3750/5000)} = 0.89$
  $\mathrm{lift}(B, \neg C) = \frac{1000/5000}{(3000/5000)(1250/5000)} = 1.33$

Outline
• What is association rule mining and frequent pattern mining?
• Methods for frequent-pattern mining
• Constraint-based frequent-pattern mining
• Frequent-pattern mining: achievements, promises and research problems

Constraint-based (Query-Directed) Mining
• Finding all the patterns in a database autonomously? Unrealistic!
  – The patterns could be too many but not focused!
• Data mining should be an interactive process
  – The user directs what is to be mined using a data mining query language (or a graphical user interface)
• Constraint-based mining
  – User flexibility: provides constraints on what is to be mined
  – System optimization: explores such constraints for efficient mining (constraint-based mining)

Constraints
• Constraints can be classified into five categories:
  – Anti-monotone
  – Monotone
  – Succinct
  – Convertible
  – Inconvertible

Anti-Monotone in Constraint Pushing
• Anti-monotone
  – When an itemset S violates the constraint, so does any of its supersets
  – sum(S.Price) ≤ v is anti-monotone
  – sum(S.Price) ≥ v is not anti-monotone
• Example. C: range(S.profit) ≤ 15 is anti-monotone
  – Itemset ab violates C
  – So does every superset of ab

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10

Monotone for Constraint Pushing
(TDB and item profits as in the tables above)
• Monotone
  – When an itemset S satisfies the constraint, so does any of its supersets
  – sum(S.Price) ≥ v is monotone
  – min(S.Price) ≤ v is monotone
• Example. C: range(S.profit) ≥ 15
  – Itemset ab satisfies C
  – So does every superset of ab

Succinctness
• Succinctness:
  – Given A1, the set of items satisfying a succinctness constraint C, then any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
  – Idea: without looking at the transaction database, whether an itemset S satisfies constraint C can be determined based on the selection of items
  – min(S.Price) ≤ v is succinct
  – sum(S.Price) ≥ v is not succinct
• Optimization: if C is succinct, C is pre-counting pushable

Converting "Tough" Constraints
(TDB and item profits as in the tables above)
• Convert tough constraints into anti-monotone or monotone by properly ordering items
• Examine C: avg(S.profit) ≥ 25
  – Order items in value-descending order
    • <a, f, g, d, b, h, c, e>
  – If an itemset afb violates C
    • So does afbh, afb*
    • It becomes anti-monotone!

Strongly Convertible Constraints
(Item profits as in the table above)
• avg(X) ≥ 25 is convertible anti-monotone w.r.t. item value descending order R: <a, f, g, d, b, h, c, e>
  – If an itemset af violates a constraint C, so does every itemset with af as prefix, such as afd
• avg(X) ≥ 25 is convertible monotone w.r.t. item value ascending order R^-1: <e, c, h, b, d, g, f, a>
  – If an itemset d satisfies a constraint C, so do itemsets df and dfa, which have d as a prefix
• Thus, avg(X) ≥ 25 is strongly convertible

Can Apriori Handle Convertible Constraint?
(Item values as in the profit table above)
• A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
  – Within the level-wise framework, no direct pruning based on the constraint can be made
  – Itemset df violates constraint C: avg(X) ≥ 25
  – Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
• But it can be pushed into the frequent-pattern growth framework!

Mining With Convertible Constraints
• C: avg(X) ≥ 25, min_sup = 2
• List items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
  – C is convertible anti-monotone w.r.t. R
• Scan TDB once
  – Remove infrequent items
    • Item h is dropped
  – Itemsets a and f are good, …
• Projection-based mining
  – Impose an appropriate order on item projection
  – Many tough constraints can be converted into (anti-)monotone constraints

Item   Value
a      40
f      30
g      20
d      10
b      0
h      -10
c      -20
e      -30

TDB (min_sup = 2)
TID   Transaction
10    a, f, d, b, c
20    f, g, d, b, c
30    a, f, d, c, e
40    f, g, h, c, e

Handling Multiple Constraints
• Different constraints may require different or even conflicting item orderings
• If there exists an order R s.t. both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints
• If there is a conflict in the order of items
  – Try to satisfy one constraint first
  – Then use the order for the other constraint to mine frequent itemsets in the corresponding projected database

What Constraints Are Convertible?
Constraint                                          Convertible anti-monotone   Convertible monotone   Strongly convertible
avg(S) ≤, ≥ v                                       Yes                         Yes                    Yes
median(S) ≤, ≥ v                                    Yes                         Yes                    Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)     Yes                         No                     No
sum(S) ≤ v (items could be of any value, v ≤ 0)     No                          Yes                    No
sum(S) ≥ v (items could be of any value, v ≥ 0)     No                          Yes                    No
sum(S) ≥ v (items could be of any value, v ≤ 0)     Yes                         No                     No
……

Constraint-Based Mining: A General Picture
Constraint                      Antimonotone   Monotone      Succinct
v ∈ S                           no             yes           yes
S ⊇ V                           no             yes           yes
S ⊆ V                           yes            no            yes
min(S) ≤ v                      no             yes           yes
min(S) ≥ v                      yes            no            yes
max(S) ≤ v                      yes            no            yes
max(S) ≥ v                      no             yes           yes
count(S) ≤ v                    yes            no            weakly
count(S) ≥ v                    no             yes           weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0)      yes            no            no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)      no             yes           no
range(S) ≤ v                    yes            no            no
range(S) ≥ v                    no             yes           no
avg(S) θ v, θ ∈ {=, ≤, ≥}       convertible    convertible   no
support(S) ≥ ξ                  yes            no            no
support(S) ≤ ξ                  no             yes           no

A Classification of Constraints
[Figure: diagram of constraint classes: Antimonotone, Monotone, Succinct, Strongly convertible, Convertible anti-monotone, Convertible monotone, Inconvertible]

Outline
• What is association rule mining and frequent pattern mining?
• Methods for frequent-pattern mining
• Constraint-based frequent-pattern mining
• Frequent-pattern mining: achievements, promises and research problems

Frequent-Pattern Mining: Summary
• Frequent pattern mining: an important task in data mining
• Scalable frequent pattern mining methods
  – Apriori (candidate generation & test)
  – Projection-based (FPgrowth, CLOSET+, ...)
  – Vertical format approach (CHARM, ...)
• Mining a variety of rules and interesting patterns
• Constraint-based mining
• Mining sequential and structured patterns
• Extensions and applications

Frequent-Pattern Mining: Research Problems
• Mining fault-tolerant frequent, sequential and structured patterns
  – Patterns allow limited faults (insertion, deletion, mutation)
• Mining truly interesting patterns
  – Surprising, novel, concise, …
• Application exploration
  – E.g., DNA sequence analysis and bio-pattern classification
  – "Invisible" data mining

Assignment (Ⅰ)
• A database has five transactions. Suppose min_sup = 60% and min_conf = 80%.
TID   Items_list
T1    {m, o, n, k, e, y}
T2    {d, o, n, k, e, y}
T3    {m, a, k, e}
T4    {m, u, c, k, y}
T5    {c, o, k, i, e}
  – Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes.
  – List all of the strong association rules

Assignment (Ⅱ)
• Frequent itemset mining often generates a huge number of frequent itemsets. Discuss effective methods that can be used to reduce the number of frequent itemsets while still preserving most of the information.
• The price of each item in a store is nonnegative. The store manager is only interested in rules of the form: "one free item may trigger $200 total purchases in the same transaction." State how to mine such rules efficiently.

Thank you!
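
Worked sketch: closed vs maximal frequent itemsets
The definitions of closed and maximal frequent itemsets given in the slides can be checked by brute force on the small example database {(1,2,3,4), (1,2,3,4,5,6)}. The Python sketch below is only an illustration under stated assumptions: the function names (support_counts, is_closed, is_maximal) and the threshold min_sup = 1 are hypothetical choices made for this example, and the exhaustive enumeration is not CLOSET or MaxMiner; it is exponential in the number of items and only suitable for tiny inputs.

```python
from itertools import combinations


def support_counts(transactions):
    """Return {itemset: support count} for every itemset that occurs at least once."""
    items = sorted(set().union(*transactions))
    counts = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            n = sum(1 for t in transactions if candidate <= t)
            if n:
                counts[candidate] = n
    return counts


def is_closed(itemset, counts, all_items):
    """Closed: no immediate superset has the same support count."""
    return all(counts.get(itemset | {i}, 0) != counts[itemset]
               for i in all_items - itemset)


def is_maximal(itemset, counts, all_items, min_sup):
    """Maximal frequent: frequent, and no immediate superset is frequent."""
    return counts[itemset] >= min_sup and all(
        counts.get(itemset | {i}, 0) < min_sup for i in all_items - itemset)


# Example database from the "Closed Frequent Itemsets" slide.
db = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 4, 5, 6})]
all_items = frozenset().union(*db)
counts = support_counts(db)

min_sup = 1  # hypothetical absolute support threshold, chosen for illustration
closed = {s for s in counts
          if counts[s] >= min_sup and is_closed(s, counts, all_items)}
maximal = {s for s in counts if is_maximal(s, counts, all_items, min_sup)}

print(len(counts), "frequent itemsets,", len(closed), "closed,", len(maximal), "maximal")
```

With this threshold, the 63 frequent itemsets collapse to two closed itemsets, (1,2,3,4) and (1,2,3,4,5,6), and one maximal itemset, (1,2,3,4,5,6). The closed itemsets still carry the support counts of all their subsets, while the single maximal itemset does not, which is the trade-off described on the "Maximal Frequent Itemsets" slide.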

DOCUMENT INFO


Stats:

posted: | 7/14/2011 |

language: | English |

pages: | 123 |
