Mining Association Rules in Large Databases

By Group 10
Sadler Divers 103315414
Beili Wang 104522400
Xiang Xu 106067660
Xiaoxiang Zhang 105635826

Spring 2007 - CSE634 DATA MINING
Professor Anita Wasilewska
Department of Computer Sciences - Stony Brook University - SUNY

Sources/References
[1] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", 2nd Edition, Morgan Kaufmann Publishers, August 2006.
[2] A. Wasilewska, "Data Mining: Concepts and Techniques", Course Slides.
[3] J. Han, "Data Mining: Concepts and Techniques", Book Slides.
[4] T. Brijs et al., "Using Association Rules for Product Assortment Decisions: A Case Study", KDD-99, ACM, 1999.
[5] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases", VLDB'95, 432-444, Zurich, Switzerland.
[6] J. Han and Y. Fu, "Discovery of Multiple-Level Association Rules from Large Databases", In Proc. of 1995 Int. Conf. on Very Large Data Bases (VLDB'95).
[7] M. Kamber, J. Han, and J. Y. Chiang, "Metarule-guided mining of multi-dimensional association rules using data cubes", In Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining (KDD'97).
[8] B. Lent, A. Swami, and J. Widom, "Clustering association rules", In Proc. 1997 Int. Conf. Data Engineering (ICDE'97).
[9] S. Brin, R. Motwani, and C. Silverstein, "Beyond Market Baskets: Generalizing Association Rules to Correlations", Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data.

Goal and Overview
• Goals:
  – Introduce the concepts of frequent patterns, associations, and correlations;
  – Explain how they can be mined efficiently.
• Overview:
  – Introduction and Apriori Algorithm
  – Improving the Efficiency of Apriori
  – Mining Various Kinds of Association Rules
  – From Association Mining to Correlation Analysis

Introduction and Apriori Algorithm
Sadler Divers

References
[1] J. Han and M.
Kamber, "Data Mining: Concepts and Techniques", 2nd Edition, Morgan Kaufmann Publishers, August 2006.
[2] A. Wasilewska, "Data Mining: Concepts and Techniques", Course Slides.
[3] J. Han, "Data Mining: Concepts and Techniques", Book Slides.
[4] T. Brijs et al., "Using Association Rules for Product Assortment Decisions: A Case Study", KDD-99, ACM, 1999.

Mining Association Rules
• Definition: The process of finding frequent patterns or associations within the data of a database or a set of databases.
• Why? To gain information, knowledge, money, etc.

Applications
• Market Basket Analysis
• Cross-Marketing
• Catalog Design
• Product Assortment Decision

How is it done?
Approaches:
• Apriori Algorithm
• FP-Growth (Frequent Pattern Growth)
• Vertical Format

Concepts and Definitions
• Let I = {I1, I2, …, Im} be a set of items.
• Let D be a set of database transactions.
• Let T be a particular transaction, that is, a set of items with T ⊆ I.
• An association rule is of the form A => B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅.

Concepts & Definitions (continued)
• Support: The support of a rule A => B is the percentage of transactions in D, the database, that contain both A and B.
• Confidence: The percentage of transactions in D containing A that also contain B.

Concepts & Definitions (continued)
• Strong Rules: Rules that satisfy both a minimum support and a minimum confidence threshold are said to be strong.
• Itemset: Simply a set of items.
• k-Itemset: An itemset with k items in it.

Concepts & Definitions (continued)
• Frequent Itemset: An itemset is said to be frequent if it satisfies the minimum support threshold.
• Apriori Property: All nonempty subsets of a frequent itemset must also be frequent.

Apriori Algorithm
• A two-step process
  – The join step: To find Lk, the set of frequent k-itemsets, a set of candidate k-itemsets, Ck, is generated by joining Lk-1 with itself.
  – Rules for joining:
    • Order the items first so that itemsets can be compared item by item.
    • Two itemsets in Lk-1 can be joined only if their first (k-2) items are in common.

Apriori Algorithm (continued)
• The prune step:
  – The join step produces Ck, a superset of Lk: every frequent k-itemset is in Ck, but not every candidate is frequent.
  – Scan the database to see which candidates are indeed frequent and discard the others.
• Stop when the join step produces an empty set.

Apriori Algorithm: Pseudo code
• Join step: Ck is generated by joining Lk-1 with itself.
• Prune step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
• Pseudo-code:

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;

Source: A. Wasilewska, CSE 634, Lecture Slides

The Apriori Algorithm: An Example
Supmin = 2

Database TDB
    Tid   Items
    10    A, C, D
    20    B, C, E
    30    A, B, C, E
    40    B, E

1st scan, C1:
    Itemset   sup
    {A}       2
    {B}       3
    {C}       3
    {D}       1
    {E}       3

L1:
    Itemset   sup
    {A}       2
    {B}       3
    {C}       3
    {E}       3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan, C2 with counts:
    Itemset   sup
    {A, B}    1
    {A, C}    2
    {A, E}    1
    {B, C}    2
    {B, E}    3
    {C, E}    2

L2:
    Itemset   sup
    {A, C}    2
    {B, C}    2
    {B, E}    3
    {C, E}    2

C3 (generated from L2): {B, C, E}

3rd scan, L3:
    Itemset     sup
    {B, C, E}   2

Source: J. Han, "Data Mining: Concepts and Techniques", Book Slides

Generating Association Rules From Frequent Itemsets
• For each frequent itemset l, generate all nonempty proper subsets of l.
• For every such subset s of l, output the rule "s => (l - s)" if
  support_count(l) / support_count(s) >= min_conf
  (where min_conf is the minimum confidence threshold).
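The pseudo-code and the rule-generation step above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the course's reference implementation: the function names `apriori` and `generate_rules` are hypothetical, support is an absolute count (as in the Supmin = 2 example), and items are assumed to be sortable strings.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support_count} for every frequent itemset (hypothetical helper)."""
    transactions = [frozenset(t) for t in transactions]
    # 1st scan: count the single items to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)
    k = 1
    while level:
        # Join step: two frequent k-itemsets are joined when their first k-1 items agree.
        prev = sorted(tuple(sorted(s)) for s in level)
        candidates = set()
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:
                cand = frozenset(a) | frozenset(b)
                # Apriori property: every k-subset of a candidate must be frequent.
                if all(frozenset(sub) in level for sub in combinations(cand, k)):
                    candidates.add(cand)
        # Scan the database to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

def generate_rules(frequent, min_conf):
    """Return (antecedent, consequent, confidence) for each rule meeting min_conf."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for s in combinations(sorted(itemset), r):
                s = frozenset(s)
                # confidence(s => l - s) = support_count(l) / support_count(s)
                conf = sup / frequent[s]
                if conf >= min_conf:
                    rules.append((s, itemset - s, conf))
    return rules

# The TDB database from the example slide, mined with Supmin = 2.
TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent_itemsets = apriori(TDB, min_sup=2)
for lhs, rhs, conf in generate_rules(frequent_itemsets, min_conf=0.7):
    print(sorted(lhs), "=>", sorted(rhs), f"confidence={conf:.2f}")
```

The sketch rescans the whole transaction list for every candidate, which mirrors the pseudo-code rather than any optimized counting structure.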
Association Rules from Example
• Generate all nonempty proper subsets of the frequent itemset {B, C, E}:
  – {B, C}, {B, E}, {C, E}, {B}, {C}, {E}
• Calculate confidence:
  • B ∧ C => E   Confidence = 2/2 = 100%
  • B ∧ E => C   Confidence = 2/3 ≈ 67%
  • C ∧ E => B   Confidence = 2/2 = 100%
  • B => C ∧ E   Confidence = 2/3 ≈ 67%
  • C => B ∧ E   Confidence = 2/3 ≈ 67%
  • E => B ∧ C   Confidence = 2/3 ≈ 67%

Improving the Efficiency of Apriori
Beili Wang

References
[1] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases", VLDB'95, 432-444, Zurich, Switzerland. <http://www.informatik.uni-trier.de/~ley/db/conf/vldb/SavasereON95.html>
[2] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, March 2006. Chapter 5, Section 5.2.3, Page 240.
[3] Presentation Slides of Prof. Anita Wasilewska

Improving Apriori: General Ideas
• Challenges:
  – Multiple scans of the transaction database
  – Huge number of candidates
  – Tedious workload of support counting for candidates
• General ideas:
  – Reduce the number of passes over the transaction database
  – Shrink the number of candidates
  – Facilitate support counting of candidates
Source: textbook slide, 2nd Edition, Chapter 5, http://www-faculty.cs.uiuc.edu/~hanj/bk2/slidesindex.html

Methods to Improve Apriori's Efficiency
• Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
  An Effective Hash-Based Algorithm for Mining Association Rules <http://citeseer.ist.psu.edu/park95effective.html>
• Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans.
  Fast Algorithms for Mining Association Rules in Large Databases <http://citeseer.ist.psu.edu/agrawal94fast.html>
• Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
  An Efficient Algorithm for Mining Association Rules in Large Databases <http://citeseer.ist.psu.edu/sarasere95efficient.html>
• Sampling: Mine a subset of the given data with a lowered support threshold, plus a method to determine the completeness.
  Sampling Large Databases for Association Rules <http://citeseer.ist.psu.edu/toivonen96sampling.html>
• Dynamic itemset counting: Add new candidate itemsets only when all of their subsets are estimated to be frequent.
  Dynamic Itemset Counting and Implication Rules for Market Basket Data <http://citeseer.ist.psu.edu/brin97dynamic.html>
Source: Presentation Slides of Prof. Anita Wasilewska, 07. Association Analysis, page 51

Partition Algorithm: Basics
• Definition: A partition $p \subseteq D$ of the database refers to any subset of the transactions contained in the database $D$. Any two different partitions are non-overlapping, i.e., $p_i \cap p_j = \emptyset$ for $i \neq j$.
• Ideas: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB, where the minimum support is taken as a fraction of the partition's size.
  Partition scans DB only twice:
  Scan 1: partition the database and find local frequent patterns.
  Scan 2: consolidate global frequent patterns.

Partition Algorithm
Initially the database $D$ is logically partitioned into $n$ partitions.
Phase I: read the entire database once; takes $n$ iterations.
  input: $p_i$, where $i = 1, \dots, n$.
  output: local large itemsets of all lengths, $L_2^i, L_3^i, \dots, L_l^i$.
Merge phase:
  input: local large itemsets of the same length from all $n$ partitions.
  output: the global candidate itemsets. The set of global candidate itemsets of length $j$ is computed as $C_j^G = \bigcup_{i=1,\dots,n} L_j^i$.
Phase II: read the entire database again; takes $n$ iterations.
  input: $p_i$, where $i = 1, \dots, n$; $c \in C^G$.
  output: a counter for each global candidate itemset, holding its support.
Algorithm output: the itemsets that have the minimum global support, along with their support. The algorithm reads the entire database twice.
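The two phases and the merge step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the names `local_frequent` and `partition_mine` are hypothetical, minimum support is a fraction of the number of transactions, partitions are contiguous slices of a non-empty in-memory list, and the local mining step is a brute-force enumeration of subsets substituted for brevity (the published algorithm uses a more efficient local procedure).

```python
from itertools import combinations
from math import ceil

def local_frequent(partition, min_frac):
    """Phase I for one partition: itemsets frequent relative to the partition size."""
    min_count = ceil(min_frac * len(partition))
    counts = {}
    for t in partition:
        items = sorted(t)
        # Brute-force stand-in: enumerate every nonempty subset of the transaction.
        for r in range(1, len(items) + 1):
            for sub in combinations(items, r):
                key = frozenset(sub)
                counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c >= min_count}

def partition_mine(transactions, n, min_frac):
    """Return {itemset: global_support_count} using two passes over the data."""
    transactions = [frozenset(t) for t in transactions]
    size = ceil(len(transactions) / n)
    partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Phase I and merge phase: the union of all local large itemsets
    # forms the global candidate set.
    global_candidates = set()
    for p in partitions:
        global_candidates |= local_frequent(p, min_frac)

    # Phase II: second pass, counting every global candidate in every partition.
    counts = dict.fromkeys(global_candidates, 0)
    for p in partitions:
        for t in p:
            for c in global_candidates:
                if c <= t:
                    counts[c] += 1

    min_count = ceil(min_frac * len(transactions))
    return {s: c for s, c in counts.items() if c >= min_count}

# Four transactions split into two partitions, minimum support 50%.
data = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = partition_mine(data, n=2, min_frac=0.5)
print(sorted((sorted(s), c) for s, c in result.items()))
```

Because an itemset that meets the global threshold must meet the proportional threshold in at least one partition, the candidate set built in Phase I cannot miss a globally frequent itemset; Phase II only removes false positives.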
Partition Algorithm: Pseudo code
Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 7, <http://citeseer.ist.psu.edu/sarasere95efficient.html>

Partition Algorithm: Example
Consider a small database with four items I = {Bread, Butter, Eggs, Milk} and four transactions as shown in Table 1. Table 2 shows all itemsets for I. Suppose that the minimum support and minimum confidence of an association rule are 40% and 60%, respectively.
Source: A Survey of Association Rules <pandora.compsci.ualr.edu/milanova/7399-11/week10/ar.doc>

Partition Algorithm: Example
Source: A Survey of Association Rules <pandora.compsci.ualr.edu/milanova/7399-11/week10/ar.doc>

Partition Size
Q: How to estimate the partition size from system parameters?
A: We must choose the partition size such that at least those itemsets that are used for generating the new large itemsets can fit in main memory.
The size is estimated based on:
1. available main memory
2. average length of the transactions

Effect of Data Skew
Problem:
1. A gradual change in data characteristics, or any localized change in the data, can lead to the generation of a large number of local large itemsets that may not have global support.
2. Fewer itemsets will be found in common between partitions, leading to a larger global candidate set.
Solution: Reading the pages from the database in random order is extremely effective in eliminating data skew.
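One way to picture the random-read remedy for data skew is to shuffle the order in which pages are assigned to partitions, so that no partition holds only one contiguous stretch of the data. The sketch below is only an illustration: the helper name `random_partitions` is hypothetical, and modeling "pages" as elements of an in-memory Python list is an assumption made here, not something the slides specify.

```python
import random

def random_partitions(pages, n, seed=None):
    """Deal pages into n partitions in a random order (hypothetical helper)."""
    order = list(range(len(pages)))
    # Shuffle page indices so that contiguous pages are spread across partitions.
    random.Random(seed).shuffle(order)
    parts = [[] for _ in range(n)]
    for position, page_index in enumerate(order):
        parts[position % n].append(pages[page_index])
    return parts

# Ten pages, identified here simply by their page numbers.
pages = list(range(10))
for i, part in enumerate(random_partitions(pages, n=3, seed=0)):
    print(f"partition {i}: pages {part}")
```

Each page still lands in exactly one partition, so the non-overlap requirement on partitions is preserved.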
Performance Comparison - Time
Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 22, <http://citeseer.ist.psu.edu/sarasere95efficient.html>

Performance Comparison – Disk IO
Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 23, <http://citeseer.ist.psu.edu/sarasere95efficient.html>

Performance Comparison – Scale-up
Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 24, <http://citeseer.ist.psu.edu/sarasere95efficient.html>

Parallelization in Parallel Database
The Partition algorithm indicates that the processing of partitions can essentially be done in parallel.
The parallel algorithm executes in four phases:
1. All the processing nodes independently generate the large itemsets for their local data.
2. The large itemsets at each node are exchanged with all other nodes.
3. At each node, the support of each itemset in the candidate set is measured with respect to the local data.
4. The local counts at each node are sent to all other nodes. The global support is the sum of all local supports.

Conclusion
• The Partition algorithm achieves both CPU and I/O improvements over the Apriori algorithm.
• It scans the database at most twice, whereas in Apriori the number of scans is not known in advance and may be quite large.
• The inherent parallelism in the algorithm can be exploited for implementation on a parallel machine. It is suited for very large databases in an environment with high data and resource contention, such as an OLTP system.

Mining Various Kinds of Association Rules
Xiang Xu

Outline
• Mining multilevel association
• Mining multidimensional association
  – Mining quantitative association

References
[1] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, August 2000.
[2] A. Wasilewska, "Data Mining: Concepts and Techniques", Course Slides.
[3] J. Han, "Data Mining: Concepts and Techniques", Book Slides.
[4] J. Han and Y.
Fu, "Discovery of Multiple-Level Association Rules from Large Databases", In Proc. of 1995 Int. Conf. on Very Large Data Bases (VLDB'95).
[5] M. Kamber, J. Han, and J. Y. Chiang, "Metarule-guided mining of multi-dimensional association rules using data cubes", In Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining (KDD'97).
[6] B. Lent, A. Swami, and J. Widom, "Clustering association rules", In Proc. 1997 Int. Conf. Data Engineering (ICDE'97).

Mining Multilevel Association

Multilevel Association Rules
• Rules generated from association rule mining with concept hierarchies
  milk → bread [8%, 70%]
  2% milk → wheat bread [2%, 72%]
• Encoded transaction: T1 {111, 121, 211, 221}
Source: J. Han and Y. Fu, "Discovery of Multiple-Level Association Rules from Large Databases", Proc. of 1995 Int. Conf. on Very Large Data Bases (VLDB'95).

Multilevel Association: Uniform vs Reduced Support
• Uniform support
• Reduced support

Uniform Support
• Same minimum support threshold for all levels
  Level 1 (min support = 5%): Milk [support = 10%]
  Level 2 (min support = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]
• With the same 5% threshold at level 2, Skim Milk (4%) is not frequent.

Reduced Support
• Reduced minimum support threshold at lower levels
  Level 1 (min support = 5%): Milk [support = 10%]
  Level 2 (min support = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]
• With the threshold reduced to 3% at level 2, both 2% Milk and Skim Milk are frequent.

Mining Multilevel: Top-Down Progressive Deepening
• Find multilevel frequent itemsets
  – High-level frequent itemsets: milk (15%), bread (10%)
  – Lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)
• Generate multilevel association rules
  – High-level strong rules: milk → bread [8%, 70%]
  – Lower-level "weaker" rules: 2% milk → wheat bread [2%, 72%]

Generation of Flexible Multilevel Association Rules
• Association rules with alternative multiple hierarchies
  2% milk → Old Mills bread <{11*},{2*1}>
• Level-crossed association rules
  2% milk → Old Mills white bread <{11*},{211}>
Source: J. Han and Y. Fu, "Discovery of Multiple-Level Association Rules from Large Databases", Proc.
of 1995 Int. Conf. on Very Large Data Bases (VLDB'95).

Redundant Multilevel Association Rules Filtering
• Some rules may be redundant due to "ancestor" relationships between items
  milk → wheat bread [8%, 70%]
  2% milk → wheat bread [2%, 72%]
• The first rule is an ancestor of the second rule.
• A rule is redundant if its support and confidence are close to their "expected" values, based on the rule's ancestor.

Mining Multidimensional Association Rules

Multidimensional Association Rules
• Single-dimensional rules
  buys(X, "milk") → buys(X, "bread")
• Multidimensional rules (2 or more dimensions/predicates)
  – Inter-dimension association rules (no repeated predicates)
    age(X, "19-25") ∧ occupation(X, "student") → buys(X, "coke")
  – Hybrid-dimension association rules (repeated predicates)
    age(X, "19-25") ∧ buys(X, "popcorn") → buys(X, "coke")

Categorical Attributes and Quantitative Attributes
• Categorical attributes
  – Finite number of possible values, no ordering among values
• Quantitative attributes
  – Numeric, implicit ordering among values

Mining Quantitative Associations
• Static discretization based on predefined concept hierarchies
• Dynamic discretization based on data distribution
• Clustering: distance-based association

Static Discretization of Quantitative Attributes
• Attributes are discretized prior to mining using a concept hierarchy. Numeric values are replaced by ranges.
• In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
• A data cube is well suited for mining (faster).
Source: J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, August 2000.

Dynamic Discretization of Quantitative Association Rules
• Numeric attributes are dynamically discretized
  – such that the confidence of the rules mined is maximized
• Cluster adjacent association rules to form general rules using a 2-D grid: ARCS (Association Rules Clustering System)
Source: B. Lent, A. Swami, and J. Widom, "Clustering association rules", In Proc. 1997 Int. Conf.
Data Engineering (ICDE'97).

Clustering Association Rules: Example
  age(X, 34) ∧ income(X, "30 - 40K") → buys(X, "high resolution TV")
  age(X, 35) ∧ income(X, "30 - 40K") → buys(X, "high resolution TV")
  age(X, 34) ∧ income(X, "40 - 50K") → buys(X, "high resolution TV")
  age(X, 35) ∧ income(X, "40 - 50K") → buys(X, "high resolution TV")
These four adjacent rules are clustered into one general rule:
  age(X, "34 - 35") ∧ income(X, "30 - 50K") → buys(X, "high resolution TV")
Source: J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, August 2000.

Mining Distance-based Association Rules: Motive
• Binning methods like equi-width and equi-depth do not capture the semantics of interval data.

    Price($)   Equi-width (width $10)   Equi-depth (depth 2)   Distance-based
    7          [0,10]                   [7,20]                 [7,7]
    20         [11,20]                  [22,50]                [20,22]
    22         [21,30]                  [51,53]                [50,53]
    50         [31,40]
    51         [41,50]
    53         [51,60]

• Source: J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, August 2000.

Clusters and Distance Measurements
• $S[X]$: a set of $N$ tuples $t_1, t_2, \dots, t_N$ projected on the attribute set $X$.
• The diameter of $S[X]$:
  $$d(S[X]) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} \mathrm{dist}_X(t_i[X], t_j[X])}{N(N-1)}$$
• $\mathrm{dist}_X$: distance metric on the values for the attribute set $X$ (e.g. Euclidean distance or Manhattan distance)

Clusters and Distance Measurements (Cont.)
• Cluster $C_X$
  – Density threshold $d_0^X$: $d(C_X) \le d_0^X$
  – Frequency threshold $s_0$: $|C_X| \ge s_0$
• Finding clusters and distance-based rules
  – A modified version of BIRCH
  – The density threshold replaces support
  – The degree of association threshold replaces confidence

Conclusion
• Mining multilevel association
  – Uniform and reduced support
  – Top-down progressive deepening approach
  – Generation of flexible multilevel association rules
  – Redundant multilevel association rules filtering
• Mining multidimensional association
  – Mining quantitative association
    • Static Discretization of Quantitative Attributes
    • ARCS (Association Rules Clustering System)
    • Mining Distance-based Association Rules

From Association Mining to Correlation Analysis
Xiaoxiang Zhang

Sources/References:
[1] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers.
[2] S. Brin, R. Motwani, and C. Silverstein, "Beyond Market Baskets: Generalizing Association Rules to Correlations", Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data.

• Why do we need correlation analysis? Because correlation analysis can reveal which strong association rules are really interesting and useful.
• Association rule mining often generates a huge number of rules, but a majority of them either are redundant or do not reflect the true correlation relationship among data objects.

Example
The cell values below follow from the percentages quoted in this example (percent of all baskets):

                 Coffee   No coffee   Total
    Tea          20       5           25
    No tea       70       5           75
    Total        90       10          100

A table of this kind is called a contingency table.
• Let us apply the support-confidence framework to this example. If the support and confidence thresholds are [10%, 60%], then the following association rule is discovered:
  buys(X, "Tea") => buys(X, "Coffee") [support = 20%, confidence = 80%]
• However, tea => coffee is misleading, since the overall probability of purchasing coffee is 90%, which is larger than 80%.
• The above example illustrates that the confidence of a rule A => B can be deceiving, in that it is only an estimate of the conditional probability of itemset B given itemset A and takes no account of how common B is overall.
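The figures quoted in the tea and coffee example can be checked with a few lines of arithmetic. The sketch below assumes a total of 100 baskets purely for convenience (the slides give percentages only), and the function name `rule_stats` is hypothetical.

```python
def rule_stats(n_both, n_antecedent, n_consequent, n_total):
    """Support and confidence of A => B, plus the baseline probability of B."""
    support = n_both / n_total          # P(A and B)
    confidence = n_both / n_antecedent  # P(B | A)
    baseline = n_consequent / n_total   # P(B), regardless of A
    return support, confidence, baseline

# Assumed counts per 100 baskets: 20 with tea and coffee, 25 with tea, 90 with coffee.
support, confidence, baseline = rule_stats(
    n_both=20, n_antecedent=25, n_consequent=90, n_total=100
)
print(f"support={support:.0%} confidence={confidence:.0%} P(coffee)={baseline:.0%}")

# The rule passes the [10%, 60%] thresholds, yet buying tea lowers the chance of coffee.
passes_thresholds = support >= 0.10 and confidence >= 0.60
misleading = confidence < baseline
print(passes_thresholds, misleading)
```

Comparing the confidence against the baseline probability of the consequent is the simplest check that a strong rule actually reflects a positive association.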
Measuring Correlation
• One way of measuring correlation is
  $$\mathrm{corr}_{A,B} = \frac{P(A \wedge B)}{P(A)\,P(B)}$$
• If the resulting value is equal to 1, then A and B are independent. If the resulting value is greater than 1, then A and B are positively correlated; if it is less than 1, A and B are negatively correlated.
  For the above example, $P[t \wedge c] / (P[t] \cdot P[c]) = 0.2 / (0.25 \times 0.9) \approx 0.89$, which is less than 1, indicating a negative correlation between buying tea and buying coffee.
• Is this way of measuring correlation good enough? It does give us a correlation value, but it cannot tell us whether that value is statistically significant.
• So, we introduce: the chi-squared test for independence.

The chi-squared test for independence
• Let $R = \{i_1, \bar{i}_1\} \times \dots \times \{i_k, \bar{i}_k\}$ and $r = r_1 \dots r_k \in R$.
• Here $R$ is the set of all possible basket values, and $r$ is a single basket value. Each value of $r$ denotes a cell; this terminology comes from the view that $R$ is a k-dimensional contingency table. Let $O(r)$ denote the number of baskets falling into cell $r$, and $E[r]$ the number expected in cell $r$ if the items were independent.
• The chi-squared statistic is defined as:
  $$\chi^2 = \sum_{r \in R} \frac{(O(r) - E[r])^2}{E[r]}$$

What does the chi-squared statistic mean?
• The chi-squared statistic as defined tests whether all k items are k-way independent.
• If $\chi^2$ is equal to 0, then all the variables are really independent. If it is larger than the cutoff value at a given significance level, then we say all the variables are dependent (correlated); otherwise we say all the variables are independent.
• Note that the cutoff value for any given significance level can be obtained from widely available tables for the chi-squared distribution.
• Example of calculating $\chi^2$:
  $$\chi^2 = \sum_{r \in R} \frac{(O(r) - E[r])^2}{E[r]} = 0.900$$
• The cutoff at the 95% significance level is 3.84. Since 0.900 < 3.84, the two items are independent.

Correlation Rules
• We have the tool to test whether a given itemset is independent or dependent (correlated).
• We are almost ready to mine rules that identify correlations, or correlation rules.
• Then what is a correlation rule? A correlation rule is of the form $\{i_1, i_2, \dots, i_m\}$, where the occurrences of the items $\{i_1, i_2, \dots, i_m\}$ are correlated.

Upward Closed Property of Correlation
• An advantage of correlation is that it is upward closed. This means that if a set S of items is correlated, then every superset of S is also correlated. In other words, adding items to a set of correlated items does not remove the existing correlation.

Minimal Correlated Itemsets
• Minimal correlated itemsets are the itemsets that are correlated although none of their subsets is correlated.
• Minimal correlated itemsets form a border within the lattice.
• Consequently, the data mining task reduces to the problem of computing a border in the lattice.

Support and Significant Concepts
• Support: A set of items S has support s at the p% level if at least p% of the cells in the contingency table for S have a count of at least s.
• Significant: If an itemset is supported and minimally correlated, we say this itemset is significant.

Algorithm Chi-squared Support
• Input: A chi-squared significance level α, support s, support fraction p > 0.25, basket data B.
• Output: A set of minimal correlated itemsets from B.
1. For each item i in I, count O(i).
2. Initialize Cand ← ∅, Sig ← ∅, Notsig ← ∅.
3. For each pair of items ia, ib such that O(ia) > s and O(ib) > s, add {ia, ib} to Cand.
4. Notsig ← ∅.
5. If Cand is empty, then return Sig and terminate.
6. For each itemset in Cand, construct the contingency table for the itemset. If fewer than p percent of the cells have a count of at least s, then go to step 8.
7. If the chi-squared value exceeds the cutoff for significance level α, then add the itemset to Sig; otherwise add the itemset to Notsig.
8. Continue with the next itemset in Cand. If there are no more itemsets in Cand, then set Cand to be the set of all sets S such that every subset of size |S|-1 of S is in Notsig. Go to step 4.
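The steps of the algorithm above can be sketched in Python as follows. This is an illustrative reading rather than the authors' implementation: the function names are hypothetical, `p` is passed as a fraction of cells rather than a percentage, expected counts are computed from the item frequencies under full independence, and a single fixed `cutoff` (defaulting to the 3.84 value quoted for the 95% level) is assumed for itemsets of every size in place of a lookup by α.

```python
from itertools import combinations, product

def contingency(baskets, itemset):
    """Observed counts O(r) for every presence/absence cell r of the itemset."""
    items = sorted(itemset)
    cells = {r: 0 for r in product((True, False), repeat=len(items))}
    for b in baskets:
        cells[tuple(i in b for i in items)] += 1
    return items, cells

def chi_squared(baskets, itemset):
    """Chi-squared statistic against the hypothesis that the items are independent."""
    n = len(baskets)
    items, cells = contingency(baskets, itemset)
    prob = [sum(1 for b in baskets if i in b) / n for i in items]
    stat = 0.0
    for r, observed in cells.items():
        expected = n
        for present, pi in zip(r, prob):
            expected *= pi if present else 1 - pi
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat

def chi_squared_support(baskets, s, p, cutoff=3.84):
    """Return the set of supported, minimally correlated itemsets."""
    baskets = [frozenset(b) for b in baskets]
    item_count = {}
    for b in baskets:                                  # step 1
        for i in b:
            item_count[i] = item_count.get(i, 0) + 1
    frequent = sorted(i for i, c in item_count.items() if c > s)
    cand = {frozenset(pair) for pair in combinations(frequent, 2)}  # step 3
    sig = set()
    while cand:                                        # step 5
        notsig = set()                                 # step 4
        for itemset in cand:                           # steps 6 to 8
            _, cells = contingency(baskets, itemset)
            supported_cells = sum(1 for c in cells.values() if c >= s)
            if supported_cells < p * len(cells):
                continue                               # not supported
            if chi_squared(baskets, itemset) > cutoff:
                sig.add(itemset)
            else:
                notsig.add(itemset)
        # Next level: sets whose subsets of size |S|-1 are all in Notsig.
        size = len(next(iter(cand))) + 1
        pool = sorted({i for ns in notsig for i in ns})
        cand = {frozenset(c) for c in combinations(pool, size)
                if all(frozenset(sub) in notsig for sub in combinations(c, size - 1))}
    return sig

# Toy data: items "a" and "b" always occur together.
toy = [{"a", "b"}] * 10 + [set()] * 10
print(chi_squared_support(toy, s=5, p=0.26))
```

Because correlation is upward closed, an itemset that lands in Sig is never extended; only itemsets in Notsig contribute to the next level of candidates.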
Example:
• I: {i1, i2, i3, i4, i5}
• Cand: { {i1, i2}, {i1, i3}, {i1, i5}, {i3, i5}, {i2, i4}, {i3, i4} }
• Sig: { {i1, i2} }
• Notsig: { {i1, i3}, {i1, i5}, {i3, i5}, {i2, i4} }
• Cand (next level): { {i1, i3, i5} }, the only 3-itemset whose 2-item subsets are all in Notsig

Limitation
The chi-squared test should be used only if:
- All cells in the contingency table have an expected value greater than 1.
- At least 80% of the cells in the contingency table have an expected value greater than 5.

Conclusion
• The use of the chi-squared test is solidly grounded in statistical theory.
• The chi-squared statistic simultaneously and uniformly takes into account all possible combinations of the presence and absence of the various attributes being examined as a group.

Thank You!