POSTED ON: 4/30/2012
Association Rule
Market Basket Model
A-priori Algorithm

The Market-Basket Model
• A market basket is a collection of items purchased by a customer in a single customer transaction.
• A large set of items (e.g., things sold in a supermarket).
• A large set of baskets, each of which is a small set of the items (e.g., the things one customer buys on one day).

Support
• Simplest question: find sets of items that appear “frequently” in the baskets.
• Support for itemset I = the number of baskets containing all items in I.
• Given a support threshold s, sets of items that appear in > s baskets are called frequent itemsets.

Applications
• “Baskets” = documents; “items” = words in those documents.
  – Lets us find words that appear together unusually frequently, i.e., linked concepts.
• “Baskets” = Web pages; “items” = linked pages.
  – Pairs of pages with many common references may be about the same topic.
• Real market baskets: chain stores keep terabytes of information about what customers buy together.

Mining Association Rule
• Example of a retail store:

  Transaction   Items
  T1            Bread, Jelly, Butter
  T2            Bread, Butter
  T3            Bread, Milk, Butter
  T4            Beer, Bread
  T5            Beer, Milk

• Support of an item (or set of items), expressed here as a percentage, is the percentage of transactions in which that item (or set) occurs.
• The 5 transactions count as 100% (in this case).

Mining Association Rule
• Beer occurs twice, in T4 and T5.
• Therefore, support of {Beer} = 40%.
• Beer and Jelly occur together in 0 transactions.
• Beer and Milk occur together only in T5.
• Therefore, support of {Beer, Milk} = 20%.

List of a few itemsets and their support

  S.No   Set                    Support (%)
  1      Beer                   40
  2      Bread                  80
  3      Jelly                  20
  4      Milk                   40
  5      Butter                 60
  6      Beer, Bread            20
  7      Beer, Milk             20
  8      Bread, Jelly           20
  9      Bread, Jelly, Butter   20
  10     Bread, Milk, Butter    20
  11     Milk, Butter           20
  12     Bread, Butter          60

  Transaction   Items
  T1            Bread, Jelly, Butter
  T2            Bread, Butter
  T3            Bread, Milk, Butter
  T4            Beer, Bread
  T5            Beer, Milk

Mining Association Rule
Definition (Support): The support for an association rule X → Y is the percentage of transactions in the database that contain X ∪ Y.

Mining Association Rule
Definition (Confidence): The confidence or strength (σ) of an association rule X → Y is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X.

Mining Association Rule
Bread occurs in 4 transactions (T1 to T4).
Bread and Butter occur together 3 times (T1, T2, T3).
Therefore σ = 3/4, i.e. 75%.

  X        Y        S      σ
  Bread    Butter   60%    75%
  Butter   Bread    60%    100%
  Jelly    Milk     0%     0%

Confidence shows that Bread → Butter is a stronger rule than Jelly → Milk.

Association Rules
• If-then rules about the contents of baskets.
• {i1, i2,…,ik} → j means: “if a basket contains all of i1,…,ik then it is likely to contain j.”
• Confidence of this association rule is the probability of j given i1,…,ik.

Mining Association Rule
A large itemset is an itemset whose number of occurrences is above a threshold s.
L: the complete set of large itemsets.
Suppose m is the number of items.
No. of subsets = pow(2, m)
No. of potentially large itemsets = pow(2, m) – 1 (excluding the empty set)
For example,
m = 5 gives pow(2, 5) – 1 = 31 itemsets.

AR Algorithm (Example)
Suppose the input support and confidence are s = 30% and σ = 50%, and the set of large itemsets is
L = {{Beer}, {Bread}, {Milk}, {Butter}, {Bread, Butter}}
Let l = {Bread, Butter}, so that {Bread} and {Butter} are two non-empty subsets of l.
support({Bread, Butter}) / support({Bread}) = 60 / 80 = 0.75
Thus the confidence of the association rule Bread → Butter is 75%. Since this is above the given threshold, it is a valid association rule.

Important Point
• “Market baskets” is an abstraction that models any many-many relationship between two concepts: “items” and “baskets.”
  – Items need not be “contained” in baskets.
• The only difference is that we count co-occurrences of items related to a basket, not vice-versa.

Association Mining?
• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
  – Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
• Examples:
  – Rule form: “Body → Head [support, confidence]”.
  – buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
  – major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]

Mining Association Rules: An Example
Min. support 50%, min. confidence 50%

  Transaction ID   Items Bought
  2000             A, B, C
  1000             A, C
  4000             A, D
  5000             B, E, F

  Frequent Itemset   Support
  {A}                75%
  {B}                50%
  {C}                50%
  {A, C}             50%

For rule A → C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent.

Mining Frequent Itemsets
• Find the frequent itemsets: the sets of items that have minimum support.
  – A subset of a frequent itemset must also be a frequent itemset.
  – i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.

The Apriori Algorithm: Basic idea
• The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.
• k-itemsets are used to explore (k+1)-itemsets.
• First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support.
• The resulting set is denoted by L1.
• L1 is used to find L2 (the set of frequent 2-itemsets).
• L2 is used to find L3 (the set of frequent 3-itemsets).

The Apriori Algorithm:
• How Lk-1 is used to find Lk, where k >= 2
• A two-step process is followed:
  – Join
  – Prune
• Join:
  – To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
  – The set of candidates is denoted by Ck.
• Prune:
  – Ck is a superset of Lk.
  – That is, its members may or may not be frequent.

Apriori Algorithm for Boolean Association Rule:
• Join step: Ck is generated by joining Lk-1 with itself.
• Prune step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
• Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k
    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return L = ∪k Lk;

The Apriori Algorithm: Example

  Database D
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

  Scan D to count C1:
  itemset   sup.
  {1}       2
  {2}       3
  {3}       3
  {4}       1
  {5}       3

  L1:
  itemset   sup.
  {1}       2
  {2}       3
  {3}       3
  {5}       3

  C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

  Scan D to count C2:
  itemset   sup
  {1 2}     1
  {1 3}     2
  {1 5}     1
  {2 3}     2
  {2 5}     3
  {3 5}     2

  L2:
  itemset   sup
  {1 3}     2
  {2 3}     2
  {2 5}     3
  {3 5}     2

  C3 (generated from L2): {2 3 5}

  Scan D to obtain L3:
  itemset   sup
  {2 3 5}   2

Problem: Generate candidate itemsets and frequent itemsets where the minimum support count is 2.

  Transaction ID   List of Item IDs
  T100             I1, I2, I5
  T200             I2, I4
  T300             I2, I3
  T400             I1, I2, I4
  T500             I1, I3
  T600             I2, I3
  T700             I1, I3
  T800             I1, I2, I3, I5
  T900             I1, I2, I3

How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order.
• Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck

Example of Generating Candidates
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
  – abcd from abc and abd
  – acde from acd and ace
• Pruning:
  – acde is removed because ade is not in L3
• C4 = {abcd}

Improving Apriori’s Efficiency
• Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
• Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
• Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
• Sampling: mine a subset of the given data, using a lower support threshold plus a method to determine completeness.
• Dynamic itemset counting: add new candidate itemsets immediately (unlike Apriori) when all of their subsets are estimated to be frequent.

Association Rule Mining
• Types, with a description of:
  – Multiple (Multi Level) AR from Transaction DB.
  – Multi Dimensional AR from RDB.
There are many types of AR.
AR can be classified in various ways, based on the following criteria:
1. Based on the types of values handled in the rule (Boolean AR and Quantitative AR):
# If a rule concerns associations between the presence or absence of items, it is a Boolean association rule.

AR Types
• A support of 2% for AR1 means that 2% of all the transactions under analysis show that computer and financial_management_software are purchased together. A confidence of 60% for AR1 means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts.

AR Types
# If a rule describes associations between quantitative items or attributes, then it is a Quantitative Association Rule.
• In these rules, quantitative values for items or attributes are partitioned into intervals.
• Association Rule 2 (AR2) below is an example of a quantitative association rule.
  [AR2 formula missing from source]
• Note that the quantitative attributes, age and income, have been discretized.

Multi Dimensional AR
2. Based on the dimensions of data involved in the rule (Single-D AR and Multi-D AR):
# If the items or attributes in an association rule each reference only one dimension, then it is a single-dimensional association rule. Note that AR 1 could be rewritten as:
  [rewritten AR1 formula missing from source]
• AR 1 is a single-dimensional association rule since it refers to only one dimension, i.e. buys.
# If a rule references two or more dimensions, such as the dimensions buys, time of transaction and customer category, then it is a multidimensional association rule. AR 2 is a multidimensional association rule since it involves three dimensions: age, income and buys.

Multi Level AR
3. Based on the levels of abstraction involved in the rule (Single Level AR and Multi Level AR):
• Some methods for association rule mining can find rules at different levels of abstraction.
• For example: suppose that a set of mined association rules includes AR 3 and AR 4 below.
  [AR3 and AR4 formulas missing from source]
• In AR3 and AR4, the items bought are referenced at different levels of abstraction (i.e. “computer” is a higher-level abstraction of “laptop computer”).
• We refer to the rule set mined as consisting of multilevel association rules.
• If, instead, the rules within a given set do not reference items or attributes at different levels of abstraction, then the set contains single-level association rules.
4. Based on the nature of the association involved in the rule: association mining can be extended to correlation analysis, where the absence or presence of correlated items can be identified.
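
Worked sketch: support, confidence and Apriori
The definitions of support and confidence and the join and prune steps described in these slides can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the slides' own implementation: the function names (support_count, confidence, apriori_gen, apriori) are hypothetical, support is handled as a count, and an itemset is treated as frequent when it occurs in at least min_count transactions. The data are the retail-store transactions and Database D from the examples.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)

def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: count(X u Y) / count(X)."""
    x, y = frozenset(x), frozenset(y)
    return support_count(x | y, transactions) / support_count(x, transactions)

def apriori_gen(prev_frequent, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    candidates = set()
    for p in prev:
        for q in prev:
            # Join step: same first k-2 items, last item of p < last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every (k-1)-subset of c must itself be frequent.
                if all(s in prev_set for s in combinations(c, k - 1)):
                    candidates.add(frozenset(c))
    return candidates

def apriori(transactions, min_count):
    """Return {frequent itemset: support count}.

    Assumption: 'frequent' means count >= min_count.
    """
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset([i]) for t in transactions for i in t}  # C1
    frequent = {}
    k = 1
    while level:
        # One scan of the database per level to count the candidates.
        counts = {c: support_count(c, transactions) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_count}  # Lk
        frequent.update(survivors)
        k += 1
        level = apriori_gen(survivors, k)  # Ck+1 generated from Lk
    return frequent

# Retail-store example from the slides (T1..T5).
retail = [
    {"Bread", "Jelly", "Butter"},
    {"Bread", "Butter"},
    {"Bread", "Milk", "Butter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]
print(confidence({"Bread"}, {"Butter"}, retail))  # 0.75

# Database D from the Apriori example (TIDs 100..400).
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, count in sorted(apriori(D, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Keeping each itemset as a sorted tuple inside apriori_gen mirrors the ordering condition p.itemk-1 < q.itemk-1 of the self-join, so each candidate is generated only once. Counting with one full pass per candidate is the simplest choice and ignores the efficiency improvements listed earlier.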