5-Association Rules

Shared by: Honey Singh
posted:
11/12/2007
Contents
Association Rules Problem Overview
– Large itemsets
Association Rules Algorithms
– Apriori
– Sampling
– Partitioning
– Parallel Algorithms

Example: Market Basket Data
Items frequently purchased together: Bread ⇒ PeanutButter
Uses:
– Advertising
– Sales
– Communication
Objective: increase sales and reduce costs

Association Rule Definitions
Set of items: I = {I1, I2, …, Im}
Transactions: D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I
Support (s) of an itemset: percentage of transactions that contain that itemset.
Large (frequent) itemset: itemset whose number of occurrences is above a threshold s.
Association rule (AR): implication X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅
Support (s) of AR X ⇒ Y: percentage of transactions that contain X ∪ Y
Confidence (α) of AR X ⇒ Y: ratio of the number of transactions that contain X ∪ Y to the number that contain X

Association Rules Example
I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%

Association Rules Example (cont'd)
[table not recovered]

Association Rule Problem
Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support (s) and confidence (α).

Association Rule Techniques
1. Find large itemsets (L).
2. Generate rules from frequent itemsets.

A set of m items has 2^m subsets, so the potential number of large itemsets is 2^m − 1 (every subset except the empty one). For m = 5 that is 31 itemsets; for m = 30 it is 1,073,741,823 itemsets. Association rule algorithms are therefore based on smart ways to reduce the number of itemsets to be counted.
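The support and confidence definitions above can be sketched in a few lines of Python. This is an illustration only: the transaction table from the slides was not preserved, so the five transactions below are an assumption, chosen to agree with the 60% support quoted for {Bread, PeanutButter}. The function names are likewise made up for this example.

```python
# ASSUMPTION: the original transaction table is missing. These five
# transactions are a hypothetical reconstruction that gives
# {Bread, PeanutButter} the 60% support quoted in the text.
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)


def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(x, y, transactions):
    """For the rule X => Y: count(X ∪ Y) / count(X)."""
    return count(set(x) | set(y), transactions) / count(x, transactions)


def potential_itemsets(m):
    """Number of nonempty subsets of a set of m items."""
    return 2 ** m - 1


print(support({"Bread", "PeanutButter"}, TRANSACTIONS))       # 0.6
print(confidence({"Bread"}, {"PeanutButter"}, TRANSACTIONS))  # 0.75
print(potential_itemsets(5), potential_itemsets(30))          # 31 1073741823
```

Working from counts rather than from rounded percentages keeps the confidence ratio exact for small examples like this one.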
The itemsets selected for counting are called candidates.

Association Rule Notations
Term      Description
D         Database of transactions
ti        Transaction in D
s         Support
α         Confidence
X, Y      Itemsets
X ⇒ Y     Association rule
L         Set of large itemsets
l         Large itemset in L
C         Set of candidate itemsets
p         Number of partitions

Algorithm to Generate ARs
Set                    Support (%)
Beer                   40
Bread                  80
Jelly                  20
Milk                   40
PeanutButter           60
Beer, Bread            20
Beer, Jelly            0
…
Bread, PeanutButter    60

• s = 30%, α = 50%
• Which itemsets are large?
• L = {{Beer}, {Bread}, {Milk}, {PeanutButter}, {Bread, PeanutButter}}
• Confidence of Bread ⇒ PeanutButter = support({Bread, PeanutButter}) / support({Bread}) = 60/80 = 0.75 = 75%
• Confidence of PeanutButter ⇒ Bread = support({Bread, PeanutButter}) / support({PeanutButter}) = 60/60 = 100%

Apriori
Large Itemset Property: any subset of a large itemset is large. Equivalently, if an itemset is not large, none of its supersets are large.

Apriori Algorithm
Mining association rules from large databases is a two-step process:
1. Find all frequent itemsets; that is, find all itemsets with frequency >= minimum count.
2. From the frequent itemsets, generate association rules satisfying the minimum support and confidence conditions.

Generating Frequent Itemsets
The Apriori algorithm generates candidate itemsets of a particular size and then scans the database to count them and see whether they are large. An itemset is considered as a candidate only if all its subsets are also large (the Large Itemset Property).

APRIORI
1. k = 1
2. Find the frequent set Lk from Ck, the set of all candidate itemsets of size k
3. Form Ck+1 from Lk; k = k + 1
4. Repeat 2-3 until Ck is empty
Details about steps 2 and 3:
– Step 2: scan D and count each itemset in Ck; if its count is at least minSup, it is frequent
– Step 3: candidate generation, described next

Apriori's Candidate Generation
For k = 1, C1 = all 1-itemsets (all individual items).
For k > 1, generate Ck from Lk-1 as follows:
– The join step: Ck = join of Lk-1 with itself on the first k-2 items. If both {a1, …, ak-2, ak-1} and {a1, …, ak-2, ak} are in Lk-1, then add {a1, …, ak-2, ak-1, ak} to Ck (items are kept sorted).
– The prune step: remove {a1, …, ak-2, ak-1, ak} from Ck if it contains a non-frequent (k-1)-subset.

Apriori Example (cont'd)
s = 30%, α = 50%

Example: Finding frequent itemsets (minSupport = 0.5)
Dataset D
TID    Items
T10    a1 a3 a4
T20    a2 a3 a5
T30    a1 a2 a3 a5
T40    a2 a5

1. Scan D
   C1: a1:2, a2:3, a3:3, a4:1, a5:3
   L1: a1:2, a2:3, a3:3, a5:3
   C2: a1a2, a1a3, a1a5, a2a3, a2a5, a3a5
2. Scan D
   C2: a1a2:1, a1a3:2, a1a5:1, a2a3:2, a2a5:3, a3a5:2
   L2: a1a3:2, a2a3:2, a2a5:3, a3a5:2
   C3: a2a3a5
   Pruned C3: a2a3a5
3. Scan D
   L3: a2a3a5:2

Generating candidate 3-itemsets: join L2 with itself, where two itemsets are joined if they have their first k-2 items in common (here, the first item, with items in alphabetical order).
L2: a1a3:2, a2a3:2, a2a5:3, a3a5:2
C3: a2a3a5

The Apriori Algorithm: Example
Database D
TID    Items
100    1 3 4
200    2 3 5
300    1 2 3 5
400    2 5

Scan D to count C1:
itemset    sup.
{1}        2
{2}        3
{3}        3
{4}        1
{5}        3

L1:
itemset    sup.
{1}        2
{2}        3
{3}        3
{5}        3

C2 (from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count C2:
itemset    sup.
{1 2}      1
{1 3}      2
{1 5}      1
{2 3}      2
{2 5}      3
{3 5}      2

L2:
itemset    sup.
{1 3}      2
{2 3}      2
{2 5}      3
{3 5}      2

C3 (from L2): {2 3 5}

Scan D to obtain L3:
itemset    sup.
{2 3 5}    2

Generating Association Rules
For each frequent itemset L:
1. Generate all subsets of L.
2. Let ss represent a nonempty proper subset of L. Consider the association rule R: ss ⇒ (L − ss), where (L − ss) indicates the set L without ss. Generate (and output) R if R fulfills the minimum confidence requirement.
3. Do so for every subset ss of L. Note that, for simplicity, a single-item consequent is often desired.
4. Self exercise: apply these steps to the given example.

Apriori Advantages/Disadvantages
Advantages:
– Uses the large itemset property.
– Easy to implement.
Disadvantages:
– Assumes the transaction database is memory resident.

Sampling
For large databases: sample the database and apply Apriori to the sample, which can be memory resident.
Potentially Large Itemsets (PL): the large itemsets found in the sample are called PL. They are used as candidates to be counted against the entire database.
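Because sampling simply runs Apriori on the sample, a compact Apriori is worth having at hand. The following is a minimal sketch rather than a reference implementation: it follows the join and prune steps described earlier, represents itemsets as sorted tuples, and takes an absolute `min_count` in place of a percentage. All function names are assumptions of this example; the data is the four-transaction dataset (T10 to T40) from the worked example.

```python
from itertools import combinations


def apriori_gen(prev_large, k):
    """Build C_k from L_(k-1): join on the first k-2 items, then prune."""
    prev = sorted(prev_large)          # itemsets are sorted tuples
    prev_set = set(prev)
    candidates = set()
    for a, b in combinations(prev, 2):
        if a[:k - 2] == b[:k - 2]:     # join step; a < b, so a[-1] < b[-1]
            cand = a + (b[-1],)
            # prune step: every (k-1)-subset must itself be large
            if all(sub in prev_set for sub in combinations(cand, k - 1)):
                candidates.add(cand)
    return candidates


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    items = sorted({i for t in transactions for i in t})
    candidates = {(i,) for i in items}  # C1: every individual item
    large = {}
    k = 1
    while candidates:
        # one scan of the database per level
        counts = {c: sum(set(c) <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        large.update(level)
        k += 1
        candidates = apriori_gen(level, k)
    return large


def generate_rules(large, min_conf):
    """Emit (ss, L - ss, confidence) for each large itemset L and proper subset ss."""
    rules = []
    for itemset, n in large.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                conf = n / large[lhs]   # lhs is large by the large itemset property
                if conf >= min_conf:
                    rhs = tuple(i for i in itemset if i not in lhs)
                    rules.append((lhs, rhs, conf))
    return rules


D = [
    {"a1", "a3", "a4"},        # T10
    {"a2", "a3", "a5"},        # T20
    {"a1", "a2", "a3", "a5"},  # T30
    {"a2", "a5"},              # T40
]
large = apriori(D, min_count=2)        # minSupport = 0.5 of 4 transactions
print(large[("a2", "a3", "a5")])       # 2
```

On this dataset the sketch reproduces L1, L2 and L3 from the example: four large 1-itemsets, four large 2-itemsets, and the single 3-itemset a2a3a5.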
Negative Border (BD⁻):
– Additional candidates are determined by applying BD⁻ to the large itemsets from the sample.
– BD⁻(PL) is the minimal set of itemsets which are not in PL, but whose subsets are all in PL.
The entire set of candidates is then C = BD⁻(PL) ∪ PL.

Negative Border Example
Suppose that the set of items is {A, B, C, D}.
PL = {A, C, D, CD}
BD⁻(PL) = {B, AC, AD}
PL ∪ BD⁻(PL) = {A, B, C, D, AC, AD, CD}

Sampling Algorithm
1. Ds = sample of database D;
2. PL = large itemsets in Ds using smalls (a reduced support threshold);
3. C = PL ∪ BD⁻(PL);
4. Count C in database using s;
5. ML = large itemsets in BD⁻(PL);
6. If ML = ∅ then done
7. else C = repeated application of BD⁻;
8. Count C in database;

Sampling Example
Find ARs assuming s = 20%
Ds = {t1, t2}
PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter}, {Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
BD⁻(PL) = {{Beer}, {Milk}}
ML = {{Beer}, {Milk}}
Repeated application of BD⁻ generates all remaining itemsets.

Sampling Advantages/Disadvantages
Advantages:
– Reduces the number of database scans to one in the best case and two in the worst.
Disadvantages:
– Potentially large number of candidates in the second pass.

Partitioning
Divide the database into partitions D1, D2, …, Dp.
Apply Apriori to each partition.
Any large itemset must be large in at least one partition.

Partitioning Algorithm
1. Divide D into partitions D1, D2, …, Dp;
2. For i = 1 to p do
3.    Li = Apriori(Di);
4. C = L1 ∪ … ∪ Lp;
5. Count C on D to generate L;

Partitioning Example (s = 10%)
D1: L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter}, {Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
D2: L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread, Milk}, {Bread, PeanutButter}, {Milk, PeanutButter}, {Bread, Milk, PeanutButter}, {Beer}, {Beer, Bread}, {Beer, Milk}}

Partitioning Advantages/Disadvantages
Advantages:
– Adapts to available main memory.
– Easily parallelized.
– Maximum number of database scans is two.
Disadvantages:
– May have many candidates during the second scan.
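The negative border and the two-scan partitioning scheme above can both be sketched briefly. This is a rough illustration under stated assumptions: the function names are invented for this example, `large_itemsets` is a brute-force stand-in for Apriori(Di) that is only sensible for tiny data, and the partitions D1 and D2 are hypothetical transactions chosen so that the local results match the L1 and L2 listed in the partitioning example.

```python
from itertools import combinations


def negative_border(pl, items):
    """Itemsets not in PL whose proper subsets are all in PL.

    Only the (k-1)-subsets are checked, which is enough when PL is
    downward closed (true for large itemsets). The empty set counts as in PL.
    """
    pl = {frozenset(s) for s in pl}
    border = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(sorted(items), k):
            if frozenset(combo) in pl:
                continue
            if all(k == 1 or frozenset(sub) in pl
                   for sub in combinations(combo, k - 1)):
                border.add(frozenset(combo))
    return border


def large_itemsets(transactions, min_support):
    """Brute-force stand-in for Apriori(D_i): count every subset of every transaction."""
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for combo in combinations(sorted(t), k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    n = len(transactions)
    return {s for s, c in counts.items() if c / n >= min_support}


def partition_mine(partitions, min_support):
    """Scan 1: local large itemsets become candidates. Scan 2: count them on all of D."""
    candidates = set().union(*(large_itemsets(p, min_support) for p in partitions))
    database = [t for p in partitions for t in p]
    n = len(database)
    return {c for c in candidates
            if sum(c <= t for t in database) / n >= min_support}


# Negative border example from the text: items {A, B, C, D}, PL = {A, C, D, CD}.
border = negative_border([{"A"}, {"C"}, {"D"}, {"C", "D"}], {"A", "B", "C", "D"})

# ASSUMPTION: hypothetical partitions consistent with the listed L1 and L2.
D1 = [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"}]
D2 = [{"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"}]
L = partition_mine([D1, D2], min_support=0.10)
```

Under these assumptions `border` comes out as {B, AC, AD}, and the local results for D1 and D2 contain 7 and 10 itemsets respectively, as in the example.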
Parallelizing AR Algorithms
Based on Apriori.
Techniques differ in:
– What is counted at each site
– How data (transactions) are distributed
Data Parallelism
– Data partitioned
– Count Distribution Algorithm
Task Parallelism
– Data and candidates partitioned
– Data Distribution Algorithm

Count Distribution Algorithm (CDA)
1. Place data partition at each site.
2. In parallel at each site do
3.    C1 = itemsets of size one in I;
4.    Count C1;
5.    Broadcast counts to all sites;
6.    Determine global large itemsets of size 1, L1;
7.    i = 1;
8.    Repeat
9.       i = i + 1;
10.      Ci = Apriori-Gen(Li-1);
11.      Count Ci;
12.      Broadcast counts to all sites;
13.      Determine global large itemsets of size i, Li;
14.   until no more large itemsets found;

CDA Example
[figure not recovered]

Data Distribution Algorithm (DDA)
1. Place data partition at each site.
2. In parallel at each site do
3.    Determine local candidates of size 1 to count;
4.    Broadcast local transactions to other sites;
5.    Count local candidates of size 1 on all data;
6.    Determine large itemsets of size 1 for local candidates;
7.    Broadcast large itemsets to all sites;
8.    Determine L1;
9.    i = 1;
10.   Repeat
11.      i = i + 1;
12.      Ci = Apriori-Gen(Li-1);
13.      Determine local candidates of size i to count;
14.      Count, broadcast, and find Li;
15.   until no more large itemsets found;

DDA Example
[figure not recovered]

Advanced AR Techniques
– Generalized Association Rules
– Multiple-Level Association Rules
– Quantitative Association Rules
– Using multiple minimum supports
– Correlation Rules

Generalized Association Rules
Generalized association rules allow rules at different levels in the concept hierarchy. ARs could be generated for any and all levels in the hierarchy.
A generalized association rule, X ⇒ Y, is defined like a regular association rule, with the restriction that no item in Y may be above any item in X.
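The restriction on generalized rules can be checked mechanically once a concept hierarchy is available. The sketch below is illustrative only: the hierarchy (Wheat Bread under Bread under Grain, Butter under Dairy) and the function names are assumptions made for this example, not something defined in the slides.

```python
# ASSUMPTION: a hypothetical concept hierarchy, stored as child -> parent.
PARENT = {
    "Wheat Bread": "Bread",
    "Bread": "Grain",
    "Butter": "Dairy",
}


def ancestors(item):
    """All items above `item` in the hierarchy."""
    above = set()
    while item in PARENT:
        item = PARENT[item]
        above.add(item)
    return above


def is_valid_generalized_rule(x, y):
    """X and Y must be disjoint, and no item in Y may be above any item in X."""
    x, y = set(x), set(y)
    above_x = set().union(*(ancestors(i) for i in x))
    return x.isdisjoint(y) and y.isdisjoint(above_x)


print(is_valid_generalized_rule({"Wheat Bread"}, {"Butter"}))  # True
print(is_valid_generalized_rule({"Bread"}, {"Grain"}))         # False
```

The second rule is rejected because Grain sits above Bread in the assumed hierarchy, so the rule would hold trivially for every transaction containing Bread.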
Example: Grain ⇒ Butter, Bread ⇒ Butter and Wheat Bread ⇒ Butter are generalized association rules at successively lower levels of a concept hierarchy. Support can only stay the same or drop as a rule moves to a lower level.

Multiple-Level Association Rules
A variation of generalized association rules.
Itemsets may occur from any level in the hierarchy.
Large k-itemsets at one level in the concept hierarchy are used as candidates to generate large k-itemsets for children at the next level.
Under the reduced minimum support concept, the following rules apply:
– The minimum support for all nodes in the hierarchy at the same level is identical.
– If s_i is the minimum support for level i in the hierarchy and s_(i-1) is the minimum support for level i-1, then s_(i-1) > s_i.

Quantitative Association Rules
Traditional AR algorithms assume the data is categorical. A quantitative association rule is one that involves categorical and quantitative data.
Example: a customer buys wine priced between Rs 30 and Rs 50 a bottle ⇒ he also buys snacks.
Traditional: a customer buys wine ⇒ he also buys snacks.
Instead of having items {Bread, Butter} we might have the items {(Bread:[0…1]), (Bread:[1…2]), (Bread:[2…3]), (Bread:[3…]), …}.
The minimum support and confidence used may be lowered.
When there are many intervals, each one is less likely to meet the minimum support condition, so adjacent intervals are combined before the support is calculated.

Using multiple minimum supports
With many types of data, is a single minimum support value feasible? Different items behave differently, and useful rules will be missed.
Skim Milk ⇒ Wheat Bread, s = 3%
Milk ⇒ Bread, s = 6%
If s is too high, rules involving rare items will not be generated. If s is too low, too many rules will be generated.
Approaches:
– Combine clustering and association rules.
– MISApriori (Minimum Item Support Apriori): allows a different support threshold to be indicated for each item. The minimum support for a rule is the minimum of all the minimum supports for each item in the rule.
Example: {A, B, C} with MIS(A) = 20%, MIS(B) = 3% and MIS(C) = 4%. Because the support required for A alone is large, {A} may turn out to be small, while both AB and AC may be large, because the minimum support for AB = min(MIS(A), MIS(B)) = 3% and for AC = min(MIS(A), MIS(C)) = 4%.

Correlation Rules
A correlation rule is defined as a set of itemsets that are correlated.
Motivation: negative correlations may be useful.
Example: two items {A, B} with A ⇒ B, s = 15%, confidence = 60%, P(B) = 70%, P(A) = 25%. The confidence P(B|A) = 60% is below P(B) = 70%, which shows that the probability of purchasing B has actually gone down when A is purchased. There seems to be a negative correlation between buying A and buying B.
Correlation can be expressed as
Correlation(A ⇒ B) = P(A,B) / (P(A) · P(B)) = 0.15 / (0.25 × 0.70) = 0.857
which is less than 1.

Incremental Association Rules
Generate ARs in a dynamic database.
Problem: algorithms assume a static database.
Objective:
– Know the large itemsets for D
– Find the large itemsets for D ∪ {ΔD}
An itemset must be large in either D or ΔD.

Measuring Quality of Rules
– Support
– Confidence
– Interest (Correlation)
– Chi Squared Test

Support: s(A ⇒ B) = P(A,B)
Confidence: α(A ⇒ B) = P(B|A)

Interest (Correlation)
Interest(A ⇒ B) = P(A,B) / (P(A) · P(B))
There is no difference between the values of Interest(A ⇒ B) and Interest(B ⇒ A).

Chi Squared Test
The chi-squared test takes into account both the presence and the absence of items in sets. It is used to measure how much an itemset count differs from the expected count.
The possible combinations of presence and absence of the m items are
\[ \mathcal{I} = \{\, t_j \mid t_j \in \{I_1, \bar{I}_1\} \times \{I_2, \bar{I}_2\} \times \cdots \times \{I_m, \bar{I}_m\} \,\} \]
The chi-squared statistic is then calculated as
\[ \chi^2 = \sum_{X \in \mathcal{I}} \frac{(O(X) - E[X])^2}{E[X]} \]
where O(X) is the count of the number of transactions that contain the items in X. The expected value E[X] is calculated as
\[ E[X] = n \prod_{i=1}^{m} \frac{E[I_i]}{n} \]
where n is the number of transactions and E[I_i] is the count for item I_i (or for its absence, for a complemented entry).

Example contingency table:
          A     A'    TOTAL
B         15    55    70
B'        10    20    30
TOTAL     25    75    100

E[AB] = 17.5, E[AB'] = 7.5, E[A'B] = 52.5, E[A'B'] = 22.5
\[ \chi^2 = \frac{(15 - 17.5)^2}{17.5} + \frac{(10 - 7.5)^2}{7.5} + \frac{(55 - 52.5)^2}{52.5} + \frac{(20 - 22.5)^2}{22.5} = 1.587 \]
If all the values were independent, then the chi squared should be 0.
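The interest measure and the chi-squared calculation for the two-item case can be reproduced with a short script. It is a sketch restricted to a 2×2 contingency table, with hypothetical function names; the inputs are the counts from the A/B example (15 transactions with both A and B, 25 with A, 70 with B, out of 100).

```python
def interest(p_ab, p_a, p_b):
    """Interest (correlation) of A => B: P(A,B) / (P(A) * P(B))."""
    return p_ab / (p_a * p_b)


def chi_squared_2x2(n_ab, n_a, n_b, n):
    """Chi-squared statistic for two items, from joint and marginal counts."""
    observed = {
        ("A", "B"): n_ab,
        ("A", "B'"): n_a - n_ab,
        ("A'", "B"): n_b - n_ab,
        ("A'", "B'"): n - n_a - n_b + n_ab,
    }
    margin = {"A": n_a, "A'": n - n_a, "B": n_b, "B'": n - n_b}
    expected = {}
    chi2 = 0.0
    for (a, b), o in observed.items():
        # E[X] = n * (count_a / n) * (count_b / n)
        e = margin[a] * margin[b] / n
        expected[(a, b)] = e
        chi2 += (o - e) ** 2 / e
    return chi2, expected


chi2, expected = chi_squared_2x2(n_ab=15, n_a=25, n_b=70, n=100)
print(round(interest(0.15, 0.25, 0.70), 3))  # 0.857
print(expected[("A", "B")])                  # 17.5
print(round(chi2, 3))                        # 1.587
```

The four expected counts and the statistic agree with the values worked out by hand in the example.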
If the chi-squared value is less than 3.84 (the 95% critical value for one degree of freedom), then we should not reject the independence assumption.

Comparing AR Techniques
– Target
– Type
– Data Type
– Data Source
– Technique
– Itemset Strategy and Data Structure
– Transaction Strategy and Data Structure
– Optimization
– Parallelism Strategy

Comparison of AR Techniques
[comparison table not recovered]
