Association Rules Contents
Association Rules Problem Overview
– Large itemsets
Association Rules Algorithms
– Apriori
– Sampling
– Partitioning
– Parallel Algorithms
Example: Market Basket Data
Items frequently purchased together:
Bread => PeanutButter
Uses:
– Advertising
– Sales
– Communication
Objective: increase sales and reduce costs
Association Rule Definitions
Set of items: I={I1,I2,…,Im}
Transactions: D={t1,t2, …, tn}, where ti={Ii1,Ii2, …, Iik} and Iij ∈ I
Support (s) of an itemset: percentage of transactions which contain that itemset.
Large (frequent) itemset: itemset whose number of occurrences is above a threshold s.
Association Rule Definitions
Association Rule (AR): implication X => Y where X,Y ⊆ I and X ∩ Y = ∅
Support of AR (s) X => Y: percentage of transactions that contain X ∪ Y
Confidence of AR (α) X => Y: ratio of the number of transactions that contain X ∪ Y to the number that contain X
Association Rules Example
I = { Beer, Bread, Jelly, Milk, PeanutButter} Support of {Bread,PeanutButter} is 60%
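As a rough illustration of the two measures, the sketch below computes support and confidence in Python. The five-transaction table is an assumption (the transactions themselves are not listed here); it was chosen only so that the support of {Bread,PeanutButter} comes out at the 60% quoted above. The function names are illustrative.

```python
from fractions import Fraction

# ASSUMED transaction table (not given in the slides); picked so that
# support({Bread, PeanutButter}) equals the quoted 60%.
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in db), len(db))

def confidence(x, y, db):
    """support(X ∪ Y) / support(X) for the rule X => Y."""
    return support(set(x) | set(y), db) / support(x, db)

print(support({"Bread", "PeanutButter"}, transactions))       # 3/5
print(confidence({"Bread"}, {"PeanutButter"}, transactions))  # 3/4
```

Exact fractions are used instead of floats so that the percentages can be compared without rounding noise.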
Association Rules Ex (cont’d)
Association Rule Problem
Given a set of items I={I1,I2,…,Im} and a database of transactions D={t1,t2, …, tn} where ti={Ii1,Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X => Y with a minimum support (s) and confidence (α).
Association Rule Techniques
1. Find Large Itemsets (L).
2. Generate rules from frequent itemsets.
A set of items of size m has 2^m subsets.
Potential number of large itemsets: 2^m - 1 (every nonempty subset).
For m=5: 31 itemsets. For m=30: 1,073,741,823 itemsets.
Thus association rule algorithms are based on smart ways to reduce the number of itemsets to be counted.
Candidate: an itemset that is counted because it is potentially large.
Association rule Notations
Term      Description
D         Database of transactions
ti        Transaction in D
s         Support
α         Confidence
X,Y       Itemsets
X => Y    Association rule
L         Set of large itemsets
l         Large itemset in L
C         Set of candidate itemsets
p         Number of partitions
Algorithm to Generate ARs
Set                   Support (%)
Beer                  40
Bread                 80
Jelly                 20
Milk                  40
PeanutButter          60
Beer,Bread            20
Beer,Jelly            0
…….
Bread,PeanutButter    60
• s=30%, α=50%
• L=???
• L={{Beer},{Bread},{Milk},{PeanutButter},{Bread,PeanutButter}}
• Bread => PeanutButter: confidence = support({Bread,PeanutButter})/support({Bread}) = 60/80 = 0.75 = 75%
• PeanutButter => Bread: confidence = support({Bread,PeanutButter})/support({PeanutButter}) = 60/60 = 100%
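The rule generation above can be sketched in Python directly from the support table. Only the rows that are actually listed are used (the elided rows are left out), and the names large and rules are illustrative rather than part of any particular library.

```python
from itertools import combinations

# Supports in percent, copied from the listed rows of the table.
support = {
    frozenset({"Beer"}): 40,
    frozenset({"Bread"}): 80,
    frozenset({"Jelly"}): 20,
    frozenset({"Milk"}): 40,
    frozenset({"PeanutButter"}): 60,
    frozenset({"Beer", "Bread"}): 20,
    frozenset({"Beer", "Jelly"}): 0,
    frozenset({"Bread", "PeanutButter"}): 60,
}
MIN_SUP, MIN_CONF = 30, 50  # s and alpha, in percent

# Step 1: keep the itemsets that meet the minimum support.
large = {x for x, s in support.items() if s >= MIN_SUP}

# Step 2: split each large itemset into antecedent => consequent.
rules = {}
for itemset in large:
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(sorted(itemset), r)):
            conf = 100 * support[itemset] / support[lhs]
            if conf >= MIN_CONF:
                rules[(lhs, itemset - lhs)] = conf

for (lhs, rhs), conf in rules.items():
    print(sorted(lhs), "=>", sorted(rhs), f"{conf:.0f}%")
```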
Apriori
Large Itemset Property: Any subset of a large itemset is large. If an itemset is not large, none of its supersets are large.
Apriori Algorithm
The mining of association rules from large databases is a two-step process:
1. Find all frequent itemsets; that is, find all itemsets with frequency >= the minimum count.
2. From the frequent itemsets, generate association rules satisfying the minimum support and confidence conditions.
Generating Frequent Itemsets
The Apriori algorithm generates candidate itemsets of a particular size and then scans the database to count them and see whether they are large. An itemset is considered a candidate only if all of its subsets are also large.
Large Itemset Property
APRIORI
1. k = 1
2. Find frequent set Lk from Ck of all candidate itemsets
3. Form Ck+1 from Lk; k = k + 1
4. Repeat 2-3 until Ck is empty
Details about steps 2 and 3
– Step 2: scan D and count each itemset in Ck; if its count is at least minSup, it is frequent
– Step 3: see Apriori’s Candidate Generation
Apriori’s Candidate Generation
For k=1, C1 = all 1-itemsets (all individual items).
For k>1, generate Ck from Lk-1 as follows:
– The join step
  Ck = self-join of Lk-1 on the first k-2 items
  If both {a1, …, ak-2, ak-1} and {a1, …, ak-2, ak} are in Lk-1, then add {a1, …, ak-2, ak-1, ak} to Ck (items are kept sorted).
– The prune step
  Remove {a1, …, ak-2, ak-1, ak} if it contains a non-frequent (k-1)-subset
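The join and prune steps can be sketched as a small Python function. This is a minimal version under the assumption that each itemset is stored as a sorted tuple; the function name apriori_gen and the items a to d in the usage lines are hypothetical.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Build C_k from L_{k-1}; itemsets are sorted tuples of length k-1."""
    large_prev = set(large_prev)
    candidates = set()
    for a in large_prev:
        for b in large_prev:
            # Join step: same first k-2 items, a's last item sorts before b's.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune step: every (k-1)-subset must itself be large.
                if all(sub in large_prev
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return candidates

L2 = {("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")}
C3 = apriori_gen(L2)
print(C3)  # {('a', 'b', 'c')}
```

Here the join also produces ('b', 'c', 'd'), but it is pruned because its subset ('c', 'd') is not in L2.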
Apriori Ex (cont’d)
s=30%
α = 50%
Example – Finding frequent itemsets
Dataset D
TID   Items
T10   a1 a3 a4
T20   a2 a3 a5
T30   a1 a2 a3 a5
T40   a2 a5
minSupport=0.5 (a count of at least 2 of the 4 transactions)
1. scan D
C1: a1:2, a2:3, a3:3, a4:1, a5:3
L1: a1:2, a2:3, a3:3, a5:3
C2: a1a2, a1a3, a1a5, a2a3, a2a5, a3a5
2. scan D
C2: a1a2:1, a1a3:2, a1a5:1, a2a3:2, a2a5:3, a3a5:2
L2: a1a3:2, a2a3:2, a2a5:3, a3a5:2
C3: a2a3a5
Pruned C3: a2a3a5
3. scan D
L3: a2a3a5:2
3-itemsets
To generate the candidate 3-itemsets C3, we join L2 with itself; two itemsets are joined if they have their first k-2 items in common (in sorted order), which for k=3 means the first item. Only a2a3 and a2a5 qualify.
L2: a1a3:2, a2a3:2, a2a5:3, a3a5:2
C3: a2a3a5
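The trace for dataset D can be reproduced with a short Python sketch of the level-wise loop. It is a minimal version that assumes itemsets are sorted tuples and that minSupport=0.5 translates to a count of at least 2; the helper names are illustrative.

```python
from itertools import combinations

D = {
    "T10": {"a1", "a3", "a4"},
    "T20": {"a2", "a3", "a5"},
    "T30": {"a1", "a2", "a3", "a5"},
    "T40": {"a2", "a5"},
}
MIN_COUNT = 2  # minSupport = 0.5 of 4 transactions

def count(candidates, db):
    """One scan of db: number of transactions containing each candidate."""
    return {c: sum(set(c) <= t for t in db.values()) for c in candidates}

def apriori(db, min_count):
    items = sorted(set().union(*db.values()))
    candidates = [(i,) for i in items]          # C1: every single item
    levels = []
    while candidates:
        counts = count(candidates, db)
        large = {c: n for c, n in counts.items() if n >= min_count}
        if not large:
            break
        levels.append(large)
        candidates = []
        for a, b in combinations(sorted(large), 2):
            if a[:-1] == b[:-1]:                # join on the first k-2 items
                cand = a + (b[-1],)
                # prune: all (k-1)-subsets must be large
                if all(s in large for s in combinations(cand, len(a))):
                    candidates.append(cand)
    return levels

levels = apriori(D, MIN_COUNT)
for k, large in enumerate(levels, start=1):
    print(f"L{k}:", large)
```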
The Apriori Algorithm — Example
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5
Scan D
C1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3
L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3
C2
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}
Scan D
C2
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2
L2
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2
C3
itemset
{2 3 5}
Scan D
L3
itemset   sup
{2 3 5}   2
Generating Association Rules
1. First, for each large itemset l, generate all nonempty proper subsets of l.
2. Then, let ss represent one such subset. Consider the association rule R: ss => (l - ss), where (l - ss) indicates the set l without ss. Generate (and output) R if R fulfills the minimum confidence requirement.
3. Do so for every subset ss of l. Note that for simplicity, a single-item consequent is often desired.
4. Self-exercise: apply these steps to the given example.
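These steps can be sketched in Python on the four-transaction database (TIDs 100 to 400) from the Apriori example, starting from the large itemset {2 3 5}. The 70% minimum confidence is an assumption made only for this illustration, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Transactions 100, 200, 300, 400 from the Apriori example.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(set(itemset) <= t for t in db)

def rules_from(itemset, min_conf):
    """All rules ss => (itemset - ss) that meet min_conf."""
    itemset = frozenset(itemset)
    out = []
    for r in range(1, len(itemset)):            # nonempty proper subsets
        for ss in combinations(sorted(itemset), r):
            conf = Fraction(count(itemset), count(ss))
            if conf >= min_conf:
                out.append((set(ss), set(itemset - set(ss)), conf))
    return out

# ASSUMED threshold: 70% minimum confidence.
rules = rules_from({2, 3, 5}, Fraction(7, 10))
for lhs, rhs, conf in rules:
    print(lhs, "=>", rhs, conf)
```

With this threshold only {2, 3} => {5} and {3, 5} => {2} survive; the other four rules have confidence 2/3.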
Apriori Advantages/Disadvantages
Advantages:
– Uses large itemset property. – Easy to implement.
Disadvantages:
– Assumes transaction database is memory resident.
Sampling
For large databases: sample the database and apply Apriori to the sample (which is memory resident).
Potentially Large Itemsets (PL): the large itemsets from the sample. These are used as candidates to be counted using the entire database.
Negative Border (BD-):
– Additional candidates are determined by applying BD- against the large itemsets from the sample.
– Minimal set of itemsets which are not in PL, but whose subsets are all in PL.
The entire set of candidates is then C = BD-(PL) ∪ PL
Negative Border Example
Suppose that the set of items is {A,B,C,D} and PL={A,C,D,CD}.
BD-(PL)={B,AC,AD}
PL ∪ BD-(PL)={A,B,C,D,AC,AD,CD}
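The negative border in this example can be checked with a short Python sketch. It assumes PL is downward closed (every subset of a member is also a member), so testing the subsets one item smaller is enough; the function name negative_border is illustrative.

```python
from itertools import combinations

def negative_border(pl, items):
    """Itemsets not in pl whose immediate subsets are all in pl.

    Assumes pl is downward closed, as a set of large itemsets is.
    """
    pl = {frozenset(x) for x in pl}
    border = set()
    max_size = max((len(x) for x in pl), default=0) + 1
    for k in range(1, max_size + 1):
        for cand in map(frozenset, combinations(sorted(items), k)):
            if cand in pl:
                continue
            # For k = 1 the only subset is empty, so the test passes.
            if all(frozenset(s) in pl
                   for s in combinations(cand, k - 1) if s):
                border.add(cand)
    return border

PL = [{"A"}, {"C"}, {"D"}, {"C", "D"}]
bd = negative_border(PL, "ABCD")
print(sorted("".join(sorted(x)) for x in bd))  # ['AC', 'AD', 'B']
```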
Sampling Algorithm
1. Ds = sample of Database D;
2. PL = large itemsets in Ds using smalls (a support threshold lower than s);
3. C = PL ∪ BD-(PL);
4. Count C in Database using s;
5. ML = large itemsets in BD-(PL);
6. If ML = ∅ then done
7. else C = repeated application of BD-;
8. Count C in Database;
Sampling Example
Find AR assuming s = 20%
Ds = {t1,t2}
PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}
BD-(PL) = {{Beer},{Milk}}
ML = {{Beer},{Milk}}
Repeated application of BD- generates all remaining itemsets
Sampling Adv/Disadv
Advantages:
– Reduces number of database scans to one in the best case and two in worst.
Disadvantages:
– Potentially large number of candidates in second pass
Partitioning
Divide the database into partitions D1,D2,…,Dp.
Apply Apriori to each partition.
Any large itemset must be large in at least one partition.
Partitioning Algorithm
1. Divide D into partitions D1,D2,…,Dp;
2. For i = 1 to p do
3.   Li = Apriori(Di);
4. C = L1 ∪ … ∪ Lp;
5. Count C on D to generate L;
Partitioning Example
s=10%
D1: L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread,Jelly}, {Bread,PeanutButter}, {Jelly,PeanutButter}, {Bread,Jelly,PeanutButter}}
D2: L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread,Milk}, {Bread,PeanutButter}, {Milk,PeanutButter}, {Bread,Milk,PeanutButter}, {Beer}, {Beer,Bread}, {Beer,Milk}}
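The two-scan structure of partitioning can be sketched in Python. The transactions in D1 and D2 are an assumption (they are not listed here); they were chosen to be consistent with the L1 and L2 shown above. A brute-force counter stands in for Apriori(Di), which is only reasonable for partitions this small.

```python
from fractions import Fraction
from itertools import combinations

# ASSUMED partitions, chosen to reproduce the L1 and L2 of the example.
D1 = [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"}]
D2 = [{"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"}]
S = Fraction(10, 100)

def large_itemsets(db, min_frac):
    """Brute-force stand-in for Apriori(Di): try every subset of the items."""
    items = sorted(set().union(*db))
    result = set()
    for k in range(1, len(items) + 1):
        for cand in map(frozenset, combinations(items, k)):
            if sum(cand <= t for t in db) >= min_frac * len(db):
                result.add(cand)
    return result

def partition_mine(partitions, min_frac):
    # Scan 1: large itemsets of each partition, found independently.
    local = [large_itemsets(p, min_frac) for p in partitions]
    candidates = set().union(*local)
    # Scan 2: count the merged candidates on the whole database.
    whole = [t for p in partitions for t in p]
    large = {c for c in candidates
             if sum(c <= t for t in whole) >= min_frac * len(whole)}
    return local, large

local, L = partition_mine([D1, D2], S)
print(len(local[0]), len(local[1]), len(L))  # 7 10 14
```

At s=10% every itemset that occurs at least once is large, so all 14 merged candidates survive the second scan.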
Partitioning Adv/Disadv
Advantages:
– Adapts to available main memory – Easily parallelized – Maximum number of database scans is two.
Disadvantages:
– May have many candidates during second scan.
Parallelizing AR Algorithms
Based on Apriori
Techniques differ in:
– What is counted at each site
– How data (transactions) are distributed
Data Parallelism
– Data partitioned
– Count Distribution Algorithm
Task Parallelism
– Data and candidates partitioned
– Data Distribution Algorithm
Count Distribution Algorithm(CDA)
1. Place data partition at each site.
2. In Parallel at each site do
3.   C1 = Itemsets of size one in I;
4.   Count C1;
5.   Broadcast counts to all sites;
6.   Determine global large itemsets of size 1, L1;
7.   i = 1;
8.   Repeat
9.     i = i + 1;
10.    Ci = Apriori-Gen(Li-1);
11.    Count Ci;
12.    Broadcast counts to all sites;
13.    Determine global large itemsets of size i, Li;
14.  until no more large itemsets found;
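The counting pattern of CDA can be imitated sequentially in Python: every site counts the same candidates on its own partition, and the broadcast is modelled by summing the local counters. This is a single-process sketch, not a parallel implementation; the split of the a1 to a5 transactions across two sites is an assumption made for illustration.

```python
from collections import Counter
from itertools import combinations

def apriori_gen(large):
    """Join on the first k-2 items, then prune by (k-1)-subsets."""
    out = []
    for a, b in combinations(sorted(large), 2):
        if a[:-1] == b[:-1]:
            cand = a + (b[-1],)
            if all(s in large for s in combinations(cand, len(a))):
                out.append(cand)
    return out

def local_count(candidates, partition):
    """Counts of all candidates on one site's partition."""
    return Counter({c: sum(set(c) <= t for t in partition)
                    for c in candidates})

def count_distribution(sites, min_count):
    # The item universe I is assumed to be known at every site.
    items = sorted(set().union(*(t for part in sites for t in part)))
    candidates = [(i,) for i in items]
    result = {}
    while candidates:
        local = [local_count(candidates, part) for part in sites]
        total = sum(local, Counter())          # stands in for the broadcast
        large = {c for c in candidates if total[c] >= min_count}
        result.update({c: total[c] for c in large})
        candidates = apriori_gen(large)
    return result

# ASSUMED split of four transactions over two sites.
site1 = [{"a1", "a3", "a4"}, {"a2", "a3", "a5"}]
site2 = [{"a1", "a2", "a3", "a5"}, {"a2", "a5"}]
result = count_distribution([site1, site2], min_count=2)
print(result[("a2", "a3", "a5")])  # 2
```

Only counts cross site boundaries here, never transactions, which is the property that distinguishes count distribution from data distribution.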
CDA Example
Data Distribution Algorithm(DDA)
1. Place data partition at each site.
2. In Parallel at each site do
3.   Determine local candidates of size 1 to count;
4.   Broadcast local transactions to other sites;
5.   Count local candidates of size 1 on all data;
6.   Determine large itemsets of size 1 for local candidates;
7.   Broadcast large itemsets to all sites;
8.   Determine L1;
9.   i = 1;
10.  Repeat
11.    i = i + 1;
12.    Ci = Apriori-Gen(Li-1);
13.    Determine local candidates of size i to count;
14.    Count, broadcast, and find Li;
15.  until no more large itemsets found;
DDA Example
Advanced AR Techniques
– Generalized Association Rules
– Multiple-Level Association Rules
– Quantitative Association Rules
– Using multiple minimum supports
– Correlation Rules
Generalized Association Rules
Generalized Association Rules allow rules at different levels in the concept hierarchy.
ARs could be generated for any and all levels in the hierarchy.
A Generalized Association Rule, X => Y, is defined like a regular association rule with the restriction that no item in Y may be above any item in X.
Example
AR: Bread => Butter
– Grain => Butter
– Wheat Bread => Butter (lower support than Bread => Butter)
Multiple-Level Association Rules
A variation of Generalized Association Rules.
Itemsets may occur from any level in the hierarchy.
Large k-itemsets at one level in the concept hierarchy are used as candidates to generate large k-itemsets for children at the next level.
Multiple-Level Association Rules
Under the reduced minimum support concept, the following rules apply:
– The minimum support for all nodes in the hierarchy at the same level is identical.
– If s_i is the minimum support for level i in the hierarchy and s_{i-1} is the minimum support for level i-1, then s_{i-1} > s_i.
Quantitative Association Rules
AR algorithms assume the data is categorical.
A Quantitative Association Rule is one that involves categorical and quantitative data.
Example: A customer buys wine for between Rs30 and Rs50 a bottle => he also buys snacks
Traditional: A customer buys wine => he also buys snacks
Instead of having items {Bread,Butter} we might have the items as {(Bread:[0…1]), (Bread:[1…2]), (Bread:[2…3]), (Bread:[3… ]),….. }
The minimum support and confidence used may be lowered.
Splitting an item into intervals makes the minimum support condition harder to meet for each interval, so adjacent intervals are combined before the support is calculated.
Using multiple minimum supports
Is a single minimum support value feasible for many types of data?
Different items behave differently, so useful rules will be missed.
Skim Milk => Wheat Bread, s=3%
Milk => Bread, s=6%
Using multiple minimum supports
If s is too high, rules involving rare items will not be generated.
If s is too low, too many rules will be generated.
Approaches: combine clustering and association rules.
MISApriori (Minimum Item Support Apriori)
It allows a different support threshold to be indicated for each item. The minimum support for a rule is the minimum of all the minimum supports for each item in the rule.
Example
Items {A,B,C} with MIS(A)=20%, MIS(B)=3% and MIS(C)=4%.
Because the required support for A is so large, A by itself may be small, while both AB and AC may be large, because the required support is
AB: min(MIS(A),MIS(B))=3%
AC: min(MIS(A),MIS(C))=4%
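The effect of per-item thresholds can be sketched in a few lines of Python. The MIS values come from the example; the observed supports are hypothetical numbers picked to show A failing its own threshold while AB and AC pass theirs. This illustrates only the threshold rule, not the full MISApriori algorithm.

```python
# Minimum item supports from the example.
MIS = {"A": 0.20, "B": 0.03, "C": 0.04}

def required_support(itemset):
    """Threshold for an itemset: the smallest MIS among its items."""
    return min(MIS[i] for i in itemset)

def is_large(itemset, observed_support):
    return observed_support >= required_support(itemset)

# HYPOTHETICAL observed supports, for illustration only.
observed = {"A": 0.10, "AB": 0.05, "AC": 0.05}
for itemset, sup in observed.items():
    print(itemset, required_support(itemset), is_large(itemset, sup))
```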
Correlation Rules
A Correlation Rule is defined as a set of itemsets that are correlated.
Motivation: negative correlations may be useful.
Example
Two items {A,B}, A => B, s=15%, confidence=60%, P(B)=70%, P(A)=25%.
The confidence (60%) is below P(B) (70%), which shows that the probability of purchasing B has actually gone down once A is purchased. There seems to be a negative correlation between buying A and buying B.
Correlation can be expressed as
Correlation(A => B) = P(A,B)/(P(A)*P(B)) = 0.15/(0.25*0.7) = 0.857, which is less than 1.
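The arithmetic of this example can be checked with a few lines of Python. Exact fractions are used to avoid rounding noise; the variable names are illustrative.

```python
from fractions import Fraction

p_a = Fraction(25, 100)    # P(A)
p_b = Fraction(70, 100)    # P(B)
p_ab = Fraction(15, 100)   # P(A,B), the support of A => B

confidence = p_ab / p_a            # P(B|A)
correlation = p_ab / (p_a * p_b)   # below 1 means negative correlation

print(float(confidence))             # 0.6
print(round(float(correlation), 3))  # 0.857
```

A value of exactly 1 would mean that A and B are purchased independently of each other.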
Incremental Association Rules
Generate ARs in a dynamic database.
Problem: algorithms assume a static database.
Objective:
– Know large itemsets for D
– Find large itemsets for D ∪ {ΔD}
Must be large in either D or ΔD
Measuring Quality of Rules
– Support
– Confidence
– Interest (Correlation)
– Chi Squared Test
Support
s(A => B) = P(A,B)
α(A => B) = P(B|A)
Interest(Correlation)
Interest(A => B) = P(A,B)/(P(A)*P(B))
There is no difference between the values of Interest(A => B) and Interest(B => A).
Chi Squared Test
Chi Squared Test takes into account both the presence and absence of items in sets. It is used to measure how much an itemset count differs from the expected.
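Each transaction tj falls into exactly one presence/absence combination of the m items:
$$t_j \in \{I_1, \bar{I}_1\} \times \{I_2, \bar{I}_2\} \times \cdots \times \{I_m, \bar{I}_m\}$$
The chi squared statistic is then calculated, summing over every such combination X, as
$$\chi^2 = \sum_{X \in I} \frac{(O(X) - E[X])^2}{E[X]}$$
where O(X) is the count of the number of transactions that contain the items in X. The expected value E[X] is calculated as
$$E[X] = n \prod_{i=1}^{m} \frac{E[I_i]}{n}$$
where n is the number of transactions and E[I_i] is the number of transactions that match the i-th component of X (item present or absent).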
        A     A'    TOTAL
B       15    55    70
B'      10    20    30
TOTAL   25    75    100
E[AB]=17.5, E[AB']=7.5, E[A'B]=52.5, E[A'B']=22.5
$$\chi^2 = \sum_{X \in I} \frac{(O(X) - E[X])^2}{E[X]}$$
= (15-17.5)^2/17.5 + (10-7.5)^2/7.5 + (55-52.5)^2/52.5 + (20-22.5)^2/22.5 = 1.587
If the observed counts matched the independence expectation exactly, the chi squared value would be 0.
Since 1.587 < 3.84 (the critical value at the 95% confidence level with one degree of freedom), we should not reject the independence assumption.
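The chi squared computation for this 2x2 table can be reproduced with a short Python sketch. The expected counts are derived from the row and column totals rather than typed in, and exact fractions are used so the result is not affected by rounding; the variable names are illustrative.

```python
from fractions import Fraction

# Observed counts from the contingency table.
observed = {
    ("A", "B"): 15, ("A'", "B"): 55,
    ("A", "B'"): 10, ("A'", "B'"): 20,
}
n = sum(observed.values())

# Marginal totals for A / A' and for B / B'.
a_total, b_total = {}, {}
for (a, b), o in observed.items():
    a_total[a] = a_total.get(a, 0) + o
    b_total[b] = b_total.get(b, 0) + o

# E[X] = n * (a_total/n) * (b_total/n) under independence.
expected = {(a, b): Fraction(a_total[a] * b_total[b], n)
            for (a, b) in observed}

chi2 = sum((o - expected[x]) ** 2 / expected[x]
           for x, o in observed.items())

print(float(expected[("A", "B")]))  # 17.5
print(round(float(chi2), 3))        # 1.587
print(chi2 < Fraction(384, 100))    # True: independence is not rejected
```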
Comparing AR Techniques
– Target
– Type
– Data Type
– Data Source
– Technique
– Itemset Strategy and Data Structure
– Transaction Strategy and Data Structure
– Optimization
– Parallelism Strategy
Comparison of AR Techniques