# Data Mining and Exploration: Association Rules

```
Data Mining and Exploration: Association Rules

Amos Storkey, School of Informatics
February 7, 2006
http://www.inf.ed.ac.uk/teaching/courses/dme/

These lecture slides are based extensively on previous versions of the
course written by Chris Williams.

Association Rules

- Itemsets, association rules
- Frequency, accuracy
- APRIORI algorithm

§6.1, 6.2

- We are looking for patterns, i.e. local regularities in the data
- Examples of frequent itemsets, association rules
  - 10% of supermarket customers buy wine and cheese
  - If a person visits the CNN website, there is a 60% chance that they
    will visit the ABC website in the same month
- Association rules are like classification rules, except that they can
  predict any attribute, not just the class
- Association rules are not intended to be used together as a set
  (cf. classification rules)

- Example of association rules: market basket analysis, the process of
  analyzing customer buying habits by finding associations between items
  that customers place in their baskets
- Each row of the data matrix has a 1 if the corresponding product was
  in the basket. Data is often sparse
- Can recode k-valued categorical variables (e.g. outlook = {sunny,
  overcast, rainy}) as k binary variables

Itemsets, Frequency, Accuracy

- An itemset is a pattern defined by

      (A_{i_1} = a_{j_1}) ∧ (A_{i_2} = a_{j_2}) ∧ ... ∧ (A_{i_k} = a_{j_k})

- The frequency (or support) of an itemset X is simply P(X)
- Example: in the “Play Tennis” data

      P(Humidity = Normal ∧ Play = Yes ∧ Windy = False) = 4/14

- The accuracy (or confidence) of an association rule "if Y=y then Z=z" is

      P(Z = z | Y = y)

- Example

      P(Windy = False ∧ Play = Yes | Humidity = Normal) = 4/7

Play Tennis Example

Day   Outlook    Temperature   Humidity   Wind    PlayTennis
D1    Sunny      Hot           High       False   No
D2    Sunny      Hot           High       True    No
D3    Overcast   Hot           High       False   Yes
D4    Rain       Mild          High       False   Yes
D5    Rain       Cool          Normal     False   Yes
D6    Rain       Cool          Normal     True    No
D7    Overcast   Cool          Normal     True    Yes
D8    Sunny      Mild          High       False   No
D9    Sunny      Cool          Normal     False   Yes
D10   Rain       Mild          Normal     False   Yes
D11   Sunny      Mild          Normal     True    Yes
D12   Overcast   Mild          High       True    Yes
D13   Overcast   Hot           Normal     False   Yes
D14   Rain       Mild          High       True    No

Generating rules from itemsets

- An itemset of size k can give rise to 2^k - 1 rules
- Example. Itemset

      Windy=False, Play=Yes, Humidity=Normal

  gives rise to 7 rules including

      IF Windy=False and Humidity=Normal THEN Play=Yes            (4/4)
      IF Play=Yes THEN Humidity=Normal and Windy=False            (4/9)
      IF True THEN Windy=False and Play=Yes and Humidity=Normal   (4/14)

- Select association rules that have accuracy greater than some
  threshold a

Finding Frequent Itemsets

- Task: find all itemsets with frequency ≥ s
- Key observation: a set X of variables can be frequent only if all
  subsets of variables are frequent (monotonicity property),
  i.e. P(A, B) ≤ P(A) and P(A, B) ≤ P(B)
- So find frequent singleton sets, then sets of size 2, and so on ...
- An efficient algorithm using this idea for finding frequent itemsets
  is the APRIORI algorithm (Agrawal and Srikant (1994), Mannila et al
  (1994))

APRIORI algorithm

(for binary variables)

    i = 1
    C_i = {{A} | A is a variable}
    while C_i is not empty
        database pass:
            for each set in C_i test if it is frequent
            let L_i be collection of frequent sets from C_i
        candidate formation:
            let C_{i+1} be those sets of size i + 1
                all of whose subsets are frequent
        i = i + 1
    end while

- Single database pass is linear in |C_i| n, make a pass for each i
  until C_i is empty
- Candidate formation
  - Find all pairs of sets {U, V} from L_i such that U ∪ V has size
    i + 1 and test if this union is really a potential candidate.
    O(|L_i|^3)
- Example: 5 three-item sets

      (ABC), (ABD), (ACD), (ACE), (BCD)

  Candidate four-item sets

      (ABCD) ok
      (ACDE) not ok because (CDE) is not present above

- Data structure techniques can be used for speedups
- Other algorithms possible for finding frequent itemsets, e.g. Han’s
  FP-growth

APRIORI and Algorithm Components

- Task: Rule Pattern Discovery
- Structure: Association Rules
- Score Function: Support
- Search: Breadth First with Pruning
- Data Management Technique: Linear Scans

Comments on Association Rules

- Finding Association Rules is just the beginning in a data mining
  effort. Some will be trivial, others interesting. Challenge is to
  select potentially interesting rules
- Finding Association rules as Exploratory Data Analysis
- Trivial rule example:

      pregnant ⇒ female

  with accuracy 1!
- For rule A ⇒ B, it can be useful to compare P(B|A) to P(B)
- APRIORI algorithm can be generalized to frequent structure mining,
  e.g. finding episodes from sequences or frequently-occurring trees
- Example application: Health Insurance Commission (HIC) in Australia
  detected patterns of ordering of medical tests that suggested that
  some of the tests ordered were unnecessary (Cabena et al, 1998)

Summary

- Finding frequent itemsets
  - Done with APRIORI algorithm
- Given frequent itemsets, construct association rules with
  accuracy > a
- Select interesting rules
- Generalize to frequent structure mining
```
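The definitions in the slides (support as P(X), accuracy as P(Z | Y), level-wise candidate formation, and the 2^k - 1 rules per itemset) can be sketched in a few lines of Python on the Play Tennis table. This is a minimal illustration under stated assumptions, not the course's reference implementation: the function names `support`, `confidence`, `apriori` and `rules` are hypothetical, the attribute names `Windy` and `Play` follow the slide text instead of the table headers `Wind` and `PlayTennis`, and no data-structure speedups are attempted.

```python
from itertools import combinations

# Play Tennis table from the slides; names "Windy"/"Play" are an assumption
# (the table header says Wind/PlayTennis, the slide text says Windy/Play).
COLUMNS = ("Outlook", "Temperature", "Humidity", "Windy", "Play")
ROWS = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rain", "Mild", "High", False, "Yes"),
    ("Rain", "Cool", "Normal", False, "Yes"),
    ("Rain", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rain", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rain", "Mild", "High", True, "No"),
]
# Each record becomes a set of (attribute, value) items.
RECORDS = [frozenset(zip(COLUMNS, row)) for row in ROWS]


def support(itemset, records=RECORDS):
    """Frequency P(X): fraction of records containing every item of X."""
    return sum(itemset <= r for r in records) / len(records)


def confidence(antecedent, consequent, records=RECORDS):
    """Accuracy P(Z | Y) = P(Y and Z) / P(Y)."""
    return support(antecedent | consequent, records) / support(antecedent, records)


def apriori(records, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    candidates = {frozenset([item]) for r in records for item in r}
    frequent = {}
    size = 1
    while candidates:
        # Database pass: keep only the candidates that are frequent (L_i).
        level = {c: support(c, records) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Candidate formation: unions of size i + 1 all of whose
        # size-i subsets are frequent (C_{i+1}).
        size += 1
        candidates = {
            u | v
            for u, v in combinations(level, 2)
            if len(u | v) == size
            and all(frozenset(s) in level for s in combinations(u | v, size - 1))
        }
    return frequent


def rules(itemset, records=RECORDS):
    """All 2^k - 1 rules from an itemset: any proper subset (including the
    empty set, i.e. IF True) as antecedent, the remaining items as consequent."""
    out = []
    for n in range(len(itemset)):
        for ante in combinations(itemset, n):
            ante = frozenset(ante)
            cons = itemset - ante
            out.append((ante, cons, confidence(ante, cons, records)))
    return out
```

For example, with `x = frozenset({("Humidity", "Normal"), ("Play", "Yes"), ("Windy", False)})`, `support(x)` reproduces the 4/14 from the slides, `rules(x)` yields the 7 rules, and `apriori(RECORDS, 4/14)` includes `x` among the frequent itemsets. Filtering the output of `rules` by a threshold a gives the rule-selection step.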