Association Rule Market Basket Model A-priori Algorithm by yurtgc548

```
Association Rule
A-priori Algorithm
• A market basket is a collection of items
purchased by a customer in a single customer
transaction.

• A large set of items.
(e.g. things sold in a supermarket)
• A large set of baskets, each of which is a small
set of the items.
(e.g. the things one customer buys on one day)

Support
• Simplest question: find sets of items that
appear frequently in the baskets.
• Support for itemset I = the number of baskets
containing all items in I.
• Given a support threshold s, sets of items that
appear in > s baskets are called frequent
itemsets.

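Applications
• “Baskets” = documents; “items” = words in
those documents.
– Lets us find words that appear together unusually
frequently.
• “Baskets” = Web pages; “items” = linked pages.
– Pairs of pages with many common references may
be about the same topic.
• Real market baskets: chain stores keep
records of what customers buy together.
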
Mining Association Rule
• Example of a Retail Store….

Transactions         Items
T5                   Beer, Milk

• Support of an item (or set of items) is the
percentage of transactions in which that item
occurs.
• In this case, the 5 transactions count as 100%.

Mining Association Rule

•   Beer occurs in two transactions (T4 and T5).
•   Therefore, support(Beer) = 2/5 = 40%.
•   Beer and Jelly occur together in no transaction, so support = 0%.
•   Beer and Milk occur together only in T5.
•   Therefore, support(Beer, Milk) = 1/5 = 20%.

List of a few itemsets and their support

S.No   Set             Support (%)
1      Beer            40
3      Jelly           20
4      Milk            40
5      Butter          60
7      Beer, Milk      20
11     Milk, Butter    20

Mining Association Rule
Definition (Support):

The support for an association rule
X → Y
is the percentage of transactions in the database
that contain
X ∪ Y.

Mining Association Rule
Definition (Confidence):

The confidence or strength (σ) for an association
rule X → Y is the ratio of the number of transactions
that contain X ∪ Y to the number of
transactions that contain X.

Mining Association Rule
Bread occurs in 4 transactions (T1 to T4).
Bread and Butter occur together 3 times (T1, T2, T3).
Therefore…
σ = 3/4, i.e. 75%

X        Y        S     σ
Jelly    Milk     0%    0%

Confidence shows that Bread → Butter is a stronger
rule than Jelly → Milk.

Association Rules
• {i1, i2,…,ik} → j means: “if a basket contains all
of i1,…,ik then it is likely to contain j.”
• Confidence of this association rule is the
probability of j given i1,…,ik.

Mining Association Rule
A large itemset is an itemset whose number of
occurrences is above a threshold s.
L = the complete set of large itemsets.

Suppose m is the number of items.
No. of subsets = pow(2,m)
No. of potentially large itemsets = pow(2,m) – 1
(excluding the empty set).
e.g. m = 5 gives 31 itemsets.

AR Algorithm (Example)
Suppose the input support and confidence thresholds are
s = 30% and σ = 50%.
Consider the large itemset
l = {Bread, Butter},
so that {Bread} and {Butter} are the two non-empty proper subsets of l.
support({Bread, Butter}) / support({Bread}) = 60 / 80 = 0.75

Thus the confidence of the association rule Bread → Butter is 75%. Since
this is above the given threshold, it is a valid association rule.

Important Point
• “Market Baskets” is an abstraction that
models any many-many relationship between
two concepts: “items” and “baskets.”
– Items need not be “contained” in baskets.
• The only difference is that we count co-occurrences
of items related to a basket, not
vice-versa.

Association Mining?
•   Association rule mining:
–   Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories.
•   Applications:
–   Basket data analysis, cross-marketing, catalog design, loss-leader
analysis.
•   Examples.
–   Rule form: “Body → Head [support, confidence]”.
–   major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]
Mining Association Rules—An Example

Min. support 50%
Min. confidence 50%

Transaction ID   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A → C:
support = support({A ∪ C}) = 50%
confidence = support({A ∪ C})/support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
Mining Frequent Itemsets
• Find the frequent itemsets: the sets of items that
have minimum support
– A subset of a frequent itemset must also be a frequent
itemset
• i.e., if {AB} is a frequent itemset, both {A} and {B} must be
frequent itemsets
The Apriori Algorithm: Basic idea
• The name of the algorithm is based on the fact that the algorithm uses
prior knowledge of frequent item set properties.

• k-itemsets are used to explore (k+1)-itemsets

• First the set of frequent 1-itemsets is found by scanning the database
to accumulate the count for each item and collecting those items that
satisfy minimum support.

• The resulting set is denoted by   L1.
• L1 is used to find L2 (set of frequent 2-itemsets).
• L2 is used to find L3 (set of frequent 3-itemsets).
The Apriori Algorithm:
• How Lk-1 is used to find Lk, where k >= 2

• A two-step process is followed…
– Join
– Prune
• Join:
– To find Lk, a set of candidate k-itemsets is generated by joining
Lk-1 with itself.
– The set of candidates is denoted by Ck.

• Prune:
– Ck is a superset of Lk.
– That is, its members may or may not be frequent.
Apriori Algorithm for Boolean Association Rule:
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent
cannot be a subset of a frequent k-itemset
• Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k
    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1
            that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return L = ∪k Lk;
The Apriori Algorithm — Example
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1)
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2 with counts
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3
itemset
{2 3 5}

Scan D → L3
itemset   sup
{2 3 5}   2
Problem: Generate candidate itemsets and
frequent itemsets where the minimum
support count is 2.

Transaction-ID   List of Item IDs
T100             I1, I2, I5
T200             I2, I4
T300             I2, I3
T400             I1, I2, I4
T500             I1, I3
T600             I2, I3
T700             I1, I3
T800             I1, I2, I3, I5
T900             I1, I2, I3
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1

• Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
Example of Generating Candidates
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace

• Pruning:
– acde is removed because ade is not in L3

• C4={abcd}
Improving Apriori’s Efficiency
• Hash-based itemset counting: A k-itemset whose corresponding
hashing bucket count is below the threshold cannot be frequent

• Transaction reduction: A transaction that does not contain any frequent
k-itemset is useless in subsequent scans

• Partitioning: Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB

• Sampling: mine a subset of the given data with a lower support threshold,
plus a method to determine the completeness

• Dynamic itemset counting: add new candidate itemsets immediately
(unlike Apriori) when all of their subsets are estimated to be frequent
Association Rule Mining
• Types with the description of
– Multiple (Multi Level) AR from Transaction DB.
– Multi Dimensional AR from RDB.

There are many types of AR.
AR can be classified in various ways based on the following
criteria….
1. Based on the types of values handled in the rule:
(Boolean AR and Quantitative AR)
# If a rule concerns associations between the presence or absence
of items, it is a Boolean association rule.
AR Types

AR1: computer → financial_management_software [support = 2%, confidence = 60%]

• A support of 2% for AR1
means that 2% of all the transactions under analysis show that
computer and financial_management_software
are purchased together.
• A confidence of 60% for AR1
means that 60% of the customers who purchased a computer
also bought the software.
Typically, association rules are considered interesting if they
satisfy both a minimum support threshold and minimum
confidence threshold.
Such thresholds can be set by users or domain experts.
AR Types
# If a rule describes associations between quantitative items or
attributes, then it is a Quantitative Association Rule.
• In these rules, quantitative values for items or attributes are
partitioned into intervals.
• Association Rule 2 (AR2) below is an example of a quantitative
association rule.

• Note that the quantitative attributes, age and income, have
been discretized.
Multi Dimensional AR
2. Based on the dimensions of data involved in the rule
(Single D AR and Multi D AR)
# If the items or attributes in an association rule each reference only one
dimension, then it is a single dimensional association rule.
Note that AR 1 could be rewritten as..

•   AR 1 is a single-dimensional association rule since it refers to only one dimension.
#   If a rule references two or more dimensions,
such as the dimensions buys, time of transaction and customer category,
then it is a multidimensional association rule.
AR 2 is a multidimensional association rule since it involves three dimensions:
Multi Level AR
3. Based on the levels of abstractions involved in the rule
(Single Level AR and Multi Level AR)
• Some methods for association rule mining can find rules at different levels of
abstraction.
• For example:
Suppose that a set of mined association rules includes AR 3 and AR 4 below.

• In AR3 and AR4, the items bought are referenced at different levels of
abstraction.
(i.e. “computer” is a higher level abstraction of “laptop computer").
• We refer to the rule set mined as consisting of multilevel association rules.

• If, instead, the rules within a given set do not reference items or attributes at
different levels of abstraction, then the set contains single-level association
rules.
4. Based on the nature of the association involved in the rule:

Association mining can be extended to correlation analysis
where the absence or presence of correlated items can be
identified.

```
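The support and confidence definitions in the slides can be checked with a few lines of Python. The block below is a minimal sketch, not part of the original deck: the function names `support`, `confidence` and `rules_from_itemset` are hypothetical, and the data is the four-transaction table (TIDs 2000, 1000, 4000, 5000) from the "Mining Association Rules" example slide. Returning 0.0 when the antecedent never occurs is an assumption made here to avoid a division by zero.

```python
from itertools import combinations

# Transactions from the example slide (TIDs 2000, 1000, 4000, 5000).
TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(X ∪ Y) / support(X) for the rule X -> Y."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # assumption: a rule whose antecedent never occurs gets confidence 0
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


def rules_from_itemset(itemset, transactions, min_conf):
    """Yield (X, Y, confidence) for each split of `itemset` into X -> Y that meets min_conf."""
    items = sorted(itemset)
    for r in range(1, len(items)):
        for lhs in combinations(items, r):
            rhs = tuple(i for i in items if i not in lhs)
            conf = confidence(lhs, rhs, transactions)
            if conf >= min_conf:
                yield lhs, rhs, conf


if __name__ == "__main__":
    print(support({"A", "C"}, TRANSACTIONS))        # 0.5, matching the 50% on the slide
    print(confidence({"A"}, {"C"}, TRANSACTIONS))   # about 0.667, matching 66.6%
    for lhs, rhs, conf in rules_from_itemset({"A", "C"}, TRANSACTIONS, 0.5):
        print(lhs, "->", rhs, round(conf, 3))
```

With a minimum confidence of 50%, both A → C (about 66.7%) and C → A (100%) pass, which shows that confidence is not symmetric even though the two rules have the same support.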
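The join step, the prune step and the level-wise loop from the pseudo-code slide can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the deck's own implementation: the names `apriori_gen` and `apriori` are chosen here, itemsets are represented as sorted tuples, and "with min_support" is read as a support count greater than or equal to `min_count`, which is what the worked Database D example implies.

```python
from itertools import combinations


def apriori_gen(prev_frequent, k):
    """Join + prune: build candidate k-itemsets from the frequent (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    candidates = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join: p and q agree on their first k-2 items and p's last item < q's last item.
            if p[:k - 2] == q[:k - 2] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: drop c if any of its (k-1)-subsets is not frequent.
                if all(s in prev_set for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates


def apriori(transactions, min_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # First scan: count single items to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    current = {c: n for c, n in counts.items() if n >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        candidates = apriori_gen(current.keys(), k)
        # One database scan per level: count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent


if __name__ == "__main__":
    # Database D from the worked example (TIDs 100, 200, 300, 400).
    database_d = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    for itemset, count in sorted(apriori(database_d, 2).items(), key=lambda x: (len(x[0]), x[0])):
        print(itemset, count)

    # Candidate generation example: L3 = {abc, abd, acd, ace, bcd}.
    l3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
    print(apriori_gen(l3, 4))  # {('a', 'b', 'c', 'd')}; acde is pruned because ade is not in L3
```

Counting every candidate against every transaction is the simplest possible scan. The hash-based counting and transaction reduction ideas listed under "Improving Apriori's Efficiency" are not implemented in this sketch.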