Association Rule Market Basket Model A-priori Algorithm by yurtgc548


									  Association Rule
Market Basket Model
 A-priori Algorithm
          The Market-Basket Model
• A market basket is a collection of items
  purchased by a customer in a single customer transaction.

• A large set of items.
  (e.g. things sold in a supermarket)
• A large set of baskets, each of which is a small
  set of the items.
  (e.g. the things one customer buys on one day)

• Simplest question: find sets of items that
  appear “frequently” in the baskets.
• Support for itemset I = the number of baskets
  containing all items in I.
• Given a support threshold s, sets of items that
  appear in > s baskets are called frequent itemsets.

• “Baskets” = documents; “items” = words in
  those documents.
  – Lets us find words that appear together unusually
    frequently, i.e., linked concepts.
• “Baskets” = Web pages; “items” = linked pages.
  – Pairs of pages with many common references may
    be about the same topic.
• Real market baskets: chain stores keep
  terabytes of information about what
  customers buy together.
         Mining Association Rule
• Example of a Retail Store….

 Transactions         Items
 T1                   Bread, Jelly, Butter
 T2                   Bread, Butter
 T3                   Bread, Milk, Butter
 T4                   Beer, Bread
 T5                   Beer, Milk

• Support of an item (or set of items) is the
  percentage of transactions in which that item (or set) occurs.
• Here the 5 transactions count as 100%.

         Mining Association Rule

•   Beer occurs twice (T4 and T5), so its support = 40%.
•   Beer and Jelly occur together in 0 transactions, so their support = 0%.
•   Beer and Milk occur together once (T5), so their support = 20%.
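The support figures above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `support` and the dictionary layout are illustrative choices, not part of the original slides.

```python
# Transactions from the retail-store example.
transactions = {
    "T1": {"Bread", "Jelly", "Butter"},
    "T2": {"Bread", "Butter"},
    "T3": {"Bread", "Milk", "Butter"},
    "T4": {"Beer", "Bread"},
    "T5": {"Beer", "Milk"},
}

def support(itemset, transactions):
    """Percentage of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return 100 * hits / len(transactions)

print(support({"Beer"}, transactions))           # 40.0
print(support({"Beer", "Jelly"}, transactions))  # 0.0
print(support({"Beer", "Milk"}, transactions))   # 20.0
```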

          List of a few itemsets and their support

Tran    Items
T1      Bread, Jelly, Butter
T2      Bread, Butter
T3      Bread, Milk, Butter
T4      Beer, Bread
T5      Beer, Milk

S.No    Set                     Support (%)
1       Beer                    40
2       Bread                   80
3       Jelly                   20
4       Milk                    40
5       Butter                  60
6       Beer, Bread             20
7       Beer, Milk              20
8       Bread, Jelly            20
9       Bread, Jelly, Butter    20
10      Bread, Milk, Butter     20
11      Milk, Butter            20
12      Bread, Butter           60

       Mining Association Rule
Definition (Support):

The support for an association rule
   X → Y
 is the percentage of transactions in the database
  that contain
  X ∪ Y

       Mining Association Rule
Definition (Confidence):

The confidence or strength (σ) for an association rule
   X → Y is the ratio of the number of transactions
  that contain X ∪ Y to the number of
  transactions that contain X.

            Mining Association Rule
Bread occurs in 4 transactions (T1 to T4).
Bread and Butter occur together 3 times (T1, T2, T3).
σ = 3/4, i.e. 75%

    X            Y          s       σ
    Bread   →    Butter     60%     75%
    Butter  →    Bread      60%     100%
    Jelly   →    Milk       0%      0%

Confidence shows that Bread → Butter is a stronger
  rule than Jelly → Milk.
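The three rules in the table can be checked with a short sketch. The helper names `count` and `confidence` are illustrative, and returning 0% when X never occurs is an assumption made here to avoid a division by zero.

```python
transactions = [
    {"Bread", "Jelly", "Butter"},  # T1
    {"Bread", "Butter"},           # T2
    {"Bread", "Milk", "Butter"},   # T3
    {"Beer", "Bread"},             # T4
    {"Beer", "Milk"},              # T5
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def confidence(x, y, transactions):
    """Confidence of X -> Y as a percentage: count(X u Y) / count(X)."""
    x, y = set(x), set(y)
    n_x = count(x, transactions)
    if n_x == 0:
        return 0.0  # assumption: a rule whose left side never occurs gets 0%
    return 100 * count(x | y, transactions) / n_x

print(confidence({"Bread"}, {"Butter"}, transactions))  # 75.0
print(confidence({"Butter"}, {"Bread"}, transactions))  # 100.0
print(confidence({"Jelly"}, {"Milk"}, transactions))    # 0.0
```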
             Association Rules
• If-then rules about the contents of baskets.
• {i1, i2,…,ik} → j means: “if a basket contains all
  of i1,…,ik then it is likely to contain j.”
• Confidence of this association rule is the
  probability of j given i1,…,ik.

        Mining Association Rule
A large itemset is an itemset whose number of
  occurrences is above a threshold s.
L = the complete set of large itemsets.

m = number of items.
No. of subsets = pow(2,m)
No. of potentially large itemsets = pow(2,m) – 1,
                     excluding the empty set.
e.g. m = 5
31 itemsets
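To make the count concrete, the sketch below enumerates every non-empty subset of five items. The item names are taken from the retail example; the function name `nonempty_subsets` is illustrative.

```python
from itertools import combinations

def nonempty_subsets(items):
    """All non-empty subsets of `items`, smallest first."""
    items = sorted(items)
    return [set(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

items = ["Beer", "Bread", "Jelly", "Milk", "Butter"]  # m = 5
candidates = nonempty_subsets(items)
print(len(candidates))  # 31, i.e. pow(2, 5) - 1
```

The exponential growth in m is the reason a naive search over all subsets does not scale to a large item catalogue.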
                AR Algorithm (Example)
The input support and confidence are
 s = 30%    and    σ = 50%
The large itemsets are given by
L = {{Beer}, {Bread}, {Milk}, {Butter}, {Bread, Butter}}
Let l = {Bread, Butter}
such that
{Bread} and {Butter} are the two non-empty proper subsets of l
     support({Bread, Butter}) / support({Bread}) = 60 / 80 = 0.75

Thus the confidence of the association rule Bread → Butter is 75%. Since
   this is above the given threshold, it is a valid association rule.
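The rule-generation step of this example can be sketched as follows. The support values are the ones listed earlier for the retail data; the function name `generate_rules` and the tuple layout of the result are assumptions made for illustration.

```python
from itertools import combinations

# Supports (in %) of the large itemsets L for s = 30%.
support = {
    frozenset({"Beer"}): 40,
    frozenset({"Bread"}): 80,
    frozenset({"Milk"}): 40,
    frozenset({"Butter"}): 60,
    frozenset({"Bread", "Butter"}): 60,
}

def generate_rules(support, min_conf):
    """Return (X, Y, confidence) for every rule X -> Y that meets min_conf."""
    rules = []
    for l, sup_l in support.items():
        if len(l) < 2:
            continue  # a rule needs a non-empty X and a non-empty Y
        for r in range(1, len(l)):
            for x in combinations(sorted(l), r):
                x = frozenset(x)
                conf = sup_l / support[x]  # support(l) / support(X)
                if conf >= min_conf:
                    rules.append((set(x), set(l - x), conf))
    return rules

for x, y, conf in generate_rules(support, min_conf=0.5):
    print(x, "->", y, f"{conf:.0%}")
```

Both Bread → Butter (75%) and Butter → Bread (100%) pass the 50% confidence threshold.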

              Important Point
• “Market Baskets” is an abstraction that
  models any many-many relationship between
  two concepts: “items” and “baskets.”
  – Items need not be “contained” in baskets.
• The only difference is that we count co-occurrences
  of items related to a basket, not vice versa.

                  Association Mining?
•   Association rule mining:
    –   Finding frequent patterns, associations, correlations, or
        causal structures among sets of items or objects in
        transaction databases, relational databases, and other
        information repositories.
•   Applications:
    –   Basket data analysis, cross-marketing, catalog design, loss-
        leader analysis, clustering, classification, etc.
•   Examples.
    –   Rule form: “Body → Head [support, confidence]”.
    –   buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
    –   major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]
   Mining Association Rules—An Example

Min. support 50%
Min. confidence 50%

Transaction ID   Items Bought
    2000         A,B,C
    1000         A,C
    4000         A,D
    5000         B,E,F

Frequent Itemset    Support
{A}                 75%
{B}                 50%
{C}                 50%
{A,C}               50%

 For rule A → C:
    support = support({A, C}) = 50%
    confidence = support({A, C})/support({A}) = 66.6%
 The Apriori principle:
    Any subset of a frequent itemset must be frequent
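The frequent-itemset table and the confidence of A → C can be checked by brute force. This sketch simply tests every possible itemset, which is only practical for a toy database like this one; it is not the Apriori algorithm itself.

```python
from itertools import combinations

transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions.values() if set(itemset) <= t)
    return hits / len(transactions)

items = sorted(set().union(*transactions.values()))
frequent = {
    c: support(c)
    for r in range(1, len(items) + 1)
    for c in combinations(items, r)
    if support(c) >= 0.5            # min. support 50%
}
print(frequent)

conf_a_c = support(("A", "C")) / support(("A",))
print(f"confidence(A -> C) = {conf_a_c:.1%}")
```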
          Mining Frequent Itemsets
• Find the frequent itemsets: the sets of items that
  have minimum support
   – A subset of a frequent itemset must also be a frequent itemset
      • i.e., if {AB} is a frequent itemset, both {A} and {B} must be
        frequent itemsets
   The Apriori Algorithm: Basic idea
• The name of the algorithm is based on the fact that the algorithm uses
  prior knowledge of frequent item set properties.

• k-itemsets are used to explore (k+1)-itemsets.

• First the set of frequent 1-itemsets is found by scanning the database
  to accumulate the count for each item and collecting those items that
  satisfy minimum support.

• The resulting set is denoted by   L1.
• L1 is used to find L2 (set of frequent 2-itemsets).
• L2 is used to find L3 (set of frequent 3-itemsets).
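The first pass, which produces L1, might look like the sketch below. It reuses the four-transaction A..F database with a 50% minimum support; the variable names are illustrative.

```python
from collections import Counter

transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]
min_support = 0.5

# One scan of the database accumulates the count of each item.
item_counts = Counter(item for t in transactions for item in t)

# Keep the items whose count satisfies minimum support: this is L1.
min_count = min_support * len(transactions)
L1 = {frozenset([item]) for item, n in item_counts.items() if n >= min_count}
print(sorted(sorted(s) for s in L1))  # [['A'], ['B'], ['C']]
```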
                  The Apriori Algorithm:
• How Lk-1 is used to find Lk, where k >= 2

• A two-step process is followed…
   – Join
   – Prune
• Join:
    – To find Lk, a set of candidate k-itemsets is generated by joining
      Lk-1 with itself.
    – The set of candidates is denoted by Ck.

• Prune:
    – Ck is a superset of Lk.
    – That is, its members may or may not be frequent.
     Apriori Algorithm for Boolean Association Rule:
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent
  cannot be a subset of a frequent k-itemset
• Pseudo-code:
       Ck: Candidate itemsets of size k
       Lk: frequent itemsets of size k
       L1 = {frequent items};
       for (k = 1; Lk != ∅; k++) do begin
           Ck+1 = candidates generated from Lk;
           for each transaction t in database do
               increment the count of all candidates in Ck+1
               that are contained in t
           Lk+1 = candidates in Ck+1 with min_support
       end
       return L = ∪k Lk;
      The Apriori Algorithm — Example
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1)
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2 with counts
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (generated from L2)
itemset
{2 3 5}

Scan D → L3
itemset   sup
{2 3 5}   2
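The level-wise loop can be sketched in Python and run on database D. This is a minimal illustration, not a tuned implementation: the join here takes the union of any two frequent k-itemsets that yields a (k+1)-itemset, which after pruning gives the same candidates as the prefix-based join. The minimum support count of 2 is inferred from the tables above, where {4} with count 1 is dropped.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items and keep the frequent ones (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_count}
    frequent = dict(current)

    k = 1
    while current:
        # Join: unions of two frequent k-itemsets that have k+1 items.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # Scan: count the surviving candidates against the database.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```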
Problem: Generate candidate itemsets and
frequent itemsets where the minimum
support count is 2.

    Transaction-ID   List of Item IDs
    T100             I1, I2, I5
    T200             I2, I4
    T300             I2, I3
    T400             I1, I2, I4
    T500             I1, I3
    T600             I2, I3
    T700             I1, I3
    T800             I1, I2, I3, I5
    T900             I1, I2, I3
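One way to check a hand-worked answer to this problem is exhaustive counting: enumerate every subset of every transaction and keep those seen at least twice. This is a brute-force cross-check, not Apriori, and is only feasible because the transactions are tiny.

```python
from collections import Counter
from itertools import combinations

database = {
    "T100": ["I1", "I2", "I5"],
    "T200": ["I2", "I4"],
    "T300": ["I2", "I3"],
    "T400": ["I1", "I2", "I4"],
    "T500": ["I1", "I3"],
    "T600": ["I2", "I3"],
    "T700": ["I1", "I3"],
    "T800": ["I1", "I2", "I3", "I5"],
    "T900": ["I1", "I2", "I3"],
}
min_count = 2

# Count every non-empty subset of every transaction.
counts = Counter()
for items in database.values():
    items = sorted(items)
    for r in range(1, len(items) + 1):
        counts.update(combinations(items, r))

frequent = {s: n for s, n in counts.items() if n >= min_count}
for itemset in sorted(frequent, key=lambda s: (len(s), s)):
    print(itemset, frequent[itemset])
```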
   How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
   insert into Ck
   select p.item1, p.item2, …, p.itemk-1, q.itemk-1
   from Lk-1 p, Lk-1 q
   where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1

• Step 2: pruning
   forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
             if (s is not in Lk-1) then delete c from Ck
 Example of Generating Candidates
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
   – abcd from abc and abd
   – acde from acd and ace

• Pruning:
   – acde is removed because ade is not in L3

• C4={abcd}
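The join and prune steps can be sketched as one Python function and run on this L3. Itemsets are represented as sorted tuples so that the "same first k-2 items" test is a simple prefix comparison; the name `apriori_gen` is an illustrative choice.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build C_k from L_(k-1); itemsets are sorted tuples of items."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join: same first k-2 items, and p's last item < q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: drop c if any (k-1)-subset is missing from L_(k-1).
                if all(s in prev_set
                       for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates

L3 = [tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]]
C4 = apriori_gen(L3)
print(["".join(c) for c in C4])  # ['abcd']
```

Here acde is produced by the join (from acd and ace) but rejected by the prune check because ade is not in L3.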
             Improving Apriori’s Efficiency
• Hash-based itemset counting: A k-itemset whose corresponding
   hashing bucket count is below the threshold cannot be frequent

• Transaction reduction: A transaction that does not contain any frequent
   k-itemset is useless in subsequent scans

• Partitioning: Any itemset that is potentially frequent in DB must be
   frequent in at least one of the partitions of DB

• Sampling: mining on a subset of given data, need a lower support threshold
   + a method to determine the completeness

• Dynamic itemset counting: add new candidate itemsets immediately
   (unlike Apriori) when all of their subsets are estimated to be frequent
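As an illustration of the first idea, hash-based itemset counting, the sketch below hashes every pair seen during the first scan into a small table. It uses the nine-transaction database from the earlier problem with items I1..I5 encoded as 1..5. The hash function, the table size of 7, and the threshold of 3 (rather than the problem's 2, so that the filter removes something) are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations

database = [
    [1, 2, 5], [2, 4], [2, 3], [1, 2, 4], [1, 3],
    [2, 3], [1, 3], [1, 2, 3, 5], [1, 2, 3],
]
min_count = 3      # assumed threshold for this illustration
NUM_BUCKETS = 7    # hypothetical hash-table size

def bucket(pair):
    """Hypothetical hash function mapping a pair of item ids to a bucket."""
    x, y = pair
    return (x * 10 + y) % NUM_BUCKETS

# During the first scan, hash every pair of every transaction.
bucket_counts = Counter()
for t in database:
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

# A pair whose bucket count is below the threshold cannot be frequent.
all_pairs = list(combinations(range(1, 6), 2))
surviving = [p for p in all_pairs if bucket_counts[bucket(p)] >= min_count]
print(surviving)
```

The filter is safe but not exact: a pair can survive only because it shares a bucket with a frequent pair, so the survivors still have to be counted in the next scan.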
                Association Rule Mining
• Types with the description of
   – Multiple (Multi Level) AR from Transaction DB.
   – Multi Dimensional AR from RDB.

There are many types of AR.
  AR can be classified in various ways, based on the following criteria:
1. Based on the types of values handled in the rule:
       (Boolean AR and Quantitative AR)
# If a rule concerns associations between the presence or absence
  of items, it is a Boolean association rule.
                           AR Types

• A support of 2% for AR1 (computer → financial_management_software)
  means that 2% of all the transactions under analysis show that
  computer and financial_management_software
  are purchased together.
  A confidence of 60% for AR1
  means that 60% of the customers who purchased a computer
  also bought the software.
  Typically, association rules are considered interesting if they
  satisfy both a minimum support threshold and minimum
  confidence threshold.
  Such thresholds can be set by users or domain experts.
                           AR Types
# If a rule describes associations between quantitative items or
  attributes, then it is a Quantitative Association Rule.
• In these rules, quantitative values for items or attributes are
  partitioned into intervals.
• Association Rule 2 (AR2) below is an example of a quantitative
  association rule.

• Note that the quantitative attributes, age and income, have
  been discretized.
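Discretization of this kind might be sketched as follows. The fixed interval width of 10, the attribute names, and the sample customer record are hypothetical; real systems often choose interval boundaries from the data instead.

```python
def to_interval(value, width):
    """Map a number to the label of its fixed-width interval."""
    low = (value // width) * width
    return f"{low}..{low + width - 1}"

# Hypothetical customer record; income is in thousands.
customer = {"age": 34, "income_k": 45}

# Each quantitative attribute becomes a Boolean "item" naming its interval.
items = {
    f"age={to_interval(customer['age'], 10)}",
    f"income={to_interval(customer['income_k'], 10)}K",
}
print(sorted(items))  # ['age=30..39', 'income=40..49K']
```

Once every quantitative value is replaced by an interval item like this, the same Boolean itemset machinery can be applied.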
                         Multi Dimensional AR
2. Based on the dimensions of data involved in the rule
          (Single D AR and Multi D AR)
# If the items or attributes in an association rule each reference only one
  dimension, then it is a single dimensional association rule.
   Note that AR 1 could be rewritten as..

•   AR 1 is a single-dimensional association rule since it refers to only one dimension
    i.e. buys.
#   If a rule references two or more dimensions,
    such as the dimensions buys, time of transaction and customer category,
    then it is a multidimensional association rule.
    AR 2 is a multidimensional association rule since it involves three dimensions:
    age, income and buys.
                                 Multi Level AR
3. Based on the levels of abstractions involved in the rule
         (Single Level AR and Multi Level AR)
• Some methods for association rule mining can find rules at different levels of
  abstraction.
• For example:
  Suppose that a set of mined association rules includes AR 3 and AR 4 below.

• In AR3 and AR4, the items bought are referenced at different levels of
  abstraction (i.e. “computer” is a higher-level abstraction of “laptop computer”).
• We refer to the rule set mined as consisting of multilevel association rules.

• If, instead, the rules within a given set do not reference items or attributes at
  different levels of abstraction, then the set contains single-level association
  rules.
4. Based on the nature of the association involved in the rule:

   Association mining can be extended to correlation analysis
   where the absence or presence of correlated items can be
