

									                 Homework 3: Association Rule Mining
                              Release date: October 28, 2003
                               Due date: November 6, 2003



1. Describe the differences between Association rule mining and Classification in terms of data
   used and results of learning. (10 points)
  Answer:

   (a) Classification algorithms require a class attribute, while association rule mining
       algorithms do not. In fact, association rule algorithms can treat the class
       attribute as just another attribute. (5 points)
   (b) Rules from classification algorithms have only the class attribute on the RHS,
       while in rules from association rule algorithms the class attribute can appear on either
       side, and other attributes can also appear on the RHS. (5 points)

2. The following contingency table summarizes supermarket transaction data, where hotdogs
   refers to the transactions containing hot dogs, ¬hotdogs refers to the transactions that do
   not contain hot dogs, hamburgers refers to the transactions containing hamburgers, and
   ¬hamburgers refers to the transactions that do not contain hamburgers. (10 points)

                                         hotdogs   ¬hotdogs    row
                          hamburgers      2000        500     2500
                         ¬hamburgers      1000       1500     2500
                                 col      3000       2000     5000


  Suppose the association rule (hotdogs ⇒ hamburgers) is mined. Given a minimum support
  threshold of 25% and a minimum confidence threshold of 50%, is this association rule strong?
  Answer:
  Support(hotdogs ⇒ hamburgers) = P(hotdogs ∧ hamburgers) = 2000/5000 = 40%.
  Confidence(hotdogs ⇒ hamburgers) = P(hamburgers | hotdogs)
     = P(hotdogs ∧ hamburgers)/P(hotdogs) = 0.4/0.6 = 66.67%. (5 points)
  The rule's support (40%) and confidence (66.67%) both exceed the minimum support
  threshold of 25% and the minimum confidence threshold of 50%, so this association rule is
  strong. (5 points)
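
  The arithmetic above can be checked with a short Python sketch. The counts come straight
  from the contingency table; the variable and function names are illustrative assumptions,
  not part of the assignment.

```python
# Counts read from the contingency table in Question 2.
n_total = 5000    # all transactions
n_hotdogs = 3000  # transactions containing hot dogs
n_both = 2000     # transactions containing both hot dogs and hamburgers

MIN_SUPPORT = 0.25
MIN_CONFIDENCE = 0.50


def support(n_joint, n_all):
    """Fraction of all transactions that contain every item in the rule."""
    return n_joint / n_all


def confidence(n_joint, n_antecedent):
    """Fraction of antecedent transactions that also contain the consequent."""
    return n_joint / n_antecedent


sup = support(n_both, n_total)
conf = confidence(n_both, n_hotdogs)
is_strong = sup >= MIN_SUPPORT and conf >= MIN_CONFIDENCE

print(f"support={sup:.2%}, confidence={conf:.2%}, strong={is_strong}")
# prints: support=40.00%, confidence=66.67%, strong=True
```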

3. Use the hash-based technique introduced in class to create the candidate 3-itemsets (C3) for
   the following transactions. Use hash index = 7 and min supp = 50%. (10 points)

                                      TID       items
                                      T100    {1, 3, 4}
                                      T200    {2, 3, 5}
                                      T300   {1, 2, 3, 5}
                                      T400     {2, 5}
  Answer: Let the order of the items be 1, 2, 3, 4, 5. The hashing function h3 is
  h3 (x, y, z) = ((order of x)*10^2 + (order of y)*10 + (order of z)) mod 7.
  h3 (1, 2, 3) = ((1)*100 + (2)*10 + (3)) mod 7 = 123 mod 7 = bin 4.
  h3 (1, 2, 4) = ((1)*100 + (2)*10 + (4)) mod 7 = 124 mod 7 = bin 5.
  h3 (1, 2, 5) = ((1)*100 + (2)*10 + (5)) mod 7 = 125 mod 7 = bin 6.
  h3 (1, 3, 4) = ((1)*100 + (3)*10 + (4)) mod 7 = 134 mod 7 = bin 1.
  h3 (1, 3, 5) = ((1)*100 + (3)*10 + (5)) mod 7 = 135 mod 7 = bin 2.
  h3 (1, 4, 5) = ((1)*100 + (4)*10 + (5)) mod 7 = 145 mod 7 = bin 5.
  h3 (2, 3, 4) = ((2)*100 + (3)*10 + (4)) mod 7 = 234 mod 7 = bin 3.
  h3 (2, 3, 5) = ((2)*100 + (3)*10 + (5)) mod 7 = 235 mod 7 = bin 4.
  h3 (2, 4, 5) = ((2)*100 + (4)*10 + (5)) mod 7 = 245 mod 7 = bin 0.
  h3 (3, 4, 5) = ((3)*100 + (4)*10 + (5)) mod 7 = 345 mod 7 = bin 2.

               bin#0   bin#1   bin#2         bin#3   bin#4      bin#5   bin#6
    itemsets   {245}   {134}   {135}         {234}   {235}      {124}   {125}
                               {345}                 {123}      {145}
     counts      0        1       1            0        3          0      1     (5 points)

  Since min supp is 50% and there are 4 transactions, the min supp count is 2. The only bin
  with a count of at least 2 is bin#4, so C3 is [{1,2,3}, {2,3,5}]. (5 points)
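
  The bin counting above can be reproduced with a minimal Python sketch. It assumes, as in
  the answer, that each item's order equals its numeric value; the names h3, bin_counts, and
  c3 are chosen here for illustration.

```python
from itertools import combinations

transactions = {
    "T100": [1, 3, 4],
    "T200": [2, 3, 5],
    "T300": [1, 2, 3, 5],
    "T400": [2, 5],
}
NUM_BINS = 7
MIN_SUP_COUNT = 2  # 50% of 4 transactions


def h3(x, y, z):
    """Hash a 3-itemset to a bin; item order is taken to be the item value."""
    return (x * 100 + y * 10 + z) % NUM_BINS


# Hash every 3-itemset contained in each transaction and count hits per bin.
bin_counts = [0] * NUM_BINS
for items in transactions.values():
    for triple in combinations(sorted(items), 3):
        bin_counts[h3(*triple)] += 1

# A 3-itemset stays a candidate only if its bin reaches the minimum count.
all_items = sorted({i for items in transactions.values() for i in items})
c3 = [t for t in combinations(all_items, 3)
      if bin_counts[h3(*t)] >= MIN_SUP_COUNT]

print(bin_counts)  # [0, 1, 1, 0, 3, 0, 1]
print(c3)          # [(1, 2, 3), (2, 3, 5)]
```

  Note that {1,2,3} survives only because it shares bin#4 with {2,3,5}; hashing prunes
  candidates but does not replace the later support count.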

4. A database has 4 transactions as shown. Let min sup = 60% and min conf = 80%. (In
   text, Pg 272). (10 points)

                                     TID      items bought
                                     T100     {K, A, D, B}
                                     T200    {D, A, C, E, B}
                                     T300     {C, A, B, E}
                                     T400       {B, A, D}

  a) Find ALL frequent itemsets for the database using the Apriori algorithm. (5 points)
      Answer: Since min sup = 60% and there are 4 transactions, the min sup count is 3
      (0.6 × 4 = 2.4, rounded up). The frequent itemsets of various sizes are as follows:

                        1-itemsets       2-itemsets        3-itemsets
                       {A} 4 times     {A,B} 4 times    {A,B,D} 3 times
                       {B} 4 times     {A,D} 3 times
                       {D} 3 times     {B,D} 3 times

       The set of all frequent itemsets is { {A},{B},{D},{A,B},{A,D},{B,D},{A,B,D} }
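
      As a cross-check on the counts, here is a minimal level-wise Apriori sketch in Python.
      It is a simplified illustration under the assumptions stated in the comments, not the
      textbook's pseudocode; the function names count and apriori are chosen here.

```python
from itertools import combinations
from math import ceil

transactions = [
    {"K", "A", "D", "B"},       # T100
    {"D", "A", "C", "E", "B"},  # T200
    {"C", "A", "B", "E"},       # T300
    {"B", "A", "D"},            # T400
]
min_count = ceil(0.60 * len(transactions))  # 2.4 rounded up to 3


def count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_count):
    """Return {frequent itemset: support count}, built level by level."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if count(frozenset([i]), transactions) >= min_count]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = count(s, transactions)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must be frequent, then check support.
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent
                        for sub in combinations(c, k - 1))
                 and count(c, transactions) >= min_count]
    return frequent


frequent = apriori(transactions, min_count)
for itemset, n in sorted(frequent.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

      Running this lists seven frequent itemsets; D occurs in only three transactions, so
      no itemset containing D can have a count above 3.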
  b) List all strong association rules (with their support and confidence) matching the
      following metarule, where X is a variable representing the customers, and item_i denotes
      the variables representing the items (e.g., "A", "B", etc.). (5 points)
                     buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3)
      Answer: To generate rules matching the above metarule, we need frequent 3-itemsets.
      We have only one frequent 3-itemset, which is {A,B,D}. The rules are:
                     Rule            Support      Confidence        Strong
                   A∧B ⇒ D              75%          75%            NO
                   A∧D ⇒ B              75%          100%           YES
                   B∧D ⇒ A              75%          100%           YES
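
      The rule table can be derived mechanically. The Python sketch below enumerates the
      three rules with a 2-item antecedent from {A,B,D} and tests each against min conf = 80%;
      the helper name support_count and the tuple layout of rules are illustrative
      assumptions.

```python
from itertools import combinations

transactions = [
    {"K", "A", "D", "B"},       # T100
    {"D", "A", "C", "E", "B"},  # T200
    {"C", "A", "B", "E"},       # T300
    {"B", "A", "D"},            # T400
]
MIN_CONF = 0.80


def support_count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


itemset = frozenset({"A", "B", "D"})  # the only frequent 3-itemset
rules = []
for pair in combinations(sorted(itemset), 2):
    antecedent = frozenset(pair)
    (consequent,) = itemset - antecedent  # the single remaining item
    sup = support_count(itemset) / len(transactions)
    conf = support_count(itemset) / support_count(antecedent)
    rules.append((pair, consequent, sup, conf, conf >= MIN_CONF))

for pair, consequent, sup, conf, strong in rules:
    print(f"{pair[0]}∧{pair[1]} ⇒ {consequent}: "
          f"support={sup:.0%}, confidence={conf:.0%}, strong={strong}")
# prints:
# A∧B ⇒ D: support=75%, confidence=75%, strong=False
# A∧D ⇒ B: support=75%, confidence=100%, strong=True
# B∧D ⇒ A: support=75%, confidence=100%, strong=True
```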

								