Mining Association Rules with Constraints


           Wei Ning
          Joon Wong
     COSC 6412 Presentation

Outline
   Introduction
   Summary of Approach
   Algorithm CAP
   Performance Analysis
   Conclusion
   References


Introduction
   Recall mining association rules

   Association rule mining finds
    interesting associations or correlations
    among the items of a large data set.

Some problems we meet when
mining association rules
   Overwhelming: too many rules?
   Not what you want?
   Waiting too long?
   Lack of focus

Introduction(cont.)
   Example at Walmart
   Suppose a manager wants to find which
    shoes are the most popular in winter.

Mining frequent itemsets vs.
Mining association rules
   Mining frequent itemsets is the core step of
    mining association rules: once the frequent
    itemsets are known, the rules are derived
    directly from them

Constrained Mining
   A naive solution
        First find all frequent sets, and then test
         them for constraint satisfaction
   Our approach:
        Analyze the properties of constraints
         comprehensively
        Push them as deeply as possible
         inside the frequent pattern
         computation

Frequent Itemsets &
Constraints
   Given a transaction database TDB (min_sup = 2)

       TID    Transaction
       10     a, b, c
       20     b, c, d, f
       30     a, c

   Frequent itemset: a subset of items that frequently
    appear together in transactions, e.g. {a, c}

       Item   Value
       a       40
       b       10
       c      -20
       d       10
       e      -30

   Constraint: a predicate over itemsets
       C(I): sum(I) > 50
       C(abd) = true

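The two notions on this slide can be checked with a minimal Python sketch. The names `TDB`, `VALUE`, `support`, `is_frequent`, and `satisfies_c` are illustrative and not from the slides. Only the rows and values that survive on the slide are used, so item f has no value and is not passed to the constraint.

```python
# Transaction database and item values copied from the slide.
TDB = {10: {"a", "b", "c"}, 20: {"b", "c", "d", "f"}, 30: {"a", "c"}}
VALUE = {"a": 40, "b": 10, "c": -20, "d": 10, "e": -30}
MIN_SUP = 2


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in TDB.values() if set(itemset) <= items)


def is_frequent(itemset):
    """An itemset is frequent when its support reaches min_sup."""
    return support(itemset) >= MIN_SUP


def satisfies_c(itemset):
    """The slide's constraint C(I): sum(I) > 50."""
    return sum(VALUE[i] for i in itemset) > 50


print(support({"a", "c"}), is_frequent({"a", "c"}))  # 2 True
print(satisfies_c({"a", "b", "d"}))                  # True (40 + 10 + 10 = 60)
```

Frequency and constraint satisfaction are independent tests: {a, c} is frequent, while abd satisfies C without being frequent in this small database.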
Mining Frequent Itemsets With
Constraints
   Given
       A transaction database TDB
       A support threshold min_sup
       A constraint C
   Find the complete set of frequent itemsets
    satisfying the constraint
   Use constraints to
       Express the user’s focus
       Improve both effectiveness and efficiency

    Classification of Constraints
   We have the following classification of
    constraints
        Anti-monotone
        Monotone
        Succinct
        Convertible
             Convertible anti-monotone
             Convertible monotone
             Strongly convertible
        Inconvertible
Anti-Monotone
   Definition 1 (Anti-Monotone): A 1-var
    constraint C is anti-monotone if for all
    sets S, S’:
    S ⊇ S’ & S satisfies C ⇒ S’ satisfies C.
   Simply put, when an itemset S violates
    the constraint, so does every one of its
    supersets

Is min(S) ≥ v anti-monotone?
   S = {5, 10, 14}, v = 7
       min(S) ≥ 7
   {5} violates it.
   Supersets of {5}: {5, 10}, {5, 14}, {5, 10, 14}
   So do {5, 10}, {5, 14}, {5, 10, 14}
   min(S) ≥ v is anti-monotone

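The example can be checked by brute force. The sketch below enumerates all non-empty subsets of {5, 10, 14} and tests whether every superset of a violating set also violates the constraint. The helper names are illustrative, and a check on one small set only illustrates the definition; it is not a proof.

```python
from itertools import combinations


def nonempty_subsets(items):
    """All non-empty subsets of `items`, as frozensets."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def is_anti_monotone_on(items, constraint):
    """True if every superset of a violating subset also violates."""
    subsets = list(nonempty_subsets(items))
    for s in subsets:
        if constraint(s):
            continue  # only violating sets matter for the check
        for t in subsets:
            if s <= t and constraint(t):
                return False  # a superset of a violator satisfies C
    return True


S = {5, 10, 14}
v = 7
print(is_anti_monotone_on(S, lambda s: min(s) >= v))  # True
print(is_anti_monotone_on(S, lambda s: min(s) <= v))  # False
```

The second call shows the contrast: {10} violates min(S) ≤ 7, but its superset {5, 10} satisfies it.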
Succinct
   Definition 2 (Succinct)
       I ⊆ Item is a succinct set if it can be expressed as
        σ_p(Item) for some selection predicate p.
       SP ⊆ 2^Item is a succinct powerset if there is a fixed
        number of succinct sets Item_1, …, Item_k ⊆ Item
        such that SP can be expressed in terms of the
        strict powersets of Item_1, …, Item_k, using union
        and minus.
       Finally, a 1-var constraint C is succinct provided
        SAT_C(Item) is a succinct powerset.

Succinct
   General idea: we can enumerate all and
    only those sets that are guaranteed to
    satisfy the constraint.
   If a constraint is succinct, we can
    directly generate precisely the sets that
    satisfy it.



Succinct example
    Itemset containing a or b

    Itemset containing some item with value
     more than 30




Succinct example
   C1 ≡ Item.Price ≥ 100
       Item_1 = σ_{Item.Price ≥ 100}(Item) = {a, b}
       2^Item_1 = {{a}, {b}, {a, b}}
       SAT_C1 = {{a}, {b}, {a, b}}
       SAT_C1 = 2^Item_1
       C1 is succinct

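The point of succinctness is that the satisfying sets can be generated directly rather than tested one by one. A minimal sketch follows. The `PRICE` table is hypothetical, since the slide only states that a and b are the selected items, and the comparison `>= 100` is an assumption about the selection predicate.

```python
from itertools import combinations

# Hypothetical prices, chosen so that only a and b pass the selection.
PRICE = {"a": 120, "b": 100, "c": 60, "d": 15}


def select(predicate):
    """sigma_p(Item): the items whose price passes the selection predicate."""
    return {item for item, price in PRICE.items() if predicate(price)}


def strict_powerset(items):
    """All non-empty subsets of `items`, each as a sorted list."""
    items = sorted(items)
    return [list(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)]


item1 = select(lambda price: price >= 100)  # assumed predicate
sat_c1 = strict_powerset(item1)
print(sorted(item1))  # ['a', 'b']
print(sat_c1)         # [['a'], ['b'], ['a', 'b']]
```

No itemset outside `sat_c1` is ever generated, which is what "all and only those sets" means in practice.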
Convertible
   Convert tough constraints into anti-monotone
    or monotone ones by ordering the items
    properly

Convertible
   Definition:
   R is an order of items
   Convertible anti-monotone
       Itemset X satisfies the constraint ⇒ so does
        every prefix of X w.r.t. R

Convertible example
   Constraint C: avg(X) ≥ 25
       Order items in value-descending order
            <a, f, g, d, b, h, c, e>
       Itemset afd satisfies C
            So do prefixes a and af
            Thus, it becomes anti-monotone!

       Original        Value-descending
       Item   Value    Item   Value
       a       40      a       40
       b        0      f       30
       c      -20      g       20
       d       10      d       10
       e      -30      b        0
       f       30      h      -10
       g       20      c      -20
       h      -10      e      -30

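The prefix property in this example can be verified directly. The sketch below uses the item values from the slide; the names `VALUE`, `R`, `avg`, and `prefixes` are illustrative.

```python
# Item values copied from the slide.
VALUE = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}

# R: the items in value-descending order, i.e. <a, f, g, d, b, h, c, e>.
R = sorted(VALUE, key=lambda item: -VALUE[item])


def avg(items):
    """Average value of a non-empty list of items."""
    return sum(VALUE[i] for i in items) / len(items)


def prefixes(itemset):
    """Prefixes of `itemset` once its items are listed in the order R."""
    ordered = [i for i in R if i in itemset]
    return [ordered[:k] for k in range(1, len(ordered) + 1)]


print(R)
for p in prefixes({"a", "f", "d"}):
    print(p, avg(p) >= 25)
# ['a'] True
# ['a', 'f'] True
# ['a', 'f', 'd'] True
```

Under a value-descending order each added item can only lower the average, so once a prefix fails avg(X) ≥ 25 every extension of it fails too.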
Constraints: A General
Picture
   Constraint                      Antimonotone   Monotone      Succinct
   v ∈ S                           no             yes           yes
   S ⊇ V                           no             yes           yes
   S ⊆ V                           yes            no            yes
   min(S) ≤ v                      no             yes           yes
   min(S) ≥ v                      yes            no            yes
   max(S) ≤ v                      yes            no            yes
   max(S) ≥ v                      no             yes           yes
   count(S) ≤ v                    yes            no            weakly
   count(S) ≥ v                    no             yes           weakly
   sum(S) ≤ v (∀a ∈ S, a ≥ 0)      yes            no            no
   sum(S) ≥ v (∀a ∈ S, a ≥ 0)      no             yes           no
   range(S) ≤ v                    yes            no            no
   range(S) ≥ v                    no             yes           no
   avg(S) θ v, θ ∈ {=, ≤, ≥}       convertible    convertible   no

Optional Proof that min(S) ≥ v is
Anti-monotone
   According to the table, min(S) ≥ v is
    both anti-monotone and succinct.
   Only anti-monotonicity is proved here,
    due to time limits.

   Something special…

Constraint Classification
   [Figure: diagram relating the constraint classes
    Antimonotone, Monotone, Succinct, Convertible
    anti-monotone, Convertible monotone, Strongly
    convertible, and Inconvertible]

Summary of Approach
Recapitulation
   The basic idea of mining frequent
    itemsets with constraints.
   Several important classes of constraints.

Algorithms
   There are many algorithms for
    constraint-based association rule
    mining.
       Algorithm Direct
       Algorithm MultiJoins & Reorder
       Algorithm Apriori†
       Algorithm Hybrid(m)
       Algorithm CAP (Main Focus)

Design of Algorithm
   Sound
       An algorithm is sound provided it only finds
        frequent sets that satisfy the given
        constraints.

   Complete
       An algorithm is complete provided all
        frequent sets satisfying the given
        constraints are found.

Algorithm Apriori†
   Main idea: Use the Apriori algorithm to get
    the frequent itemsets. Then apply the
    constraints to the itemsets found.

       Step 1) Apriori with C_freq (the frequency constraint)
       Step 2) Apply C – C_freq to get the final Ans

Algorithm Apriori† (Pseudocode)
1. C_1 consists of sets of size 1; k = 1; Ans = ∅;
2. While (C_k not empty) {
       2.1 conduct db scan to form L_k from C_k;
       2.2 form C_{k+1} from L_k based on C_freq; k++; }
3. For each set S in some L_k:
      Add S to Ans if S satisfies (C – C_freq).

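The pseudocode can be sketched in Python as follows. This is a minimal illustration, not the presenters' C implementation. It assumes the four-transaction example database and a minimum support count of 2 used in the slides' examples, and the function names are made up for the sketch.

```python
from itertools import combinations

# Example database from the slides; min support count assumed to be 2.
TDB = {10: {"A", "C", "D"}, 20: {"B", "C", "E"},
       30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
MIN_SUP = 2


def count(candidates):
    """One database scan: support count of each candidate set."""
    return {c: sum(1 for items in TDB.values() if c <= items) for c in candidates}


def next_candidates(frequent, k):
    """Join frequent k-sets; keep (k+1)-sets whose k-subsets are all frequent."""
    joined = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k))}


def apriori_dagger(constraint):
    """Find all frequent sets first, then filter them by the constraint."""
    candidates = {frozenset([i]) for items in TDB.values() for i in items}
    k, levels = 1, []
    while candidates:
        frequent = {c for c, n in count(candidates).items() if n >= MIN_SUP}
        levels.append(frequent)
        candidates = next_candidates(frequent, k)
        k += 1
    return [s for level in levels for s in level if constraint(s)]


ans = apriori_dagger(lambda s: s <= {"A", "C", "E"})
print(sorted(sorted(s) for s in ans))
# [['A'], ['A', 'C'], ['C'], ['C', 'E'], ['E']]
```

The constraint is consulted only in the last line of `apriori_dagger`, after every frequent set has already been counted, which is exactly the cost CAP tries to avoid.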
The Apriori† Algorithm: An Example
   Database TDB
       Tid    Items
       10     A, C, D
       20     B, C, E
       30     A, B, C, E
       40     B, E

   1st scan
       C1  Itemset  sup        L1  Itemset  sup
           {A}      2              {A}      2
           {B}      3              {B}      3
           {C}      3              {C}      3
           {D}      1              {E}      3
           {E}      3

   2nd scan
       C2  Itemset  sup        L2  Itemset  sup
           {A, B}   1              {A, C}   2
           {A, C}   2              {B, C}   2
           {A, E}   1              {B, E}   3
           {B, C}   2              {C, E}   2
           {B, E}   3
           {C, E}   2

   3rd scan
       C3  Itemset             L3  Itemset    sup
           {B, C, E}               {B, C, E}  2

The Apriori† Algorithm: An Example
(cont.)
   Constraint: {A, C, E} ⊇ T.Item

   Database TDB
       Tid    Items
       10     A, C, D
       20     B, C, E
       30     A, B, C, E
       40     B, E

   Frequent sets
       L1  Itemset  sup
           {A}      2
           {B}      3
           {C}      3
           {E}      3
       L2  Itemset  sup
           {A, C}   2
           {B, C}   2
           {B, E}   3
           {C, E}   2
       L3  Itemset    sup
           {B, C, E}  2

   Ans
       {A}, {C}, {E}, {A, C}, {C, E}

Algorithm CAP
   Succinct and Anti-monotone
       Strategy I: Replace C_1 in the Apriori Algorithm by
        C_1^C.

   Anti-monotone but non-succinct
       Strategy II: Define C_k as in the Apriori Algorithm.
        Drop a set S ∈ C_k from counting if S fails C, i.e.,
        constraint satisfaction is tested before counting is
        done.

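Strategy II can be sketched as an Apriori loop that filters candidates by the constraint before the database scan. The sketch assumes the four-transaction example database from the slides and a minimum support count of 2. The `PRICE` values and the bound `MAX_SUM` are hypothetical, chosen only to give an anti-monotone constraint sum(S) ≤ v over non-negative values.

```python
from itertools import combinations

TDB = {10: {"A", "C", "D"}, 20: {"B", "C", "E"},
       30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
MIN_SUP = 2
PRICE = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50}  # hypothetical values
MAX_SUM = 60                                           # hypothetical bound


def satisfies(itemset):
    """Anti-monotone constraint: sum of (non-negative) prices <= MAX_SUM."""
    return sum(PRICE[i] for i in itemset) <= MAX_SUM


def support(itemset):
    return sum(1 for items in TDB.values() if itemset <= items)


def cap_strategy_two():
    candidates = {frozenset([i]) for items in TDB.values() for i in items}
    k, ans, counted = 1, [], 0
    while candidates:
        # Strategy II: test the constraint before counting support.
        candidates = {c for c in candidates if satisfies(c)}
        counted += len(candidates)
        frequent = {c for c in candidates if support(c) >= MIN_SUP}
        ans.extend(frequent)
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        candidates = {c for c in joined
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        k += 1
    return ans, counted


ans, counted = cap_strategy_two()
print(sorted(sorted(s) for s in ans))
# [['A'], ['A', 'C'], ['B'], ['B', 'C'], ['C'], ['E']]
print(counted)  # 9
```

Because the constraint is anti-monotone, dropping {B, E} and {C, E} before counting loses nothing: no superset of a violating set can satisfy the constraint.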
Algorithm CAP (cont.)
   Succinct but non-anti-monotone
       Strategy III: Too complicated. To be discussed
        later…

   Non-succinct & non-anti-monotone
       Strategy IV: Induce any weaker constraint C1 from
        C. Depending on whether C1 is anti-monotone
        and/or succinct, use one of the Strategies I-III
        above for the generation of frequent sets.

Algorithm CAP (Pseudocode)
1. if C_sam ∪ C_suc ∪ C_none is non-empty, prepare C_1 as indicated in
    Strategies I, III, and IV; k = 1;
2. if C_suc is non-empty {
    2.1 conduct db scan to form L_1 as indicated in Strategy III;
    2.2 form C_2 as indicated in Strategy III; k = 2; }
3. while (C_k not empty) {
    3.1 conduct db scan to form L_k from C_k;
    3.2 form C_{k+1} from L_k based on Strategy III if C_suc is non-empty,
          and Strategy II for constraints in C_am; }
4. if C_none is empty, Ans = ∪ L_k. Otherwise, for each set S in some
    L_k, add S to Ans iff S satisfies C_none.

The Algorithm CAP: An Example

 Constraints: {A, C, E} ⊇ T.Item & min support count = 2
 Question: Which strategy should we apply?

                      Database TDB
                       Tid    Items
                       10     A, C, D
                       20     B, C, E
                       30    A, B, C, E
                       40       B, E
The Algorithm CAP: An Example
(Cont.)
   Apply Strategy I!!!

   Database TDB
       Tid    Items
       10     A, C, D
       20     B, C, E
       30     A, B, C, E
       40     B, E

   1st scan
       C1  Itemset  sup        L1  Itemset  sup
           {A}      2              {A}      2
           {C}      3              {C}      3
           {E}      3              {E}      3

   2nd scan
       C2  Itemset  sup        L2  Itemset  sup
           {A, C}   2              {A, C}   2
           {A, E}   1              {C, E}   2
           {C, E}   2

   C3  Itemset
           {}       (because {A, E} is pruned earlier)

   Ans
       {A}, {C}, {E}, {A, C}, {C, E}

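The Strategy I run above can be reproduced with a short sketch. It restricts the size-1 candidates to the items allowed by the constraint and then proceeds as plain Apriori, also tallying how many candidate sets are counted against the database. The function and variable names are illustrative.

```python
from itertools import combinations

TDB = {10: {"A", "C", "D"}, 20: {"B", "C", "E"},
       30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
MIN_SUP = 2
ALLOWED = {"A", "C", "E"}  # the constraint {A, C, E} ⊇ T.Item


def support(itemset):
    return sum(1 for items in TDB.values() if itemset <= items)


def cap_strategy_one():
    """Apriori whose initial candidates are only the items satisfying C."""
    candidates = {frozenset([i]) for items in TDB.values()
                  for i in items if i in ALLOWED}
    k, ans, counted = 1, [], 0
    while candidates:
        counted += len(candidates)
        frequent = {c for c in candidates if support(c) >= MIN_SUP}
        ans.extend(frequent)
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        candidates = {c for c in joined
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        k += 1
    return ans, counted


ans, counted = cap_strategy_one()
print(sorted(sorted(s) for s in ans))  # [['A'], ['A', 'C'], ['C'], ['C', 'E'], ['E']]
print(counted)                         # 6
```

By the same tally, the Apriori† example counts 5 + 6 + 1 = 12 candidate sets to reach the same answer.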
Case 3: Succinct but not anti-
monotone. Revisit…
   Items: {1} {2} {3} {4} {5} {6} {7} {8} {9} {10}
   Constraint: min(S) < 5
   Items satisfying the constraint: {1} {2} {3} {4}
   Apriori on these items only:
       {1} {2} {3} {4}
       {1,2} {2,3} … {3,4}
       …
       {1,2,3,4}
   Some possible frequent sets may
    be lost: e.g. {1,8} {1,2,10}

**Information extracted from past presentation.

Case 3: Succinct but not anti-
monotone. Continue…
   Algorithm Direct
       Idea: Play it safe. Generate C^c_{k+1} by using
        L^c_k × F, where F is the set of all frequent
        items.

       Algorithm MultiJoins
       Algorithm Reorder
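The candidate generation of Algorithm Direct can be sketched for the min(S) < 5 example. The sketch assumes, purely for illustration, that all ten items are frequent; `direct_candidates` is a made-up name, not an identifier from the paper.

```python
def direct_candidates(l_c_k, frequent_items):
    """C^c_{k+1}: extend each constrained frequent k-set by one frequent item."""
    return {s | {f} for s in l_c_k for f in frequent_items if f not in s}


F = set(range(1, 11))                         # assumption: items 1..10 are all frequent
L_c_1 = {frozenset([i]) for i in F if i < 5}  # singletons satisfying min(S) < 5
C_c_2 = direct_candidates(L_c_1, F)

print(frozenset({1, 8}) in C_c_2)  # True
print(frozenset({5, 8}) in C_c_2)  # False
print(len(C_c_2))                  # 30
```

Extending with every frequent item, not only the items that satisfy the constraint, is what keeps sets such as {1, 8} from being lost.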


Performance Analysis
(Specification)
   Programs written in C
   Transactional databases generated with the
    program from IBM Almaden Research
    Center
   100,000 records, domain of 1,000 items
   Page size 4 KB
   SPARC-10 environment

Performance Analysis
(Terminology)
   Speedup
       Ratio of the execution times of two
        algorithms.

   Item Selectivity
       x% of the items satisfy the constraints.

   Support Threshold
       A low support threshold means more frequent sets
        to process.

Performance Analysis
                  Note: Support threshold
                   set at 0.5%.

                  For 10% selectivity,
                   CAP runs 80 times
                   faster than Apriori†!

                  For 30% selectivity, the
                   speedup is about 10
                   times.

Performance Analysis
                  Note: Item Selectivity
                   fixed at 30%.

                  As the support threshold goes
                   up, the number of frequent itemsets
                   goes down, and Apriori†
                   improves.

                  CAP is still at least 8 times
                   faster.

Performance Analysis
   Support   L1        L2       L3        L4       L5      L6      L7      L8
   0.2%      174/582   79/969   29/1140   8/1250   1/934   0/451   0/132   0/20
   0.6%      98/313    1/12     0/1       0        0       0       0       0

   Each entry is of the form a/b
          a is the number of frequent sets satisfying the constraint.
          b is the total number of frequent sets.
   For L4 with a support of 0.2%, Apriori† finds 1250
    frequent sets, of which only 8 satisfy the constraint
    and are found by CAP.

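To see how much of Apriori†'s output is discarded overall, the table entries can be summed per support level. The sketch below copies the a/b pairs from the table; a bare 0 entry is read as (0, 0), which is an assumption about the table's notation.

```python
# (a, b) pairs for L1..L8, copied from the table; a bare 0 is read as (0, 0).
TABLE = {
    "0.2%": [(174, 582), (79, 969), (29, 1140), (8, 1250),
             (1, 934), (0, 451), (0, 132), (0, 20)],
    "0.6%": [(98, 313), (1, 12), (0, 1), (0, 0),
             (0, 0), (0, 0), (0, 0), (0, 0)],
}


def totals(rows):
    """Total satisfying sets and total frequent sets across all levels."""
    satisfying = sum(a for a, _ in rows)
    frequent = sum(b for _, b in rows)
    return satisfying, frequent


for support_level, rows in TABLE.items():
    a, b = totals(rows)
    print(f"{support_level}: {a} of {b} frequent sets satisfy the constraint ({a / b:.1%})")
# 0.2%: 291 of 5478 frequent sets satisfy the constraint (5.3%)
# 0.6%: 99 of 326 frequent sets satisfy the constraint (30.4%)
```

At the lower support threshold a much smaller share of the frequent sets survives the constraint, which matches the larger gap between the two algorithms there.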
Conclusion
   The ideas of anti-monotonicity,
    succinctness, and convertibility are
    introduced in the paper.

   Sound, complete, and efficient
    algorithms are introduced for
    constraint-based association rule mining.

References
   R. Srikant, Q. Vu, and R. Agrawal. Mining association
    rules with item constraints. KDD’97.

   R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang.
    Exploratory mining and pruning optimizations of
    constrained associations rules. SIGMOD’98.

   J. Pei and J. Han. Can we push more constraints into
    frequent pattern mining? KDD’00.

