Data Mining and Exploration: Association Rules

      Amos Storkey, School of Informatics

      February 7, 2006

  These lecture slides are based extensively on
  previous versions of the course written by Chris

Association Rules

      Itemsets, association rules
      Frequency, accuracy
      APRIORI algorithm
      Comments on Association Rules

  Reading: HMS chapter 13
  Additional reading: Witten and Frank §4.5, Han and Kamber §6.1, 6.2

About Association Rules

      We are looking for patterns, i.e. local regularities in the data
      Examples of frequent itemsets, association rules
           10% of supermarket customers buy wine and cheese
           If a person visits the CNN website, there is a 60% chance
           that they will visit the ABC website in the same month
      Association rules are like classification rules, except that they
      can predict any attribute, not just the class
      Association rules are not intended to be used together as a set
      (cf. classification rules)

      Example of association rules: market basket analysis, the
      process of analyzing customer buying habits by finding
      associations between items that customers place in their
      “shopping baskets”
      Each row of the data matrix has a 1 if the corresponding
      product was in the basket. Data is often sparse
      Can recode k-valued categorical variables (e.g. outlook =
      {sunny, overcast, rainy}) as k binary variables
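The recoding of a k-valued categorical variable as k binary variables can be sketched as follows. This is a minimal illustration, not code from the course: the helper name `one_hot` and the four sample records are assumptions made up for the example.

```python
# Hypothetical helper (not from the slides): recode categorical records as
# binary indicator rows, with one column per (variable, value) pair.

def one_hot(records):
    """Return (columns, rows) where rows[r][c] is 1 if record r has item c."""
    # Collect every (variable, value) pair that occurs, in a stable order.
    columns = sorted({(var, val) for rec in records for var, val in rec.items()})
    rows = [
        [1 if rec.get(var) == val else 0 for var, val in columns]
        for rec in records
    ]
    return columns, rows


# Illustrative records only; the values mirror the outlook example above.
records = [
    {"outlook": "sunny"},
    {"outlook": "overcast"},
    {"outlook": "rainy"},
    {"outlook": "sunny"},
]

columns, rows = one_hot(records)
print(columns)
for row in rows:
    print(row)
```

Each record sets exactly one of the k = 3 outlook columns, so the result has the same 0/1 form as a market basket matrix.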
Itemsets, Frequency, Accuracy

         An itemset is a pattern defined by

                 (A_{i_1} = a_{j_1}) ∧ (A_{i_2} = a_{j_2}) ∧ ... ∧ (A_{i_k} = a_{j_k})

         The frequency (or support) of an itemset X is simply P(X)

         Example: in the “Play Tennis” data

         P(Humidity = Normal ∧ Play = Yes ∧ Windy = False) = 4/14

         The accuracy (or confidence) of an association rule if Y = y
         then Z = z is

                 P(Z = z | Y = y)

         Example

         P(Windy = False ∧ Play = Yes | Humidity = Normal) = 4/7

Play Tennis Example

         Day   Outlook    Temperature   Humidity   Wind    PlayTennis
         D1    Sunny      Hot           High       False   No
         D2    Sunny      Hot           High       True    No
         D3    Overcast   Hot           High       False   Yes
         D4    Rain       Mild          High       False   Yes
         D5    Rain       Cool          Normal     False   Yes
         D6    Rain       Cool          Normal     True    No
         D7    Overcast   Cool          Normal     True    Yes
         D8    Sunny      Mild          High       False   No
         D9    Sunny      Cool          Normal     False   Yes
         D10   Rain       Mild          Normal     False   Yes
         D11   Sunny      Mild          Normal     True    Yes
         D12   Overcast   Mild          High       True    Yes
         D13   Overcast   Hot           Normal     False   Yes
         D14   Rain       Mild          High       True    No
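The two worked examples (frequency 4/14 and accuracy 4/7) can be checked by counting rows of the Play Tennis table. The sketch below is illustrative only: the function names `count`, `support` and `confidence` are assumptions, and the table columns Wind and PlayTennis are renamed Windy and Play to match the notation of the examples.

```python
# Play Tennis table, days D1..D14.
# Assumption: columns "Wind"/"PlayTennis" are renamed "Windy"/"Play" here.
COLUMNS = ("Outlook", "Temperature", "Humidity", "Windy", "Play")
ROWS = [
    ("Sunny", "Hot", "High", False, "No"),        # D1
    ("Sunny", "Hot", "High", True, "No"),         # D2
    ("Overcast", "Hot", "High", False, "Yes"),    # D3
    ("Rain", "Mild", "High", False, "Yes"),       # D4
    ("Rain", "Cool", "Normal", False, "Yes"),     # D5
    ("Rain", "Cool", "Normal", True, "No"),       # D6
    ("Overcast", "Cool", "Normal", True, "Yes"),  # D7
    ("Sunny", "Mild", "High", False, "No"),       # D8
    ("Sunny", "Cool", "Normal", False, "Yes"),    # D9
    ("Rain", "Mild", "Normal", False, "Yes"),     # D10
    ("Sunny", "Mild", "Normal", True, "Yes"),     # D11
    ("Overcast", "Mild", "High", True, "Yes"),    # D12
    ("Overcast", "Hot", "Normal", False, "Yes"),  # D13
    ("Rain", "Mild", "High", True, "No"),         # D14
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]


def count(itemset, data=DATA):
    """Number of records matching every (variable = value) pair in itemset."""
    return sum(all(rec[var] == val for var, val in itemset.items()) for rec in data)


def support(itemset, data=DATA):
    """Frequency P(X) of the itemset X."""
    return count(itemset, data) / len(data)


def confidence(antecedent, consequent, data=DATA):
    """Accuracy P(Z = z | Y = y) of the rule 'if antecedent then consequent'."""
    both = {**antecedent, **consequent}
    return count(both, data) / count(antecedent, data)


itemset = {"Humidity": "Normal", "Play": "Yes", "Windy": False}
print(f"{count(itemset)}/{len(DATA)}")  # prints 4/14

y = {"Humidity": "Normal"}
z = {"Windy": False, "Play": "Yes"}
print(f"{count({**y, **z})}/{count(y)}")  # prints 4/7
```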

Generating rules from itemsets

         An itemset of size k can give rise to 2^k − 1 rules
         Example. Itemset

                 Windy=False, Play=Yes, Humidity=Normal

         gives rise to 7 rules including
   IF Windy=False and Humidity=Normal THEN Play=Yes            (4/4)
   IF Play=Yes THEN Humidity=Normal and Windy=False            (4/9)
   IF True THEN Windy=False and Play=Yes and Humidity=Normal   (4/14)

         Select association rules that have accuracy greater than some
         threshold a

Finding Frequent Itemsets

         Task: find all itemsets with frequency ≥ s
         Key observation: a set X of variables can be frequent only
         if all subsets of variables are frequent (monotonicity
         property), i.e. P(A, B) ≤ P(A) and P(A, B) ≤ P(B)
         So find frequent singleton sets, then sets of size 2, and so
         on ...
         An efficient algorithm using this idea for finding frequent
         itemsets is the APRIORI algorithm (Agrawal and Srikant
         (1994), Mannila et al (1994))
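The rule enumeration on the "Generating rules from itemsets" slide can be sketched by splitting the itemset into an antecedent and a non-empty consequent in every possible way. This is an illustrative sketch, not the course's code: the names `count` and `rules_from_itemset` and the threshold value 0.9 are assumptions, and only the three relevant columns of the Play Tennis table are kept.

```python
from itertools import combinations

# (Humidity, Windy, Play) for days D1..D14 of the Play Tennis table.
ROWS = [
    ("High", False, "No"), ("High", True, "No"), ("High", False, "Yes"),
    ("High", False, "Yes"), ("Normal", False, "Yes"), ("Normal", True, "No"),
    ("Normal", True, "Yes"), ("High", False, "No"), ("Normal", False, "Yes"),
    ("Normal", False, "Yes"), ("Normal", True, "Yes"), ("High", True, "Yes"),
    ("Normal", False, "Yes"), ("High", True, "No"),
]
DATA = [dict(zip(("Humidity", "Windy", "Play"), row)) for row in ROWS]


def count(items, data=DATA):
    """Number of records matching every (variable, value) pair in items."""
    return sum(all(rec[var] == val for var, val in items) for rec in data)


def rules_from_itemset(itemset, data=DATA):
    """Yield (antecedent, consequent, hits, antecedent_count) for each rule.

    The consequent is always non-empty; an empty antecedent means "IF True".
    """
    items = sorted(itemset.items())
    hits = count(items, data)  # records matching the whole itemset
    for size in range(len(items)):  # antecedent sizes 0 .. k-1
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent, hits, count(antecedent, data)


itemset = {"Windy": False, "Play": "Yes", "Humidity": "Normal"}
rules = list(rules_from_itemset(itemset))
for antecedent, consequent, hits, total in rules:
    print(f"IF {antecedent or 'True'} THEN {consequent}  ({hits}/{total})")

# Keep only rules whose accuracy exceeds a threshold a (0.9 is an assumption).
a = 0.9
selected = [r for r in rules if r[2] / r[3] > a]
```

For k = 3 the loop visits 1 + 3 + 3 = 7 antecedents, which matches 2^k − 1.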
APRIORI algorithm

  (for binary variables)
  i = 1
  C_i = {{A} | A is a variable}
  while C_i is not empty
       database pass:
             for each set in C_i test if it is frequent
             let L_i be the collection of frequent sets from C_i
       candidate formation:
             let C_{i+1} be those sets of size i + 1
             all of whose subsets are frequent
       i = i + 1
  end while

  A single database pass is linear in |C_i| n; make a pass for each i
  until C_i is empty
  Candidate formation
       Find all pairs of sets {U, V} from L_i such that U ∪ V has
       size i + 1 and test if this union is really a potential
       candidate. O(|L_i|^3)
  Example: 5 three-item sets
  (ABC), (ABD), (ACD), (ACE), (BCD)
  Candidate four-item sets
  (ABCD) ok
  (ACDE) not ok because (CDE) is not present above
  Data structure techniques can be used for speedups
  Other algorithms possible for finding frequent itemsets, e.g.
  Han’s FP-growth
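The level-wise loop and the candidate formation step can be sketched in Python as below. This is a minimal, unoptimised sketch under stated assumptions, not a reference implementation: the function names `form_candidates` and `apriori`, the representation of records as Python sets, and the sample baskets are all made up for illustration. It also starts from the items that actually occur in the data rather than from every variable, which gives the same frequent sets.

```python
from itertools import combinations


def form_candidates(frequent, size):
    """Candidate formation: unions of pairs from `frequent` that have `size`
    items and all of whose (size - 1)-item subsets are themselves frequent."""
    frequent = set(frequent)
    candidates = set()
    for u, v in combinations(frequent, 2):
        union = u | v
        if len(union) == size and all(
            frozenset(sub) in frequent for sub in combinations(union, size - 1)
        ):
            candidates.add(union)
    return candidates


def apriori(transactions, min_support):
    """Return {itemset: frequency} for itemsets with frequency >= min_support."""
    n = len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}  # C_1
    result = {}
    i = 1
    while candidates:
        # Database pass: one linear scan that counts every candidate in C_i.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: k / n for c, k in counts.items() if k / n >= min_support}  # L_i
        result.update(level)
        candidates = form_candidates(level, i + 1)  # C_{i+1}
        i += 1
    return result


# The slide's example: five frequent three-item sets.
l3 = {frozenset(s) for s in ("ABC", "ABD", "ACD", "ACE", "BCD")}
c4 = form_candidates(l3, 4)  # only ABCD survives; ACDE fails because CDE is absent

# Hypothetical baskets, for illustration only.
baskets = [
    {"wine", "cheese", "bread"},
    {"wine", "cheese"},
    {"bread", "milk"},
    {"wine", "bread", "cheese", "milk"},
    {"milk"},
]
frequent = apriori(baskets, min_support=0.5)
for itemset, freq in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), freq)
```

The pairwise join here is the simple form described on the slide; practical implementations usually use prefix-based joins and hash trees or similar structures to speed up both steps.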

APRIORI and Algorithm Components

      Task: Rule Pattern Discovery
      Structure: Association Rules
      Score Function: Support
      Search: Breadth First with Pruning
      Data Management Technique: Linear Scans

Comments on Association Rules

      Finding association rules is just the beginning of a data mining
      effort. Some will be trivial, others interesting. The challenge is to
      select potentially interesting rules
      Finding association rules as Exploratory Data Analysis
      Trivial rule example:

              pregnant ⇒ female

      with accuracy 1!
      For rule A ⇒ B, it can be useful to compare P(B|A) to P(B)
      APRIORI algorithm can be generalized to frequent structure
      mining, e.g. finding episodes from sequences or
      frequently-occurring trees
      Example application: Health Insurance Commission (HIC) in
      Australia detected patterns of ordering of medical tests that
      suggested that some of the tests ordered were unnecessary
      (Cabena et al, 1998)
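The comparison of P(B|A) with P(B) can be sketched on the Play Tennis data for the rule Humidity=Normal ⇒ Play=Yes. This is an illustrative sketch only: the helper name `compare_to_baseline` is an assumption, and the ratio P(B|A)/P(B) it reports (often called lift) is one possible way to make the comparison, not something the slides prescribe.

```python
from fractions import Fraction

# (Humidity, Play) for days D1..D14 of the Play Tennis table.
ROWS = [
    ("High", "No"), ("High", "No"), ("High", "Yes"), ("High", "Yes"),
    ("Normal", "Yes"), ("Normal", "No"), ("Normal", "Yes"), ("High", "No"),
    ("Normal", "Yes"), ("Normal", "Yes"), ("Normal", "Yes"), ("High", "Yes"),
    ("Normal", "Yes"), ("High", "No"),
]


def compare_to_baseline(rows, a, b):
    """Return (P(B|A), P(B)) for predicates a and b evaluated on each row."""
    n_a = sum(1 for r in rows if a(r))
    n_ab = sum(1 for r in rows if a(r) and b(r))
    n_b = sum(1 for r in rows if b(r))
    return Fraction(n_ab, n_a), Fraction(n_b, len(rows))


p_b_given_a, p_b = compare_to_baseline(
    ROWS,
    a=lambda r: r[0] == "Normal",  # A: Humidity = Normal
    b=lambda r: r[1] == "Yes",     # B: Play = Yes
)
ratio = p_b_given_a / p_b  # values above 1 mean A makes B more likely
print(p_b_given_a, p_b, ratio)
```

A rule whose accuracy is high only because B is common anyway will have a ratio close to 1, which is one signal that the rule is uninteresting.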

    Finding frequent itemsets
    Done with APRIORI algorithm
    Given frequent itemsets, construct association rules with
    accuracy > a
    Select interesting rules
    Generalize to frequent structure mining

