
                  Mining Association Rules in Large Databases
                          By Group 10
                   Sadler Divers 103315414
                     Beili Wang 104522400
                     Xiang Xu 106067660
                  Xiaoxiang Zhang 105635826


              Spring 2007 - CSE634 DATA MINING
                   Professor Anita Wasilewska
 Department of Computer Sciences - Stony Brook University - SUNY
                Sources/References
[1] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", 2nd
     Edition, Morgan Kaufmann Publishers, August 2006.
[2] A. Wasilewska, "Data Mining: Concepts and Techniques", Course Slides.
[3] J. Han, "Data Mining: Concepts and Techniques", Book Slides.
[4] T. Brijs et al., “Using Association Rules for Product Assortment
     Decisions: A Case Study”, KDD-99 ACM 1999.
[5] A. Savasere, E. Omiecinski, and S. Navathe. An Efficient Algorithm for
     Mining Association Rules in Large Databases. VLDB'95, 432-444,
     Zurich, Switzerland.
[6] J. Han and Y. Fu, ''Discovery of Multiple-Level Association Rules from
     Large Databases'', In Proc. of 1995 Int. Conf. on Very Large Data Bases
     (VLDB'95).
[7] M. Kamber, J. Han, and J. Y. Chiang. "Metarule-guided mining of multi-
     dimensional association rules using data cubes". In Proc. 3rd Int. Conf.
     Knowledge Discovery and Data Mining (KDD'97).
[8] B. Lent, A. Swami, and J. Widom. "Clustering association rules". In Proc.
     1997 Int. Conf. Data Engineering (ICDE'97).
[9] S. Brin, R. Motwani, and C. Silverstein. "Beyond Market Baskets:
     Generalizing Association Rules to Correlations". Proceedings of the 1997
     ACM SIGMOD International Conference on Management of Data.
            Goal and Overview
• Goals:
  – Introduce the concepts of frequent patterns,
    associations, and correlations;
  – Explain how they can be mined efficiently.
• Overview:
  –   Introduction and Apriori Algorithm
  –   Improving the Efficiency of Apriori
  –   Mining Various Kinds of Association Rules
  –   From Association Mining to Correlation Analysis
Introduction and Apriori
       Algorithm

       Sadler Divers
                     References
[1] J. Han and M. Kamber, "Data Mining: Concepts and
   Techniques", 2nd Edition, Morgan Kaufmann Publishers,
   August 2006.

[2] A. Wasilewska, "Data Mining: Concepts and Techniques",
   Course Slides.

[3] J. Han, "Data Mining: Concepts and Techniques", Book
   Slides.

[4] T. Brijs et al., “Using Association Rules for Product
   Assortment Decisions: A Case Study”, KDD-99 ACM 1999.
      Mining Association Rules
• Definition
Association rule mining is the process of finding
  frequent patterns or associations within the data of
  a database or a set of databases.

• Why?
To gain Information, Knowledge, Money, etc.
             Applications
Market Basket Analysis

Cross-Marketing

Catalog Design

Product Assortment Decision
            How is it done?
Approaches:

  • Apriori Algorithm

  • FP-Growth (Frequent Pattern Growth)

  • Vertical Format
      Concepts and Definitions
• Let I = {I1, I2, …, Im} be a set of items

• Let D be a set of DB transactions

• Let T be a particular transaction

• An association rule is of the form A => B,
  where A ⊂ I, B ⊂ I, and A ∩ B = ∅
  Concepts & Definitions (continued)
• Support: The support of a rule, A => B, is the
  percentage of transactions in D, the DB,
  containing both A and B.

• Confidence: The percentage of transactions in
  D containing A that also contain B.
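A minimal Python sketch of these two measures on a toy transaction list (the
data and function names are illustrative, not from the slides):

    # Support and confidence of a rule A => B over a list of transactions.
    def support(itemset, transactions):
        """Fraction of transactions containing every item in `itemset`."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

    def confidence(A, B, transactions):
        """support of (A together with B) divided by support of A."""
        return support(set(A) | set(B), transactions) / support(A, transactions)

    transactions = [{"bread", "milk"}, {"bread", "butter"},
                    {"bread", "milk", "butter"}, {"milk"}]
    print(support({"bread", "milk"}, transactions))       # 0.5
    print(confidence({"bread"}, {"milk"}, transactions))  # 0.666... (2 of 3 bread baskets)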
  Concepts & Definitions (continued)
• Strong Rules: Rules that satisfy both a
  minimum support and a minimum confidence
  are said to be strong

• Itemset: Simply a set of items

• k-Itemset: a set of items with k items in it
  Concepts & Definitions (continued)
• Apriori Property: All non-empty subsets of a
  frequent itemset must also be frequent

• Frequent Itemset: An itemset is said to be
  frequent if it satisfies the minimum support
  threshold.
             Apriori Algorithm
• A two-step process

  – The join step: To find Ck, the set of candidate
    k-itemsets, join Lk-1 with itself.

  – Rules for joining (see the sketch below):
     • Order the items first so you can compare item by item
     • Two itemsets in Lk-1 are joinable only if their first
       (k-2) items are in common
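A sketch of the join step under these rules, with itemsets kept as sorted
tuples (the function name is mine, not from the slides):

    def apriori_join(L_prev):
        """Join L(k-1) with itself to produce candidate k-itemsets."""
        L_prev = [tuple(sorted(s)) for s in L_prev]
        candidates = set()
        for i in range(len(L_prev)):
            for j in range(i + 1, len(L_prev)):
                a, b = L_prev[i], L_prev[j]
                # joinable only when the first k-2 items agree and the last items differ
                if a[:-1] == b[:-1] and a[-1] != b[-1]:
                    candidates.add(tuple(sorted(a + (b[-1],))))
        return candidates

    print(apriori_join([("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]))
    # {('B', 'C', 'E')}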
     Apriori Algorithm (continued)
• The Prune step:

  – The “join” step produces candidate k-itemsets, but not
    all of them are frequent.

  – Scan the DB to see which itemsets are indeed frequent
    and discard the others.

• Stop when the “join” step produces an empty set
      Apriori Algorithm : Pseudo code
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a
  frequent k-itemset

• Pseudo-code (see the Python sketch after the example below):
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
       Ck+1 = candidates generated from Lk;
       for each transaction t in database do
           increment the count of all candidates in Ck+1 that are contained in t
       Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;

                              Source: A. Wasilewska, CSE 634, Lecture Slides
   The Apriori Algorithm—An Example (min support = 2)

Database TDB:
    Tid    Items
    10     A, C, D
    20     B, C, E
    30     A, B, C, E
    40     B, E

1st scan → C1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1 (frequent 1-itemsets): {A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (candidates joined from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → counts: {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
L2 (frequent 2-itemsets): {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3 (candidate joined from L2): {B, C, E}
3rd scan → L3: {B, C, E}: 2

                       Source: J. Han, "Data Mining: Concepts and Techniques", Book Slides
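One way to render the pseudocode above as runnable Python is sketched below. It
generates candidates from the items occurring in Lk and prunes them with the
Apriori property (every k-subset must be frequent) instead of the literal Lk
join, and it takes min_support as an absolute count; run on the TDB example
above it reproduces L1, L2, and L3 (all names here are illustrative):

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return every frequent itemset (as a frozenset) with its support count."""
        transactions = [frozenset(t) for t in transactions]
        counts = {}
        for t in transactions:                       # 1st scan: count 1-itemsets
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        L = {s: c for s, c in counts.items() if c >= min_support}
        frequent = dict(L)
        k = 1
        while L:
            items = sorted({i for s in L for i in s})
            candidates = [frozenset(c) for c in combinations(items, k + 1)
                          if all(frozenset(sub) in L for sub in combinations(c, k))]
            counts = {c: 0 for c in candidates}      # one scan per level
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            L = {s: c for s, c in counts.items() if c >= min_support}
            frequent.update(L)
            k += 1
        return frequent

    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    for s, c in sorted(apriori(tdb, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(s), c)                          # ends with ['B', 'C', 'E'] 2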
Generating Association Rules From
        Frequent Itemsets

• For each frequent itemset l, generate all
  nonempty subsets of l.

• For every nonempty subset s of l, output rule
  “s => (l - s)” if:

  support_count(l) / support_count(s) >= min_conf

  (where min_conf = minimum confidence threshold).
Association Rules from Example
• The frequent itemset is l = {B, C, E}. Generate all nonempty proper subsets:
   – {B, C}, {B, E}, {C, E}, {B}, {C}, {E}

• Calculate confidence:
   •   B ∧ C => E       Confidence = 2/2 = 100%
   •   B ∧ E => C       Confidence = 2/3 = 66%
   •   C ∧ E => B       Confidence = 2/2 = 100%
   •   B => C ∧ E       Confidence = 2/3 = 66%
   •   C => B ∧ E       Confidence = 2/3 = 66%
   •   E => B ∧ C       Confidence = 2/3 = 66%
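These confidences can be reproduced with a small rule-generation sketch; the
frequent counts are taken from the example above, and the min_conf of 70% is
illustrative:

    from itertools import combinations

    def generate_rules(frequent, min_conf):
        """Emit rules s => (l - s) whose confidence meets min_conf."""
        rules = []
        for l, count_l in frequent.items():
            if len(l) < 2:
                continue
            for r in range(1, len(l)):
                for s in map(frozenset, combinations(l, r)):
                    conf = count_l / frequent[s]     # support_count(l) / support_count(s)
                    if conf >= min_conf:
                        rules.append((set(s), set(l - s), conf))
        return rules

    freq = {frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
            frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
            frozenset("BCE"): 2}
    for a, b, conf in generate_rules(freq, 0.7):
        print(a, "=>", b, f"{conf:.0%}")             # only the 100% rules survive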
Improving the Efficiency of Apriori

            Beili Wang
                           References
[1] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for
     mining association rules in large databases. VLDB'95, 432-444, Zurich,
     Switzerland. <http://www.informatik.uni-
     trier.de/~ley/db/conf/vldb/SavasereON95.html>.

[2] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan
       Kaufmann Publishers. March 2006. Chapter 5, Section 5.2.3, Page 240.

[3] Presentation Slides of Prof. Anita Wasilewska
  Improving Apriori: General Ideas
• Challenges:
   – Multiple scans of transaction database
   – Huge number of candidates
   – Tedious workload of support counting for candidates
• General Ideas:
   – Reduce passes of transaction database scans
   – Shrink number of candidates
   – Facilitate support counting of candidates
    Source: textbook slide, 2nd Edition, Chapter 5, http://www-faculty.cs.uiuc.edu/~hanj/bk2/slidesindex.html
               Methods to Improve Apriori’s
                        Efficiency
• Hash-based itemset counting: A k-itemset whose corresponding
  hashing bucket count is below the threshold cannot be frequent.
An Effective Hash-Based Algorithm for Mining Association Rules <http://citeseer.ist.psu.edu/park95effective.html>

• Transaction reduction: A transaction that does not contain any
  frequent k-itemset is useless in subsequent scans (see the sketch
  after this slide).
Fast Algorithms for Mining Association Rules in Large Databases <http://citeseer.ist.psu.edu/agrawal94fast.html>

• Partitioning: Any itemset that is potentially frequent in DB must
  be frequent in at least one of the partitions of DB.
An Efficient Algorithm for Mining Association Rules in Large Databases <http://citeseer.ist.psu.edu/sarasere95efficient.html>

• Sampling: mining on a subset of given data, lower support
  threshold + a method to determine the completeness.
Sampling Large Databases for Association Rules <http://citeseer.ist.psu.edu/toivonen96sampling.html>

• Dynamic itemset counting: add new candidate itemsets only when
  all of their subsets are estimated to be frequent.
Dynamic Itemset Counting and Implication Rules for Market Basket Data <http://citeseer.ist.psu.edu/brin97dynamic.html>

                  Source: Presentation Slides of Prof. Anita Wasilewska, 07. Association Analysis, page 51
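As a tiny illustration of the transaction-reduction idea above (not code from
the cited papers; the fifth transaction is added just for this example), a
transaction that contains no frequent k-itemset can be dropped before the
next scan:

    def reduce_transactions(transactions, frequent_k):
        """Keep only transactions containing at least one frequent k-itemset."""
        frequent_k = [frozenset(s) for s in frequent_k]
        return [t for t in transactions
                if any(s <= frozenset(t) for s in frequent_k)]

    db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
          {"B", "E"}, {"A", "D"}]
    L2 = [{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]
    print(reduce_transactions(db, L2))
    # {'A', 'D'} is dropped: it cannot contain any frequent 3-itemset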
     Partition Algorithm: Basics
• Definition:
  A partition p ⊆ D of the database refers to any
  subset of the transactions contained in the database
  D. Any two different partitions are non-overlapping,
  i.e., pi ∩ pj = ∅ for i ≠ j.

• Ideas:
  Any itemset that is potentially frequent in DB must
  be frequent in at least one of the partitions of DB.
  Partition scans DB only twice:
  Scan 1: partition the database and find local frequent
  patterns.
  Scan 2: consolidate global frequent patterns.
                    Partition Algorithm
Initially the database D is logically partitioned into n partitions.

Phase I: read the entire database once; takes n iterations
   input: pi, where i = 1 ... n
   output: local large itemsets of all lengths, L^i_2, L^i_3, ..., L^i_l

Merge phase:
   input: local large itemsets of the same length from all n partitions
   output: combine and generate the global candidate itemsets. The set of global
   candidate itemsets of length j is computed as C^G_j = ∪ i=1..n L^i_j

Phase II: read the entire database again; takes n iterations
   input: pi, where i = 1 ... n; the global candidate set C^G
   output: counters for each global candidate itemset, counting their support

Algorithm output: itemsets that have the minimum global support along with their
support. The algorithm reads the entire database twice.
Partition Algorithm: Pseudo code




    Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 7,
                      <http://citeseer.ist.psu.edu/sarasere95efficient.html>
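The pseudocode figure from the paper is not reproduced above; a minimal,
self-contained Python sketch of the two-phase idea might look like the
following. Phase I uses brute-force local mining just to keep the sketch
short, whereas the real algorithm makes a level-wise pass over each partition:

    from itertools import combinations

    def local_frequent(partition, min_count):
        """Brute force for the sketch: count every itemset occurring in the partition."""
        counts = {}
        for t in partition:
            for r in range(1, len(t) + 1):
                for c in combinations(sorted(t), r):
                    counts[frozenset(c)] = counts.get(frozenset(c), 0) + 1
        return {s for s, c in counts.items() if c >= min_count}

    def partition_mining(transactions, n_partitions, min_support_fraction):
        """Phase I mines each partition locally, the merge step unions the local
        results into global candidates, and Phase II rescans the database once
        to count the global support of every candidate."""
        transactions = [frozenset(t) for t in transactions]
        step = -(-len(transactions) // n_partitions)          # ceiling division
        partitions = [transactions[i:i + step]
                      for i in range(0, len(transactions), step)]

        candidates = set()                                    # Phase I + merge
        for p in partitions:
            local_min = max(1, int(min_support_fraction * len(p)))
            candidates |= local_frequent(p, local_min)

        counts = {c: 0 for c in candidates}                   # Phase II
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        min_count = min_support_fraction * len(transactions)
        return {c: n for c, n in counts.items() if n >= min_count}

    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(sorted((sorted(s), n) for s, n in partition_mining(tdb, 2, 0.5).items()))
    # same frequent itemsets as the Apriori example earlier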
Partition Algorithm: Example
Consider a small database with four items I={Bread, Butter, Eggs, Milk}
and four transactions as shown in Table 1. Table 2 shows all itemsets for I.
Suppose that the minimum support and minimum confidence of an
association rule are 40% and 60%, respectively.




                         Source: A Survey of Association Rules
               <pandora.compsci.ualr.edu/milanova/7399-11/week10/ar.doc>
Partition Algorithm: Example




               Source: A Survey of Association Rules
     <pandora.compsci.ualr.edu/milanova/7399-11/week10/ar.doc>
                 Partition Size
Q: How to estimate the partition size from system
  parameters?
A: We must choose the partition size such that at least
  those itemsets that are used for generating the new
  large itemsets can fit in main memory.

The size is estimated based on:
  1. available main memory
  2. average length of the transactions
           Effect of Data Skew
Problem:
1. A gradual change in data characteristics, or any
    localized change in data, can lead to the generation
    of a large number of local large itemsets which may not
    have global support.
2. Fewer itemsets will be found in common between
    partitions, leading to a larger global candidate set.

Solution: Randomly reading the pages from the
    database is extremely effective in eliminating data
    skew.
Performance Comparison - Time




 Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 22,
                   <http://citeseer.ist.psu.edu/sarasere95efficient.html>
Performance Comparison – Disk IO




   Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 23,
                     <http://citeseer.ist.psu.edu/sarasere95efficient.html>
Performance Comparison – Scale-up




   Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 24,
                     <http://citeseer.ist.psu.edu/sarasere95efficient.html>
Parallelization in Parallel Database
The Partition algorithm shows that the partitions can essentially be
processed in parallel.

The parallel algorithm executes in four phases:
1. All the processing nodes independently generate the large
itemsets for their local data.

2. The large itemsets at each node are exchanged with all other
nodes.

3. At each node, the support of each itemset in the candidate set
is measured with respect to the local data.

4. The local counts at each node are sent to all other nodes. The
global support is the sum of all local supports.
                      Conclusion
• The Partition algorithm achieves both CPU and I/O improvements
  over the Apriori algorithm

• It scans the database at most twice, whereas in Apriori the number
  of scans is not known in advance and may be quite large.

• The inherent parallelism in the algorithm can be exploited for
  implementation on a parallel machine. It is suited for very
  large databases in a high data and resource contention
  environment such as an OLTP system.
Mining Various Kinds of
   Association Rules
        Xiang Xu
                  Outline
• Mining multilevel association

• Mining multidimensional association
  – Mining quantitative association
                           References
[1] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan
    Kaufmann Publishers, August 2000.

[2] A. Wasilewska, "Data Mining: Concepts and Techniques", Course Slides.

[3] J. Han, "Data Mining: Concepts and Techniques", Book Slides.

[4] J. Han and Y. Fu, ''Discovery of Multiple-Level Association Rules from Large
    Databases'', In Proc. of 1995 Int. Conf. on Very Large Data Bases (VLDB'95).

[5] M. Kamber, J. Han, and J. Y. Chiang. "Metarule-guided mining of multi-
    dimensional association rules using data cubes". In Proc. 3rd Int. Conf.
    Knowledge Discovery and Data Mining (KDD'97).

[6] B. Lent, A. Swami, and J. Widom. "Clustering association rules". In Proc.
    1997 Int. Conf. Data Engineering (ICDE'97).
Mining Multilevel Association
  Multilevel Association Rules
• Rules generated from association rule
  mining with concept hierarchies, e.g.:

       milk → bread [8%, 70%]

       2% milk → wheat bread [2%, 72%]

• Encoded transaction: T1 {111, 121, 211, 221}
     Source: J. Han and Y. Fu, ''Discovery of Multiple-Level Association Rules from Large
          Databases'', Proc. of 1995 Int. Conf. on Very Large Data Bases (VLDB'95).
    Multilevel Association:
  Uniform vs Reduced Support
• Uniform support



• Reduced support
             Uniform Support
• Same minimum support threshold for all levels



Level 1                             Milk
min support = 5%
                             [support = 10%]



Level 2               2% Milk                Skim Milk
min support = 5%   [support = 6%]          [support = 4%]
            Reduced Support
• Reduced minimum support threshold at lower levels



Level 1                             Milk
min support = 5%
                             [support = 10%]



Level 2               2% Milk                Skim Milk
min support = 3%   [support = 6%]          [support = 4%]
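A tiny illustration of the two policies in Python; the thresholds and supports
are the numbers on these slides, while the data layout is mine:

    # Reduced support: each concept level gets its own minimum support.
    min_support = {1: 0.05, 2: 0.03}           # level -> minimum support
    supports = {("milk", 1): 0.10,
                ("2% milk", 2): 0.06,
                ("skim milk", 2): 0.04}

    for (item, level), sup in supports.items():
        frequent = sup >= min_support[level]
        print(f"level {level}: {item:10s} support={sup:.0%} -> "
              f"{'frequent' if frequent else 'not frequent'}")
    # Under uniform support (5% at every level) "skim milk" (4%) would be
    # pruned; with reduced support (3% at level 2) it is kept.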
   Mining Multilevel: Top-Down
      Progressive Deepening
• Find multilevel frequent itemsets
   – High-level frequent itemsets
             milk (15%), bread (10%)
   – Lower-level “weaker” frequent itemsets
             2% milk (5%), wheat bread (4%)
• Generate multilevel association rules
   – High-level strong rules
             milk → bread [8%, 70%]
   – Lower-level “weaker”rules:
             2% milk → wheat bread [2%, 72%]
Generation of Flexible Multilevel
      Association Rules
• Association rules with alternative multiple hierarchies
      2% milk → Old Mills bread                <{11*},{2*1}>
• Level-crossed association rules
      2% milk → Old Mills white bread         <{11*},{211}>




                                    Source: J. Han and Y. Fu, ''Discovery of
                                    Multiple-Level Association Rules from Large
                                    Databases'', Proc. of 1995 Int. Conf. on Very
                                    Large Data Bases (VLDB'95).
       Redundant Multilevel
     Association Rules Filtering
• Some rules may be redundant due to “ancestor”
  relationships between items
      milk → wheat bread [8%, 70%]
      2% milk → wheat bread [2%, 72%]
• First rule is an ancestor of the second rule
• A rule is redundant if its support and confidence are
  close to their “expected” values, based on the rule’s
  ancestor.
Mining Multidimensional
   Association Rules
              Multidimensional
              Association Rules
• Single-dimensional rules
      buys(X, “milk”) → buys(X, “bread”)
• Multidimensional rules(2 dimensions/predicates)
   – Inter-dimension assoc. rules (no repeated predicates)
      age(X,”19-25”) ∧ occupation(X,“student”) →
                                            buys(X, “coke”)
   – Hybrid-dimension assoc. rules (repeated predicates)
     age(X,”19-25”) ∧ buys(X, “popcorn”) →
                                            buys(X, “coke”)
     Categorical Attributes and
      Quantitative Attributes
• Categorical Attributes
  – Finite number of possible values, no ordering
    among values

• Quantitative Attributes
  – Numeric, implicit ordering among values
Mining Quantitative Associations
• Static discretization based on predefined concept
  hierarchies

• Dynamic discretization based on data distribution

• Clustering: Distance-based association
         Static Discretization of
         Quantitative Attributes
• Discretized prior to mining using concept
  hierarchy. Numeric values are replaced by ranges.
• In a relational database, finding all frequent k-
  predicate sets requires k or k+1 table scans.
• A data cube is well suited
  for mining (faster).



  Source: J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers,
                                            August 2000.
   Dynamic Discretization of
  Quantitative Association Rules
• Numeric attributes are dynamically discretized
   – The confidence of the rules mined is maximized


• Cluster adjacent association rules to form general
  rules using a 2-D grid: ARCS (Association Rules
  Clustering System)


         Source: B. Lent, A. Swami, and J. Widom. “Clustering association rules”. In
                     Proc. 1997 Int. Conf. Data Engineering (ICDE'97).
     Clustering Association Rules:
               Example
age(X,34) ∧ income(X,“30 - 40K”)
   → buys(X,“high resolution TV”)
age(X,35) ∧ income(X,“30 - 40K”)
   → buys(X,“high resolution TV”)
age(X,34) ∧ income(X,“40 - 50K”)
   → buys(X,“high resolution TV”)
age(X,35) ∧ income(X,“40 - 50K”)
   → buys(X,“high resolution TV”)


age(X, “34 - 35”) ∧ income(X,“30 - 50K”)   Source: J. Han and M. Kamber, "Data
   → buys(X,“high resolution TV”)          Mining: Concepts and Techniques", Morgan
                                           Kaufmann Publishers, August 2000.
       Mining Distance-based
      Association Rules: Motive
• Binning methods like equi-width and equi-depth do
  not capture the semantics of interval data

      Price ($)   Equi-width (width $10)   Equi-depth (depth 2)   Distance-based
          7              [0, 10]                  [7, 20]              [7, 7]
         20             [11, 20]                 [22, 50]             [20, 22]
         22             [21, 30]                 [51, 53]             [50, 53]
         50             [31, 40]
         51             [41, 50]
         53             [51, 60]

            • Source: J. Han and M. Kamber, "Data Mining: Concepts and
            Techniques", Morgan Kaufmann Publishers, August 2000.
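A short sketch of how the first two binning styles could be computed (the
interval-boundary convention differs slightly from the table above;
distance-based bins would instead come from clustering, described next):

    prices = [7, 20, 22, 50, 51, 53]

    def equi_width(values, width):
        """Group values into fixed-width intervals."""
        bins = {}
        for v in values:
            lo = (v // width) * width
            bins.setdefault((lo, lo + width - 1), []).append(v)
        return bins

    def equi_depth(values, depth):
        """Group sorted values into bins holding `depth` values each."""
        values = sorted(values)
        return [values[i:i + depth] for i in range(0, len(values), depth)]

    print(equi_width(prices, 10))  # {(0, 9): [7], (20, 29): [20, 22], (50, 59): [50, 51, 53]}
    print(equi_depth(prices, 2))   # [[7, 20], [22, 50], [51, 53]]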
            Clusters and Distance
               Measurements
• S[X]: A set of N tuples t1, t2, ..., tN projected on the attribute
  set X.
• The diameter of S[X]:

      d(S[X]) = ( Σ_{i=1..N} Σ_{j=1..N} dist_X( ti[X], tj[X] ) ) / ( N (N − 1) )

• dist_X: Distance metric on the values for the attribute set X (e.g.
  Euclidean distance or Manhattan distance)
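A direct rendering of the diameter formula for a single numeric attribute,
using absolute difference as the distance metric (values taken from the price
table on the previous slide):

    def diameter(values):
        """Average pairwise distance of the N projected tuples in S[X]."""
        n = len(values)
        total = sum(abs(x - y) for x in values for y in values)
        return total / (n * (n - 1))

    print(diameter([20, 22]))        # 2.0  -- the distance-based bin [20, 22]
    print(diameter([50, 51, 53]))    # 2.0  -- the distance-based bin [50, 53]
    print(diameter([7, 20, 22]))     # 10.0 -- a much looser grouping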
         Clusters and Distance
         Measurements (Cont.)
• Cluster CX
   – Density threshold d0^X:   d(CX) ≤ d0^X
   – Frequency threshold s0:   |CX| ≥ s0

• Finding clusters and distance-based rules
   – A modified version of BIRCH
   – Density threshold replaces support
   – Degree of association threshold replaces confidence
                     Conclusion
• Mining multilevel association
   –   Uniform and reduced support
   –   Top-down progressive deepening approach
   –   Generation of flexible multilevel association rules
   –   Redundant multilevel association rules filtering
• Mining multidimensional association
   – Mining quantitative association
      • Static Discretization of Quantitative Attributes
      • ARCS (Association Rules Clustering System)
      • Mining Distance-based Association Rules
From Association Mining to
   Correlation Analysis

       Xiaoxiang Zhang
          Sources/References:
[1] J. Han and M. Kamber. “Data Mining: Concepts and
  Techniques”. Morgan Kaufmann Publishers.

[2] S. Brin, R. Motwani, and C. Silverstein. “Beyond
  Market Baskets: Generalizing Association Rules to
  Correlations”. Proceedings of the 1997 ACM SIGMOD
  International Conference on Management of Data.
• Why do we need correlation analysis?

  Because correlation analysis can reveal which
  strong association rules are really interesting
  and useful.

• Association rule mining often generates a huge
  number of rules, but a majority of them either
  are redundant or do not reflect the true
  correlation relationship among data objects.
                   Example

The contingency table below summarizes the purchases of tea and coffee;
each cell is a percentage of all transactions:

                 coffee    not coffee    total
   tea             20%          5%        25%
   not tea         70%          5%        75%
   total           90%         10%       100%

A table of this form is called a contingency table.
• Let us apply the support-confidence framework to this example. If the
  support and confidence thresholds are [10%, 60%], then the following
  association rule is discovered:

      buys(X, “Tea”) => buys(X, “Coffee”)   [support = 20%, confidence = 80%]

• However, tea=>coffee is
  misleading, since the
  probability of purchasing
  coffee is 90%, which is larger
  than 80%.
• The above example illustrates that the
  confidence of a rule A=>B can be deceiving in
  that it is only an estimate of the conditional
  probability of itemset B given itemset A.
                Measuring Correlation
• One way of measuring correlation is

      corr(A, B) = P(A ∪ B) / ( P(A) · P(B) )

  where P(A ∪ B) is the fraction of transactions containing both A and B
  (i.e., containing the itemset A ∪ B).
• If the resulting value is equal to 1, then A
  and B are independent. If the resulting value is greater than 1,
  then A and B are positively correlated; otherwise A and B are
  negatively correlated.
  For the above example,

      P(tc) / ( P(t) · P(c) ) = 0.2 / (0.25 × 0.9) ≈ 0.89,

  which is less than 1, indicating there is a negative correlation
  between buying tea and buying coffee.
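The same calculation in a couple of lines of Python, with the probabilities
taken from the tea/coffee example above:

    def correlation(p_ab, p_a, p_b):
        """corr(A, B) = P(A and B) / (P(A) * P(B)); 1 means independent."""
        return p_ab / (p_a * p_b)

    corr = correlation(p_ab=0.20, p_a=0.25, p_b=0.90)
    print(round(corr, 2))   # 0.89 -> buying tea and buying coffee are negatively correlated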
• Is the above way of measuring the correlation
  good enough?

  We can indeed compute a correlation value, but this
  alone does not tell us whether the value is
  statistically significant.

• So, we introduce:
  The chi-squared test for independence
  The chi-squared test for independence
• Let R = {i1, ī1} × ... × {ik, īk}, and let r = r1 ... rk ∈ R.
• Here R is the set of all possible basket values, and r is
  a single basket value. Each value of r denotes a cell; this
  terminology comes from the view that R is a k-
  dimensional contingency table.
  Let O(r) denote the number of baskets falling into cell
  r.
• The chi-squared statistic is defined as:

      χ² = Σ_{r ∈ R} ( O(r) − E[r] )² / E[r]
   What does the chi-squared statistic mean?

• The chi-squared statistic as defined will specify
  whether all k items are k-way independent.

• If χ² is equal to 0, then all the variables are really
  independent. If it is larger than the cutoff value at a
  given significance level, then we say all the variables are
  dependent (correlated); else we say all the variables
  are independent.

• Note that the cutoff value for any given significance
  level can be obtained from widely available tables for
  the chi-squared distribution.
• Example of calculating χ²

      χ² = Σ_{r ∈ R} ( O(r) − E[r] )² / E[r]

  (Worked contingency-table example; the computed value is χ² = 0.900.)

• If the cutoff at the 95% significance level is 3.84,
  then 0.900 < 3.84, so the two items are independent.
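A sketch of the chi-squared statistic for a 2x2 contingency table over two
items A and B (the observed counts below are hypothetical, not the table from
the slide; expected counts assume the two items are independent):

    def chi_squared_2x2(observed):
        """`observed` maps (A present?, B present?) -> count for a 2x2 table."""
        n = sum(observed.values())
        p_a = (observed[(1, 1)] + observed[(1, 0)]) / n    # marginal P(A)
        p_b = (observed[(1, 1)] + observed[(0, 1)]) / n    # marginal P(B)
        chi2 = 0.0
        for a in (0, 1):
            for b in (0, 1):
                e = n * (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
                chi2 += (observed[(a, b)] - e) ** 2 / e
        return chi2

    obs = {(1, 1): 27, (1, 0): 23, (0, 1): 23, (0, 0): 27}
    print(round(chi_squared_2x2(obs), 2))   # 0.64 < 3.84, so A and B are treated as independent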
                Correlation Rules

• We have the tool to test whether a given itemset is
  independent or dependent (correlated).
• We are almost ready to mine of rules that identify
  correlations, or correlation rules.
• Then what is a correlation rule?
  A correlation rule is of the form {i1, i2, ..., im},
  where the occurrences of the items i1, i2, ..., im
  are correlated.
Upward Closed Property of Correlation

• An advantage of correlation is that it is upward
  closed. This means that if a set S of items is
  correlated, then every superset of S is also
  correlated. In other words, adding items to a
  set of correlated items does not remove the
  existing correlation.
   Minimal Correlated Itemsets
• Minimal correlated itemsets are itemsets that
  are correlated although none of their subsets
  is correlated.
• Minimal correlated itemsets form a border
  within the lattice.
• Consequently, we reduce the data mining task
  to the problem of computing a border in the
  lattice.
  Support and Significant Concepts
• Support:
  A set of items S has support s at the p% level
  if at least p% of the cells in the
  contingency table for S have a value of at least s.
• Significant:
  If an itemset is supported and minimally
  correlated, we say this itemset is significant.
    Algorithm Chi-squared Support
•  Input: A chi-squared significance level α,
          support s, support fraction p > 0.25,
          Basket data B.
• Output: A set of minimal correlated itemsets,
           from B.
1. For each item i in I, count O(i).
2. Initialize Cand ← ∅, Sig ← ∅, Notsig ← ∅.
3. For each pair of items ia, ib such that O(ia) > s
   and O(ib) > s, add {ia, ib} to Cand.
4. Notsig ← ∅.
5. If Cand is empty, then return Sig and terminate.
6. For each itemset in Cand, construct the
   contingency table for the itemset. If fewer than p
   percent of the cells have a count of at least s, then go to step 8.
7. If the chi-squared value exceeds the cutoff, then add
   the itemset to Sig, else add the itemset to Notsig.
8. Continue with the next itemset in Cand. If there are
   no more itemsets in Cand, then set Cand to be the set
   of all sets S such that every subset of size |S|−1 of S is
   in Notsig. Go to step 4.
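A simplified, pairs-only Python sketch of steps 1 to 7 (it does not grow
candidates to larger itemsets, the function and variable names are mine, and
it assumes every expected cell count is positive, in line with the Limitation
slide below):

    from itertools import combinations

    def chi_squared_support_pairs(baskets, s, p, chi_cutoff=3.84):
        """Return pairs of supported items whose contingency table is both
        supported (at least a fraction p of cells have count >= s) and
        significantly correlated (chi-squared above the cutoff)."""
        baskets = [frozenset(b) for b in baskets]
        n = len(baskets)
        items = {i for b in baskets for i in b}
        count = {i: sum(1 for b in baskets if i in b) for i in items}           # step 1

        sig = []
        for a, b in combinations(sorted(i for i in items if count[i] > s), 2):  # step 3
            table = {(x, y): 0 for x in (0, 1) for y in (0, 1)}  # 2x2 contingency table
            for t in baskets:
                table[(int(a in t), int(b in t))] += 1
            if sum(1 for c in table.values() if c >= s) < p * len(table):       # step 6
                continue
            chi2 = 0.0                                                          # step 7
            for (x, y), o in table.items():
                pa = count[a] / n if x else 1 - count[a] / n
                pb = count[b] / n if y else 1 - count[b] / n
                e = n * pa * pb
                chi2 += (o - e) ** 2 / e
            if chi2 > chi_cutoff:
                sig.append({a, b})
        return sig

    toy = [{"tea", "coffee"}, {"coffee"}, {"coffee"}, {"tea"}, {"coffee", "milk"}]
    print(chi_squared_support_pairs(toy, s=1, p=0.25))
    # [] -- on this tiny sample no pair is significantly correlated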
Example:
• I: { i1, i2, i3, i4, i5}
• Cand: { {i1, i2}, {i1, i3}, {i1, i5}, {i3, i5}, {i2, i4},
  {i3, i4} }
• Sig: { {i1, i2} }
• Notsig: { {i1, i3}, {i1, i5}, {i3, i5}, {i2, i4} }
• Cand: { {i1,i3,i5} }
                 Limitation
Use the chi-squared test only if:
- All cells in the contingency table have an
  expected value greater than 1.
- At least 80% of the cells in the contingency
  table have an expected value greater than 5.
                 Conclusion
• The use of the chi-squared test is solidly
  grounded in statistical theory.

• The chi-squared statistic simultaneously and
  uniformly takes into account all possible
  combinations of the presence and absence of
  the various attributes being examined as a
  group.
Thank You!