Parallel Mining Association Rules in Distributed Memory System

                          Xuehai Wang                            Yingbo Miao
                     Faculty of Computer Science           Faculty of Computer Science
                         Dalhousie University                   Dalhousie University
                      Halifax, Canada B3H 1W5                Halifax, Canada B3H 1W5
          ∼xwang  ∼ymiao

    We consider the problem of mining association rules on a distributed memory
system, the CGM1 system, which has 32 nodes. Furthermore, since each node of the
CGM1 system consists of two processors that share the same resources within the
node, we can exploit this feature to employ a shared memory Apriori algorithm
within each node.

1        Introduction
With the development of hardware, especially of large-capacity storage, many
organizations have built large databases and collected huge volumes of data. These
organizations want to extract useful information from this ultra large amount of data,
and traditional methods are no longer sufficient to handle it.
    Association rule mining, first proposed by Agrawal, Imielinski and Swami, tries to
find "frequent patterns, associations, correlations, or causal structures among sets of
items or objects in transaction databases, relational databases, etc." In other words, we
want to find out the relation or dependency of the occurrence of one item on the
occurrence of other items.
    Many algorithms in this area, both sequential and parallel, have been proposed.
Since association rule mining is dedicated to handling ultra large amounts of data, the
time and resource complexity have to be considered carefully; hence parallel algorithms
are desirable. In this paper, we explore some algorithms, especially parallel algorithms,
and examine the trade-offs among them. We also implement a specific algorithm both
sequentially and in parallel, and measure the speedup of the parallel algorithm over the
sequential one.
    ∗ This paper is the course project for CSCI 6702, Parallel Computing.

     The parallel algorithm will be implemented on the CGM1 cluster, which has 32 nodes
with 2 processors on each node.
     The organization of the rest of the paper is as follows. Section 2 gives a brief review
of the problem of mining association rules and some sequential algorithms. Section 3 gives
the description of some parallel algorithms. Section 4 presents our approach.

2     Overview of association rules
2.1     Association Rule
The association rule problem was introduced in [2]. It focuses on discovering relationships
among items in a transactional database. Association rules can be defined formally as
follows.
   Let I = {i1 , i2 , . . . , im } be a set of items. Let D be a set of transactions, where each
transaction T is a set of items with T ⊆ I. Given a set X with X ⊆ I, we say T contains
X if X ⊆ T . An association rule is an implication of the form X ⇒ Y , where
X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The rule X ⇒ Y has support s in the transaction set D
if s% of the transactions in D contain X ∪ Y . The rule X ⇒ Y has confidence c in D
if c% of the transactions in D that contain X also contain Y .
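The support and confidence definitions above can be sketched in a few lines of Python. The transaction database and item names below are purely illustrative:

```python
# Sketch of the support and confidence definitions; D and the items
# are toy examples, not from any real dataset.

def support(D, itemset):
    """Fraction of transactions in D that contain every item of `itemset`."""
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(D, X, Y):
    """Fraction of the transactions containing X that also contain Y."""
    return support(D, X | Y) / support(D, X)

D = [{"bread", "milk"},
     {"bread", "butter"},
     {"bread", "milk", "butter"},
     {"milk"}]

X, Y = {"bread"}, {"milk"}
print(support(D, X | Y))    # 2 of 4 transactions contain {bread, milk} -> 0.5
print(confidence(D, X, Y))  # 2 of the 3 bread transactions contain milk
```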

2.2     Sequential Algorithms
2.2.1      Apriori

One of the most popular algorithms for finding association rules is Apriori [3]. Apriori uses
frequent k-itemsets, which contain k items belonging to the set of items I, to generate the
candidate (k+1)-itemsets. The main idea of Apriori is that any subset of a frequent itemset
must itself be frequent; therefore, if a subset of a candidate itemset is not frequent, the
candidate need not be generated or tested. The pseudo-code of Apriori is as follows:
    L1 ← frequent 1-itemsets //Lk is the set of frequent k-itemsets.
    k←2
    while Lk−1 ≠ ∅ do
      generate Ck from Lk−1 //Ck is the set of candidate k-itemsets.
      for all t ∈ D do
        increment the count of all candidates in Ck that are contained in t
      end for
      Lk ← all candidates in Ck with minimum support
      k←k+1
    end while
    return ∪k Lk

    {Subroutine for generating Ck from Lk−1 }

  {Step 1: Self-joining Lk−1 }
  insert into Ck
  select p.item1 , p.item2 , . . ., p.itemk−1 , q.itemk−1
  from Lk−1 p, Lk−1 q
  where p.item1 = q.item1 , . . ., p.itemk−2 = q.itemk−2 , p.itemk−1 < q.itemk−1
  {Step 2: Pruning}
  for all itemsets c in Ck do
    for all (k−1)-subsets s of c do
          if s ∉ Lk−1 then
             delete c from Ck
          end if
    end for
  end for
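The self-join and pruning steps above can be sketched in Python as follows; itemsets are represented as sorted tuples, and the sample L2 is illustrative:

```python
from itertools import combinations

def generate_candidates(L_prev):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets L_prev.
    Itemsets are represented as sorted tuples."""
    prev = set(L_prev)
    k = len(L_prev[0]) + 1
    candidates = []
    # Step 1: self-join -- combine pairs that agree on their first k-2 items.
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Step 2: prune -- every (k-1)-subset of c must be frequent.
                if all(s in prev for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

L2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
print(generate_candidates(L2))  # [('A', 'B', 'C')]
```

The join of ("B", "C") and ("B", "D") produces ("B", "C", "D"), but its subset ("C", "D") is not in L2, so pruning discards it.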

2.2.2      Apriori-like algorithms

There are some algorithms based on Apriori, trying to improve it or make it suitable for
some certain conditions.

AprioriTID and AprioriHybrid            AprioriTID [3] reads the raw database to count the
support of candidate itemsets only once, in the first pass. In later passes, it uses an encoding
of the candidate itemsets to count the support. AprioriTID saves much reading effort,
since the size of the encoding can become much smaller than the raw database. However,
AprioriTID is slower than Apriori in the earlier passes, since it uses more memory. The two
can therefore be combined into AprioriHybrid [3]: Apriori is used in the initial passes, and
the algorithm switches to AprioriTID when the encoding fits in memory. The switch,
however, involves a cost.

SETM SETM [10] is designed to use general query languages such as SQL to mine
association rules from large datasets in relational databases.

DIC DIC [4], Dynamic Itemset Counting, can begin counting the k-itemsets at any
appropriate point instead of always at the beginning of a pass, and stops counting an
itemset once it has been counted over all the transactions. It uses a prefix-tree to organize
the itemsets being counted.

Partition       Unlike Apriori, which counts the support of all candidates to determine
the frequent k-itemsets, Partition [12] intersects the tidlists of the (k − 1)-candidates to
generate the tidlists of the frequent k-itemsets. Since these intermediate results may take
too much physical memory, Partition splits the raw dataset into several chunks and
performs an extra scan to obtain the globally frequent itemsets.
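The tidlist idea can be illustrated with a small sketch: the support of an itemset is the size of the intersection of its items' tidlists. The tidlists below are illustrative, and Partition's chunking and second scan are omitted:

```python
# Sketch of the tidlist representation used by Partition (and later Eclat):
# each item maps to the set of transaction ids (tids) containing it, and the
# support count of an itemset is the size of the intersection of its tidlists.
# The dataset is illustrative.

tidlists = {
    "A": {1, 2, 3, 5},
    "B": {1, 2, 4},
    "C": {2, 3, 4, 5},
}

def tidlist(itemset):
    """Intersect the tidlists of all items; the result is the set of
    transactions containing the whole itemset."""
    return set.intersection(*(tidlists[i] for i in itemset))

print(len(tidlist({"A", "B"})))       # support count of {A, B}: 2
print(len(tidlist({"A", "B", "C"})))  # support count of {A, B, C}: 1
```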

2.2.3      Other algorithms

All the algorithms introduced above traverse the search space using breadth-first
search [9]. There are also some algorithms using depth-first search (DFS).

FP-growth         FP-growth [8] uses the frequent pattern tree (FP-tree), an extended
prefix-tree structure that stores a highly condensed representation of the transaction data,
thus saving the costly database scans in the subsequent mining process.

Eclat Eclat [14] combines depth-first search with tidlist intersections, clustering
itemsets using equivalence classes or maximal hypergraph cliques and then generating the
true frequent itemsets using bottom-up, top-down or hybrid lattice traversal.
       One important observation about all these algorithms is that, although they employ
different strategies, their runtime behaviors are quite similar; none of them fundamentally
beats the others [9]. We will focus on parallel mining of association rules based on
Apriori.

3       Parallel algorithms
There are two dominant approaches for parallel mining of association rules: distributed
memory and shared memory systems. In a distributed memory system, each processor has
a private memory, while in a shared memory system, all processors access a common memory.

3.1      Parallel mining association rules on distributed memory systems
In a distributed memory system, each processor has its own local memory, which can be
accessed directly only by that processor, and message passing is used for communication
among processors. Thus, for parallel mining of association rules on a distributed memory
system, we must consider the trade-offs between computation, communication, memory
usage, synchronization and the use of problem-specific information.

3.1.1      Apriori based

In [1] three algorithms are introduced, namely, Count Distribution, Data Distribution and
Candidate Distribution.

Count Distribution         In the Count Distribution algorithm, the raw dataset is distributed
across the disks of all processors. Each processor generates the entire set of candidate
k-itemsets from its own frequent (k − 1)-itemsets, and then performs a sum reduction to
obtain the global counts by exchanging local counts with all other processors. It can then
prune the candidate sets to obtain the frequent k-itemsets.

  In the first pass (k = 1), each processor Pi dynamically generates its local candidate
  itemset C1 depending on the items actually present in its local data partition Di . The
  candidates counted by different processors may not be identical, so the local counts must
  be exchanged to determine the global C1 .
  For passes k > 1:
  1) Each processor Pi generates the complete Ck , using the complete frequent itemset Lk−1
  created at the end of pass k − 1.
  2) Processor Pi makes a pass over its data partition Di and develops local support counts
  for the candidates in Ck .
  3) Processor Pi exchanges local Ck counts with all other processors to develop the global
  Ck counts. Processors are forced to synchronize in this step.
  4) Each processor Pi now computes Lk from Ck .
  5) Each processor Pi independently makes the decision to terminate or continue to the
  next pass.
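As a rough single-machine sketch (not the actual cluster implementation), one Count Distribution pass can be simulated by letting each "processor" count the full candidate set over its own partition and then summing the local counters, which is the role the message-passing exchange plays on a real distributed memory machine. The partitions, candidates and threshold are illustrative:

```python
from collections import Counter

# Single-machine simulation of one Count Distribution pass: every
# "processor" counts the COMPLETE candidate set over only its own data
# partition; the per-processor counters are then sum-reduced into global
# counts (the step a message-passing exchange performs on a real cluster).

def local_counts(partition, candidates):
    counts = Counter()
    for t in partition:
        for c in candidates:
            if c <= t:
                counts[frozenset(c)] += 1
    return counts

partitions = [  # the transactions held by each of the (here, two) processors
    [{"A", "B"}, {"A", "C"}],
    [{"A", "B", "C"}, {"B", "C"}],
]
candidates = [{"A", "B"}, {"A", "C"}, {"B", "C"}]

global_counts = Counter()
for p in partitions:  # stands in for the all-to-all count exchange
    global_counts += local_counts(p, candidates)

min_support = 2
L = [set(c) for c, n in global_counts.items() if n >= min_support]
print(global_counts[frozenset({"A", "B"})])  # 2
```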

Data Distribution      One of the disadvantages of Count Distribution is that it does
not exploit the aggregate memory of the system effectively, since the number of candidates
that can be counted in one pass is determined by the memory size of each processor. In the
Data Distribution algorithm, each processor counts mutually exclusive candidates, so it
better exploits the total system memory. However, it is a communication-heavy algorithm,
since it requires each processor to broadcast its local data to all other processors in
every pass. According to the experiments in [1], the Data Distribution algorithm performs
poorly when compared to Count Distribution.

Candidate Distribution The Candidate Distribution algorithm uses Count Distribution
or Data Distribution until some pass l. In pass l, the Candidate Distribution algorithm
partitions the candidates so that each processor can generate disjoint candidates
independently of the other processors. Thus each processor can work on a unique set of
candidates without having to repeatedly broadcast the entire dataset. However, this
algorithm also performs worse than Count Distribution, because it pays the cost of
redistributing the dataset while scanning the local dataset partition repeatedly.

3.1.2   Other algorithms

Some other algorithms can be found in [11], [7], [5] and [6].

3.2     Parallel mining association rules on shared memory systems
3.2.1    Shared memory Apriori-like algorithm

Recall that the sequential Apriori algorithm has two main steps: candidate generation and
support counting.
    Let Ck denote the candidate itemsets of the kth pass, and Lk the frequent itemsets
of the kth pass. Ck is generated by joining Lk−1 with itself; the algorithm then eliminates
all infrequent itemsets from Ck to obtain Lk .
    Ck and Lk are kept lexicographically sorted. Lk−1 can be partitioned into equivalence
classes according to the common (k − 2)-prefixes of its itemsets and P, the number of
processors, so that the P classes can be processed on different processors simultaneously.
The partitioning also benefits the pruning step: instead of checking all k of a candidate's
(k − 1)-subsets, we now only need to check n − (k − 2) of them.
    The remaining issue is load balancing. We could simply partition the classes in order,
but that suffers from load imbalance; interleaved partitioning or bitonic partitioning can
solve this problem.
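A minimal sketch of the equivalence-class partitioning with interleaved (round-robin) assignment, assuming (k − 1)-itemsets stored as sorted tuples; the itemsets and processor count are illustrative:

```python
from collections import defaultdict

def partition_classes(L_prev, P):
    """Group (k-1)-itemsets by their common first k-2 items (the join key of
    the self-join), then deal the classes round-robin over P processors
    (interleaved partitioning)."""
    classes = defaultdict(list)
    for itemset in sorted(L_prev):
        classes[itemset[:-1]].append(itemset)  # first k-2 items = all but last
    assignment = [[] for _ in range(P)]
    for i, (prefix, members) in enumerate(sorted(classes.items())):
        assignment[i % P].append(members)      # interleaved assignment
    return assignment

L2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")]
print(partition_classes(L2, 2))
# processor 0 gets the "A"-class and the "C"-class, processor 1 the "B"-class
```

Each processor can then run the self-join independently within its own classes, since candidates are only ever generated from itemsets sharing a prefix.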

3.2.2    Multiple Local Parallel Tree (MLPT)

Noting that the Apriori algorithm suffers from I/O overhead, MLPT scans the database
only twice, reducing I/O time. MLPT first scans the database to generate the ordered
frequent item list, which will be used to build a tree. In this step, each processor is
allocated the same number of transactions and computes the count of each item locally;
the global counts are then obtained by splitting the item list evenly according to the
number of processors, with each processor computing the global count for the items
allocated to it.
    In the second database scan, each processor is again assigned the same number of
transactions and, according to the item list generated in step one, builds a local FP-tree
starting with a null node.
    Finally, a bottom-up traversal is used to mine the association rules [13].

4     Our approach
Having surveyed these algorithms, this paper mainly focuses on distributed memory
parallelism. We will implement the Apriori-based Count Distribution algorithm on our
CGM1 system. This cluster is configured with 32 nodes, and each node consists of two
processors which share the same resources within the node.
    We choose the Count Distribution algorithm because it is simple to implement and
requires the least communication among nodes. Because it splits the transactions based
on the number of processors, it may suffer from an imbalanced workload. As we have seen
before, it also suffers from repeated I/O operations.
   Because in our CGM1 system two processors share the same resources in one node, we
can exploit this feature: in our approach, we also employ the shared memory Apriori
algorithm within one node to compute Ck and Lk . Furthermore, the local counts can also
be computed in parallel within a node.
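As a rough illustration of the intra-node idea (a sketch under illustrative data, not our actual implementation), the candidate set can be split between two threads that share the transactions in memory, mirroring the two processors of a CGM1 node:

```python
import threading
from collections import Counter

# Intra-node sketch: the two processors of a node share memory, so the
# candidate set is split into two DISJOINT slices and each thread counts
# only its own slice over the shared transactions.
# Data and names are illustrative.

transactions = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"}]
candidates = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]

def count_subset(subset, out):
    """Each thread counts only its own disjoint slice of the candidates,
    so the threads never write to the same key."""
    for c in subset:
        out[frozenset(c)] = sum(1 for t in transactions if c <= t)

counts = Counter()
threads = [threading.Thread(target=count_subset,
                            args=(candidates[i::2], counts))  # interleaved slices
           for i in range(2)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(counts[frozenset({"A", "B"})])  # 2
```

In CPython the global interpreter lock prevents a real speedup for this pure-Python loop; the sketch only illustrates how the candidate work is partitioned between the two processors of a node.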

References

 [1] R. Agrawal and J. C. Shafer. Parallel mining of association rules. IEEE Trans. on
     Knowledge and Data Engineering, 8:962–969, 1996.
 [2] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules
     between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors,
     Proceedings of the 1993 ACM SIGMOD International Conference on Management of
     Data, pages 207–216, Washington, D.C., 26–28 1993.
 [3] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association
     rules. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proc. 20th Int.
     Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan Kaufmann, 12–15 1994.
 [4] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset
     counting and implication rules for market basket data. In Joan Peckham, editor,
     SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management
     of Data, May 13-15, 1997, Tucson, Arizona, USA, pages 255–264. ACM Press, 05 1997.
 [5] Cheung, Han, Ng, Fu, and Fu. A fast distributed algorithm for mining association
     rules. In PDIS: International Conference on Parallel and Distributed Information Sys-
     tems. IEEE Computer Society Technical Committee on Data Engineering, and ACM
     SIGMOD, 1996.
 [6] David Wai-Lok Cheung and Yongqiao Xiao. Effect of data skewness in parallel mining
     of association rules. In Pacific-Asia Conference on Knowledge Discovery and Data
     Mining, pages 48–60, 1998.
 [7] Eui-Hong Han, George Karypis, and Vipin Kumar. Scalable parallel data mining for
     association rules. pages 277–288, 1997.
 [8] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate
     generation. In Weidong Chen, Jeffrey Naughton, and Philip A. Bernstein, editors,
     2000 ACM SIGMOD Intl. Conference on Management of Data, pages 1–12. ACM
     Press, 05 2000.
 [9] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh. Algorithms for associa-
     tion rule mining – a general survey and comparison. SIGKDD Explorations, 2(1):58–64,
     July 2000.
[10] Maurice A. W. Houtsma and Arun N. Swami. Set-oriented mining for association rules
     in relational databases. In Philip S. Yu and Arbee L. P. Chen, editors, Proceedings of
     the Eleventh International Conference on Data Engineering, March 6-10, 1995, Taipei,
     Taiwan, pages 25–33. IEEE Computer Society, 1995.
[11] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. Efficient parallel data mining
     for association rules. In Conference on Information and Knowledge Management
     archive,Proceedings of the fourth international conference on Information and knowl-
     edge management, pages 31–36. ACM Press, 1995.
[12] Ashoka Savasere, Edward Omiecinski, and Shamkant B. Navathe. An efficient algo-
     rithm for mining association rules in large databases. In The VLDB Journal, pages
     432–444, 1995.

[13] O. Zaïane, M. El-Hajj, and P. Lu. Fast parallel association rule mining without candi-
     dacy generation, 2001.

[14] Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, and Wei Li.
     New algorithms for fast discovery of association rules. Technical Report TR651, 1997.