Parallel Mining Association Rules in Distributed Memory Systems ∗

Xuehai Wang, Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5, xwang@cs.dal.ca, http://www.cs.dal.ca/∼xwang
Yingbo Miao, Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5, ymiao@cs.dal.ca, http://www.cs.dal.ca/∼ymiao

Abstract

We consider the problem of mining association rules on a distributed memory system, the CGM1 cluster, which has 32 nodes. Furthermore, since each node of the CGM1 system consists of two processors that share the resources of the node, we can exploit this feature by employing a shared memory Apriori algorithm within each node.

1 Introduction

With the development of hardware, especially of large-capacity storage, many organizations have built large databases and collected huge volumes of data. These organizations want to extract useful information from this very large amount of data, and traditional methods are no longer sufficient to handle it. Association rule mining, first proposed by Agrawal, Imielinski and Swami, tries to "find frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, etc." In other words, we want to find relations or dependencies between the occurrence of one item and the occurrence of other items.

Many algorithms in this area, both sequential and parallel, have been proposed. Since association rule mining is dedicated to handling very large amounts of data, time and resource complexity have to be considered carefully; hence parallel algorithms are desirable. In this paper, we explore several algorithms, especially parallel ones, and examine the trade-offs among them. We also implement a specific algorithm both sequentially and in parallel, and measure the speedup of the parallel algorithm compared to the sequential one.
∗ The course project for CSCI 6702, Parallel Computing.

The parallel algorithm will be implemented on our CGM1 cluster, which has 32 nodes with 2 processors on each node.

The organization of the rest of the paper is as follows. Section 2 gives a brief review of the problem of mining association rules and some sequential algorithms. Section 3 describes some parallel algorithms. Section 4 presents our approach.

2 Overview of association rules

2.1 Association Rules

The association rule problem was introduced in [2]. It focuses on discovering relationships among items in a transactional database. Association rules can be defined formally as follows. Let I = {i1, i2, . . . , im} be a set of items. Let D be a set of transactions, where each transaction T is a set of items with T ⊆ I. Given a set X with X ⊆ I, we say T contains X if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. The rule X ⇒ Y has confidence c in D if c% of the transactions in D that contain X also contain Y.

2.2 Sequential Algorithms

2.2.1 Apriori

One of the most popular algorithms for finding association rules is Apriori [3]. Apriori uses frequent k-itemsets, i.e. frequent sets of k items from the item set I, to generate candidate (k+1)-itemsets. The main idea of Apriori is that any subset of a frequent itemset must itself be frequent, so an itemset with an infrequent subset never needs to be generated or tested. The pseudocode of Apriori is as follows:

    L1 ← frequent 1-itemsets    // Lk is the set of frequent k-itemsets
    k ← 2
    while Lk−1 ≠ ∅ do
        generate Ck from Lk−1    // Ck is the set of candidate k-itemsets
        for all transactions t ∈ D do
            increment the count of every candidate in Ck that is contained in t
        end for
        Lk ← all candidates in Ck with minimum support
        k ← k + 1
    end while
    return ∪k Lk

    {Subroutine: generating Ck from Lk−1}

    {Step 1: self-joining Lk−1}
    insert into Ck
    select p.item1, p.item2, . . ., p.itemk−1, q.itemk−1
    from Lk−1 p, Lk−1 q
    where p.item1 = q.item1, . . ., p.itemk−2 = q.itemk−2, p.itemk−1 < q.itemk−1

    {Step 2: pruning}
    for all itemsets c in Ck do
        for all (k−1)-subsets s of c do
            if s ∉ Lk−1 then
                delete c from Ck
            end if
        end for
    end for

2.2.2 Apriori-like algorithms

Several algorithms are based on Apriori, trying to improve it or adapt it to particular conditions.

AprioriTID and AprioriHybrid. AprioriTID [3] uses the raw database to count the support of candidate itemsets only once, in the first pass. In later passes, it counts support using an encoding of the candidate itemsets. AprioriTID saves much reading effort, since the encoding can become much smaller than the raw database. However, AprioriTID is slower than Apriori in the earlier passes, since it uses more memory. The two can therefore be combined into AprioriHybrid [3]: Apriori is used in the initial passes, and the algorithm switches to AprioriTID once the encoding fits in memory. The switch itself, however, has a cost.

SETM. SETM [10] uses general query languages such as SQL to mine association rules from large datasets stored in relational databases.

DIC. DIC [4], Dynamic Itemset Counting, can begin counting k-itemsets at any appropriate point instead of always at the beginning of a pass, and stops counting an itemset once it has been counted over all transactions. It stores the candidate itemsets in a prefix-tree.

Partition. Unlike Apriori, which counts the support of all candidates against the database to determine the frequent k-itemsets, Partition [12] uses the tidlists of the (k − 1)-itemsets to generate the tidlists of the k-itemsets. Since these intermediate results may take too much physical memory, Partition splits the raw dataset into several chunks, mines each chunk separately, and performs one extra scan to determine the globally frequent itemsets.
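As a concrete illustration of the candidate generation and pruning steps of Apriori (Section 2.2.1), the following is a minimal Python sketch; the small transaction dataset and the minimum support threshold are invented for illustration:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: return all frequent itemsets as frozensets."""
    n = len(transactions)
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    L = {s for s, c in counts.items() if c / n >= min_support}
    frequent = set(L)
    k = 2
    while L:
        # Step 1: self-join L(k-1) -- unions of pairs that yield k items
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Step 2: prune candidates that have an infrequent (k-1)-subset
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Count support of the surviving candidates in one database pass
        counts = {c: sum(1 for t in transactions if c.issubset(t)) for c in C}
        L = {c for c, cnt in counts.items() if cnt / n >= min_support}
        frequent |= L
        k += 1
    return frequent

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
frequent_sets = apriori(transactions, min_support=0.5)
print(frequent_sets)
```

With this data every single item and every pair is frequent at 50% support, but {a, b, c} occurs in only one of the four transactions and is discarded.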
2.2.3 Other algorithms

All the algorithms introduced above traverse the search space using breadth-first search [9]. There are also algorithms that use depth-first search (DFS).

FP-growth. FP-growth [8] uses the frequent pattern tree (FP-tree), an extended prefix-tree structure that stores a highly condensed representation of the transaction data, and thus avoids the costly database scans in the subsequent mining process.

Eclat. Eclat [14] combines depth-first search with tidlist intersections. It clusters itemsets using equivalence classes or maximal hypergraph cliques, and then generates the true frequent itemsets using bottom-up, top-down or hybrid lattice traversal.

An important observation about all of these algorithms is that, although they employ different strategies, their runtime behaviors are quite similar; none of them fundamentally beats the others [9]. We will focus on parallel association rule mining based on Apriori.

3 Parallel algorithms

The two dominant approaches to parallel association rule mining target distributed memory and shared memory systems. In a distributed memory system, each processor has a private memory, while in a shared memory system, all processors access a common memory.

3.1 Parallel mining of association rules on distributed memory systems

In a distributed memory system, each processor has its own local memory, which can be accessed directly only by that processor, and message passing is used for communication among processors. Thus, for parallel mining of association rules on a distributed memory system, we must consider the trade-offs between computation, communication, memory usage, synchronization and the use of problem-specific information [1].

3.1.1 Apriori-based algorithms

In [1], three algorithms are introduced: Count Distribution, Data Distribution and Candidate Distribution.
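For contrast with the breadth-first, Apriori-style algorithms, the tidlist intersection at the core of Eclat (Section 2.2.3) can be sketched as follows; the vertical dataset and the minimum count are invented for illustration:

```python
def eclat(prefix, items, min_count, out):
    """Depth-first search over itemsets. `items` is a list of
    (item, tidset) pairs, each tidset holding the ids of the
    transactions that contain `prefix` plus that item."""
    while items:
        item, tids = items.pop()
        if len(tids) >= min_count:
            itemset = prefix | {item}
            out[frozenset(itemset)] = len(tids)
            # Extend the prefix: intersect this tidlist with the tidlists
            # of the remaining items, keeping only frequent extensions
            suffix = [(other, tids & otids)
                      for other, otids in items
                      if len(tids & otids) >= min_count]
            eclat(itemset, suffix, min_count, out)
    return out

# Vertical layout: item -> set of ids of the transactions containing it
vertical = {"a": {0, 1, 2}, "b": {0, 1, 3}, "c": {0, 2, 3}}
frequent = eclat(set(), list(vertical.items()), min_count=2, out={})
```

The support of an extended itemset is simply the size of the intersected tidset, so no further database scans are needed once the vertical layout is built.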
Count Distribution. In the Count Distribution algorithm, the raw dataset is distributed across the disks of all processors. Each processor generates the entire candidate k-itemset Ck from its own frequent (k − 1)-itemsets, and then performs a sum reduction, exchanging local counts with all other processors, to obtain the global counts. It can then prune the candidate set to obtain the frequent k-itemsets.

In the first pass (k = 1), each processor Pi dynamically generates its local candidate itemset C1 depending on the items actually present in its local data partition Di. The candidates counted by different processors may not be identical, so the local counts must be exchanged to determine the global C1. For passes k > 1:

1) Each processor Pi generates the complete Ck, using the complete frequent itemset Lk−1 created at the end of pass k − 1.
2) Processor Pi makes a pass over its data partition Di and develops local support counts for the candidates in Ck.
3) Processor Pi exchanges its local Ck counts with all other processors to develop the global Ck counts. Processors are forced to synchronize in this step.
4) Each processor Pi now computes Lk from Ck.
5) Each processor Pi independently decides whether to terminate or continue to the next pass.

Data Distribution. One disadvantage of Count Distribution is that it does not exploit the aggregate memory of the system effectively, since the number of candidates that can be counted in one pass is determined by the memory size of a single processor. In the Data Distribution algorithm, each processor counts mutually exclusive candidates, so it makes better use of the total system memory. However, it is a communication-heavy algorithm, since it requires each processor to broadcast its local data to all other processors in every pass. According to the experiments in [1], Data Distribution performs poorly compared to Count Distribution.
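The pass structure of Count Distribution can be simulated on a single machine. In the sketch below, a loop over partitions stands in for the P processors, and merging the per-partition counters stands in for the sum-reduction message exchange of step 3; the data and the two-way partitioning are invented:

```python
from itertools import combinations
from collections import Counter

def count_distribution_pass(partitions, candidates, min_count):
    """One pass of Count Distribution: every 'processor' counts the
    complete candidate set on its own data partition (step 2), then
    the local counts are sum-reduced into global counts (step 3)."""
    global_counts = Counter()
    for partition in partitions:       # each iteration = one processor
        local = Counter()
        for t in partition:
            for c in candidates:
                if c <= t:
                    local[c] += 1
        global_counts += local         # stands in for the all-reduce
    # Step 4: every processor derives the same frequent itemset Lk
    return {c for c in candidates if global_counts[c] >= min_count}

data = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}]
partitions = [data[:2], data[2:]]      # two "processors"
candidates = [frozenset(p) for p in combinations("abc", 2)]
L2 = count_distribution_pass(partitions, candidates, min_count=2)
```

Since only the small count vectors cross the processor boundary, the communication volume per pass is proportional to |Ck|, not to the dataset size.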
Candidate Distribution. The Candidate Distribution algorithm uses Count Distribution or Data Distribution until some pass l. In pass l, it partitions the candidates so that each processor can generate disjoint candidates independently of the other processors. Thus each processor can work on a unique set of candidates without the entire dataset having to be broadcast repeatedly. However, this algorithm also performs worse than Count Distribution, because it pays the cost of redistributing the dataset while scanning the local dataset partition repeatedly.

3.1.2 Other algorithms

Some other distributed memory algorithms can be found in [11], [7], [5] and [6].

3.2 Parallel mining of association rules on shared memory systems

3.2.1 Shared memory Apriori-like algorithm

Recall that the sequential Apriori algorithm has two main steps, candidate generation and support counting. Let Ck denote the candidate itemsets of the kth pass, and Lk the frequent itemsets of the kth pass. Ck can be generated by joining Lk−1 with itself; the infrequent itemsets in Ck are then eliminated to obtain Lk. Ck and Lk are kept lexicographically sorted. Lk can be partitioned into equivalence classes according to the common (k − 2)-prefixes of its itemsets, and, given P processors, those classes can be processed on different processors simultaneously. The partitioning also benefits the pruning step: instead of checking all k of the (k − 1)-subsets of a candidate, we now need to check only n − (k − 2) subsets.

The remaining problem is load balancing. We could simply partition the classes in order, but this suffers from load imbalance; interleaved partitioning or bitonic partitioning can solve this problem.

3.2.2 Multiple Local Parallel Trees (MLPT)

Since the Apriori algorithm suffers from heavy I/O, MLPT scans the database only twice, reducing the I/O time. MLPT first scans the database to generate the ordered frequent items, which are then used to build a tree.
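The prefix-based equivalence-class partitioning of the shared memory algorithm (Section 3.2.1) can be sketched as follows; the frequent 2-itemsets and the processor count are invented, and the round-robin deal illustrates the interleaved partitioning mentioned there:

```python
from collections import defaultdict

def prefix_classes(L_prev):
    """Partition the lexicographically sorted frequent (k-1)-itemsets
    into equivalence classes by their common prefix (all items but the
    last); only itemsets in the same class join to form k-candidates."""
    classes = defaultdict(list)
    for itemset in L_prev:                 # itemset is a sorted tuple
        classes[itemset[:-1]].append(itemset)
    return list(classes.values())

def interleaved(classes, num_procs):
    """Deal the classes out round-robin. Plain blocked partitioning
    can leave some processors with far more work, since class sizes
    are very uneven; interleaving spreads the large classes around."""
    return [classes[p::num_procs] for p in range(num_procs)]

L2 = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
classes = prefix_classes(L2)    # three classes: a-prefix, b-prefix, c-prefix
work = interleaved(classes, 2)  # proc 0 gets classes 0 and 2, proc 1 gets 1
```

Because each class can be joined and pruned without looking at any other class, the processors need no locking while generating their share of Ck.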
In the first scan, each processor is allocated the same number of transactions and computes the count of each item locally. The item list is then split evenly among the processors, and each processor computes the global counts for the items allocated to it. For the second database scan, each processor is again assigned the same number of transactions and, following the ordered item list generated in the first scan, builds a local FP-tree starting from a null root node. A bottom-up traversal of the trees is then used to mine the association rules [13].

4 Our approach

Having surveyed these algorithms, this paper focuses mainly on distributed memory parallelism. We will implement the Apriori-based Count Distribution algorithm on our CGM1 system. This cluster is configured with 32 nodes, and each node consists of two processors that share the resources of the node.

We choose the Count Distribution algorithm because it is simple to implement and requires the least communication among nodes. Because it splits the transactions according to the number of processors, it may suffer from workload imbalance, and, as we have seen, it also suffers from repeated I/O operations. Since in our CGM1 system two processors share the resources of one node, we can exploit this feature: in our approach, we also employ the shared memory Apriori algorithm within a node to compute Ck and Lk. Furthermore, the local counts can also be computed in parallel within a node.

References

[1] R. Agrawal and J. C. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8:962–969, 1996.

[2] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, Washington, D.C., 26–28 1993.

[3] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proc. 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan Kaufmann, 12–15 1994.

[4] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data. In Joan Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13–15, 1997, Tucson, Arizona, USA, pages 255–264. ACM Press, May 1997.

[5] Cheung, Han, Ng, Fu, and Fu. A fast distributed algorithm for mining association rules. In PDIS: International Conference on Parallel and Distributed Information Systems. IEEE Computer Society Technical Committee on Data Engineering, and ACM SIGMOD, 1996.

[6] David Wai-Lok Cheung and Yongqiao Xiao. Effect of data skewness in parallel mining of association rules. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 48–60, 1998.

[7] Eui-Hong Han, George Karypis, and Vipin Kumar. Scalable parallel data mining for association rules. Pages 277–288, 1997.

[8] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In Weidong Chen, Jeffrey Naughton, and Philip A. Bernstein, editors, 2000 ACM SIGMOD Intl. Conference on Management of Data, pages 1–12. ACM Press, May 2000.

[9] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh. Algorithms for association rule mining – a general survey and comparison. SIGKDD Explorations, 2(1):58–64, July 2000.

[10] Maurice A. W. Houtsma and Arun N. Swami. Set-oriented mining for association rules in relational databases. In Philip S. Yu and Arbee L. P. Chen, editors, Proceedings of the Eleventh International Conference on Data Engineering, March 6–10, 1995, Taipei, Taiwan, pages 25–33. IEEE Computer Society, 1995.

[11] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. Efficient parallel data mining for association rules. In Proceedings of the Fourth International Conference on Information and Knowledge Management, pages 31–36. ACM Press, 1995.

[12] Ashoka Savasere, Edward Omiecinski, and Shamkant B. Navathe. An efficient algorithm for mining association rules in large databases. In The VLDB Journal, pages 432–444, 1995.

[13] O. Zaïane, M. El-Hajj, and P. Lu. Fast parallel association rule mining without candidacy generation, 2001.

[14] Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, and Wei Li. New algorithms for fast discovery of association rules. Technical Report TR651, 1997.
