IJCSIS is an open access publishing venue for research in general computer science and information security.
Target Audience: IT academics, university IT faculties; industry IT departments; government departments; the mobile industry and computing industry.
Coverage includes: security infrastructures, network security, Internet security, content protection, cryptography, steganography and formal methods in information security; computer science, computer applications, multimedia systems, software, information systems, intelligent systems, web services, data mining, wireless communication, networking and technologies, innovation technology and management.
The average paper acceptance rate for IJCSIS issues is kept at 25-30%, with the aim of providing selective research work of quality in the areas of computer science and engineering. We thank the authors for their contributions to the September 2010 issue and are grateful to the experienced team of reviewers for providing valuable comments.
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 6, September 2010
http://sites.google.com/site/ijcsis/ ISSN 1947-5500

Scaling Apriori for Association Rule Mining using Mutual Information Based Entropy

S. Prakash, Research Scholar, Sasurie College of Engineering, Vijayamangalam, Erode (DT), Tamilnadu, India. Ph. 09942650818. Mail: prakash_ant2002@yahoo.co.in
Dr. R.M.S. Parvathi, M.E. (CSE), Ph.D., Principal, Sengunthar College of Engg. for Women, Tiruchengode, Tamilnadu, India. rmsparvathi@india.com

Abstract - Extracting information from large datasets is a well-studied research problem. As larger and larger data sets become available (e.g., customer behavior data from organizations such as Wal-Mart), it is becoming essential to find better ways to extract relations (inferences) from them. This thesis proposes an improved Apriori algorithm that minimizes the number of candidate sets generated during association rule mining by evaluating the quantitative information associated with each item that occurs in a transaction, information that is usually discarded because traditional association rules focus only on qualitative correlations. The proposed approach reduces not only the number of item sets generated but also the overall execution time of the algorithm. Any valued attribute is treated as quantitative and is used to derive quantitative association rules, which usually increases the rules' information content. Transaction reduction is achieved by discarding, in subsequent scans, transactions that do not contain any frequent item set, which in turn reduces the overall execution time. Dynamic item set counting is done by adding new candidate item sets only when all of their subsets are estimated to be frequent. The frequent item ranges are the basis for generating higher-order item ranges using the Apriori algorithm. During each iteration of the algorithm, the frequent sets from the previous iteration are used to generate the candidate sets, whose support is then checked against the threshold. The set of candidate sets found is pruned by a strategy that discards sets containing infrequent subsets. The thesis evaluates the scalability of the algorithm by considering transaction time, the number of item sets used in the transaction, and memory utilization. Quantitative association rules can be used in several domains where the traditional approach is employed; the unique requirement for such use is a semantic connection between the components of the item-value pairs. The proposal uses mutual information based on entropy to generate association rules from non-biological datasets.

Keywords - Apriori, Quantitative attribute, Entropy

I. INTRODUCTION

Data mining, also known as knowledge discovery in databases, has been recognized as a new area of database research. The problem of discovering association rules was introduced later. Given a set of transactions, where each transaction is a set of items, an association rule is an expression of the form X => Y, where X and Y are sets of items. The problem is to find all association rules that satisfy user-specified minimum support and minimum confidence constraints.

Conceptually, this problem can be viewed as finding associations between the "1" values in a relational table where all the attributes are Boolean. The table has an attribute corresponding to each item and a record corresponding to each transaction. The value of an attribute for a given record is "1" if the item corresponding to the attribute is present in the transaction corresponding to the record, and "0" otherwise.

Relational tables in most business and scientific domains have richer attribute types. Attributes can be quantitative (e.g., age, income) or categorical (e.g., zip code, make of car). Boolean attributes can be considered a special case of categorical attributes. This thesis defines the problem of mining association rules over quantitative attributes in large relational tables and presents techniques for discovering such rules. This is referred to as the Quantitative Association Rules problem.

The problem of mining association rules in categorical data presented in customer transactions was introduced by Agrawal, Imielinski and Swami [1][2]. This work provided the basic idea for several investigation efforts [4], resulting in descriptions of how to extend the original concepts and how to increase the performance of the related algorithms.

The original problem of mining association rules was formulated as how to find rules of the form set1 => set2. Such a rule is supposed to denote affinity or correlation among the two sets containing nominal or ordinal data items. More specifically, such an association rule should translate the following meaning: customers that buy the products in set1 also buy the products in set2. The statistical basis is represented in the form of minimum support and confidence measures of these rules with respect to the set of customer transactions.

The original problem as proposed by Agrawal et al. [2] was extended in several directions, such as adding or replacing the confidence and support by other measures, filtering the rules during or after generation, or including quantitative attributes. Srikant and Agrawal describe a new approach where quantitative data can be treated as categorical. This is very important since otherwise the customer transaction information is discarded.

Whenever an extension is proposed, it must be checked in terms of its performance. The algorithm's efficiency is linked to the size of the dataset that is amenable to be treated. Therefore it is crucial to have efficient algorithms that enable us to examine and extract valuable decision-making information from ever larger databases.

This thesis presents an algorithm that can be used in the context of several of the extensions provided in the literature while at the same time preserving its performance. The approach in our algorithm is to explore multidimensional properties of the data (provided such properties are present), allowing this additional information to be combined in a very efficient pruning phase. This results in a very flexible and efficient algorithm that was used with success in several experiments on quantitative databases, with performance measured on memory utilization during the transactional pruning of the record sets.

II. LITERATURE REVIEW

Various proposals for mining association rules from transaction data have been presented in different contexts. Some of these proposals are constraint-based in the sense that all rules must fulfill a predefined set of conditions, such as support and confidence [6,7,8]. A second class identifies just the most interesting (or optimal) rules in accordance with some interestingness metric, including confidence, support, gain, chi-squared value, gini, entropy gain, laplace, lift, and conviction [9,6]. However, the main goal common to all of these algorithms is to reduce the number of generated rules.

A) Existing Scheme

The thesis extends the first group of techniques, since it neither relaxes any set of conditions nor employs an interestingness criterion to sort the generated rules. In this context, many algorithms for efficient generation of frequent item sets have been proposed in the literature since the problem was first introduced in [10]. The DHP algorithm [11] uses a hash table in pass k to perform efficient pruning of (k+1)-item sets. The Partition algorithm minimizes I/O by scanning the dataset only twice. In the first pass it generates the set of all potentially frequent item sets, and in the second pass the support for all of these is measured. The above algorithms are all specialized techniques which do not use general dataset operations. Algorithms using only general-purpose DBMS systems and relational algebra operations have also been proposed [9,10].

A few other works try to solve this mining problem for quantitative attributes. In [5], the authors proposed an algorithm which is an adaptation of the Apriori algorithm for quantitative attributes. It partitions each quantitative attribute into consecutive intervals using equi-depth bins. Then adjacent intervals may be combined to form new intervals in a controlled manner. From these intervals, frequent item sets (cf. large item sets in the Apriori algorithm) are then identified, and association rules are generated accordingly. The problem with this approach is that the number of possible interval combinations grows exponentially as the number of quantitative attributes increases, so it is not easy to extend the algorithm to higher dimensional cases. Besides, the set of rules generated may consist of redundant rules, for which the authors present a "greater-than-expected-value" interest measure to identify the interesting ones.

Some other efforts exploit the quantitative information present in transactions for generating association rules [12]. In [5], the quantitative rules are generated by discretizing the occurrence values of an attribute into fixed-length intervals and applying the standard Apriori algorithm for generating association rules. However, although simple, the rules generated by this approach may not be intuitive, mainly when there are semantic intervals that do not match the partition employed.

Other authors [5] proposed novel solutions that minimize this problem by considering the distance among item quantities for delimiting the intervals, that is, their "physical" placement, but not the frequency of occurrence as a relevance metric.

To visualize the information in the massive tables of quantitative measurements we plan to use clustering and mutual information based on entropy. Clustering is a long-studied technique used to extract this kind of information from customer behavior data sets. This follows from the fact that related customers, purchasing through word of mouth, have similar patterns of customer behavior. Clustering groups records that are "similar" into the same group. It suffers from two major defects: it does not tell you how two customer buying behaviors/clusters are exactly related, and it gives you a global picture, so any relation at a local level can be lost.

B) Proposed Scheme

The proposed scheme comprises two phases. The first phase of the thesis concerns quantitative association rule mining with an enhancement of the Apriori algorithm. The enhancement of Apriori is done by increasing the efficiency of the candidate pruning phase, reducing the number of candidates that are passed on for further verification. The proposed algorithm uses quantitative information to estimate more precisely the overlap in terms of transactions. The basic elements considered in the development of the algorithm are the number of transactions, the average size of a transaction, the average size of the maximal large item sets, the number of items, and the distribution of occurrences of large item sets.

The second phase of the thesis claims an improvement over Apriori by considering memory consumption for data transactions. This part of the algorithm generates all candidates based on 2-frequent item sets on a sorted dataset and already generates all frequent item sets that can no longer be supported by transactions that still have to be processed. Thus the new algorithm no longer has to maintain the covers of all past item sets in main memory. In this way, the proposed level-wise algorithm accesses the dataset less often than Apriori and requires less memory because of the utilization of additional upward closure properties.
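The level-wise generate-prune-count loop that underlies the scheme above can be sketched as follows. This is a minimal illustration of the classic Apriori pattern only, not the authors' enhanced implementation; the function name, transaction data, and support threshold are invented for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent item set mining with subset-based candidate pruning."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(item_set):
        # fraction of transactions containing every item of item_set
        return sum(1 for t in transactions if item_set <= t) / n

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    all_frequent = list(frequent)
    k = 2
    while frequent:
        # join step: unions of frequent (k-1)-item sets that yield k-item sets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # prune step: discard any candidate with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # count step: keep only candidates meeting the support threshold
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent.extend(frequent)
        k += 1
    return all_frequent

# hypothetical toy transactions
txns = [{"milk", "bread"}, {"milk", "bread", "butter"},
        {"bread", "butter"}, {"milk", "butter"}]
for s in sorted(apriori(txns, 0.5), key=lambda s: (len(s), sorted(s))):
    print(sorted(s))
```

On this toy data every singleton and every pair reaches the 0.5 support threshold, while the 3-item set does not, illustrating how the prune and count steps cut the search space level by level.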
The second phase of the thesis deals with the reduction of memory utilization during the pruning phase of the transactional execution.

The algorithm for generating quantitative association rules starts by counting the item ranges in the data set, in order to determine the frequent ones. These frequent item ranges are the basis for generating higher-order item ranges using an algorithm similar to Apriori. The size of a transaction is taken to be the number of items that it comprises.

a) Define an item set m as a set of items of size m.
b) Denote the frequent (large) item sets by Fm.
c) Denote the candidate (possibly frequent) item sets by Lm.

An n-range set is a set of n item ranges, and each m-item set has an n-range set that stores the quantitative rules of the item set. During each iteration of the algorithm, the system uses the frequent sets from the previous iteration to generate the candidate sets and checks whether their support is above the threshold. The set of candidate sets found is pruned by a strategy that discards sets which contain infrequent subsets. The algorithm ends when there are no more candidate sets to be verified.

C) Mutual Information Based Entropy

The mutual information I(X; Y) measures how much (on average) the realization of the random variable Y tells us about the realization of X, i.e., by how much the entropy of X is reduced if we know the realization of Y:

I(X; Y) = H(X) - H(X|Y)

For example, the mutual information between a cue and the environment indicates how much, on average, the cue tells us about the environment. The mutual information between a spike train and a sensory input tells us how much the spike train tells us about the sensory input. If the cue is perfectly informative, that is, if it tells us everything about the environment and nothing extra, then the mutual information between cue and environment is simply the entropy of the environment:

I(X; Y) = H(X) - H(X|Y) = H(X) - H(X|X) = H(X).

In other words, the mutual information between a random variable and itself is simply its entropy: I(X; X) = H(X). Surprisingly, mutual information is symmetric; X tells us exactly as much about Y as Y tells us about X.

III. QUANTITATIVE ASSOCIATION RULE MINING - MUTUAL INFORMATION BASED ENTROPY

This work proposes the use of mutual information based on entropy for generating quantitative association rules. Apart from the usual positive correlations between the customers, this criterion would also discover association rules with negative correlations in the data sets. It is expected to find results of the form attrib1 / attrib2 -> ^attrib3, which can be interpreted as follows: attrib1 and attrib2 are co-expressed and have a silencing effect on attrib3. The results from our experiments are then compared to those obtained from clustering.

First, various parameters (like support, support fraction, and significance level) of the auto performance dataset are tuned. This was necessary because, even with binary data, 2468 attributes may lead to power(2, 2468) relations (which the software was not designed to handle). Here, for the problem under consideration, the auto characteristics are the attributes, as we need to find relationships among them. To overcome this problem another approach was used, in which data attributes were already known to be related, using the results obtained from clustering. This decreases the number of attributes to manageable levels for the program. The proposed work used the approach above to find the relationships (positive and negative) among the attributes.

Algorithm Steps
a. Find all frequent item sets (i.e., those that satisfy minimum support).
b. Generate strong association rules from the frequent item sets (each rule must satisfy minimum support and minimum confidence).
c. Identify the quantitative elements.
d. Sort the item sets based on the frequency and quantitative elements.
e. Merge the more associated rules of item pairs.
f. Discard the infrequent item-value pairs.
g. Iterate steps c to f till the required mining results are achieved.

Let I = {i1, i2, ..., im} be a set of items, and T a set of transactions, each a subset of I. An association rule is an implication of the form A => B, where A and B are non-intersecting item sets. The support of A => B is the percentage of the transactions that contain both A and B. The confidence of A => B is the percentage of transactions containing A that also contain B (interpreted as P(B|A)). The occurrence frequency of an item set is the number of transactions that contain the item set.

IV. IMPLEMENTATION OF QUANTITATIVE APRIORI

The function op is an associative and commutative function. Thus, the iterations of the foreach loop can be performed in any order. The data structure Reduc is referred to as the reduction object. The main correctness challenge in parallelizing a loop like this on a shared-memory machine arises because of possible race conditions when multiple processors update the same element of the reduction object. The element of the reduction object that is updated in a loop iteration is determined only as a result of the processing. In the Apriori association mining algorithm, the data item read needs to be matched against all candidates to determine the set of candidates whose counts will be incremented. The major factors that make these loops challenging to execute efficiently and correctly are as follows:

It is not possible to statically partition the reduction object so that different processors update disjoint portions of the collection. Thus, race conditions must be avoided at runtime.

The execution time of the function process can be a significant part of the execution time of an iteration of the loop. Thus, runtime preprocessing or scheduling techniques cannot be applied.

The updates to the reduction object are fine-grained. The reduction object comprises a large number of elements that take only a few bytes, and the foreach loop comprises a large number of iterations, each of which may take only a small number of cycles.

{* Outer Sequential Loop *}
While() {
    {* Reduction Loop *}
    Foreach(element e) {
        (i, val) = process(e);
        Reduc(i) = Reduc(i) op val;
    }
}

Fig 1: Pseudo code

The consumer behavior auto databases were obtained from the UCI Machine Learning Repository. The data obtained was about CPU performance and automobile mileage. The data was discretized into binary values. For these data sets the discretization was done in accordance with the interpretation required. This discretization was done automatically using software written for the purpose, which also formatted the data into the format required by the program. A finer level of discretization (or support for the real values) would have been more appropriate, but the approach used also gave many useful results.

V. EXPERIMENTAL RESULTS FROM APRIORI

The process of executing quantitative association rule mining for the auto manufacturer power evaluation data is given below:

a) Get data (file: auto-mpg.data): 9 attributes, 398 samples.
b) Remove unique attributes (IDs). Here, the car_name attribute has been removed.
c) Remove those samples (5 in total) that contain "?" (missing data) as a value for some of their attributes (so, we are left with 8 attributes and 393 samples).
d) Discretize real-valued attributes based on their average values, computed as (maximum attribute value + minimum attribute value) / 2.
e) Run the program to generate association rules using the mutual information based entropy metric.

The experiment focused on evaluating all quantitative Apriori techniques. Since we were interested in seeing the best performance, we used banking data set samples. We used a 1 GB dataset. A confidence of 90% and a support of 0.5 were used. Execution times using 1, 2, 3, and 4 threads are presented for the processor. With 1 thread, Apriori does not have any significant overheads compared to the sequential version; therefore, this version is used for reporting all speedups. Though the performance of quantitative Apriori is considerably lower than that of Apriori, it is promising for the cases when sufficient memory for supporting full replication may not be available. We consider four support levels: 0.1%, 0.05%, 0.03%, and 0.02%. The execution time efficiency is improved for the quantitative Apriori on frequent item set evaluation with the support count (Graph 1).

Graph 1: Support vs Time on quantitative and qualitative Apriori

The thread execution on the quantitative Apriori and the qualitative Apriori is evaluated for the same data set (Graph 2). Here the initial thread requires more time; however, subsequent threads show the better scalable performance of quantitative Apriori.

Graph 2: Thread vs Time for Apriori execution on rule set generation

By observing the output from the program it is seen that a few relationships between the attributes had high values of mutual information. Namely, the highest MI values were obtained for:

a) displacement and horsepower. Further, by observing the entropy values we may notice that there are very few cars that have small displacement and high horsepower.
b) displacement and weight. Further, by observing the entropy values we may notice that there are very few cars that have large displacement and light weight.
c) cylinders and weight. Further, by observing the entropy values we may notice that there are very few cars that have a small number of cylinders and heavy weight.
d) horsepower and weight. Further, by observing the entropy values we may notice that there are very few cars that have large horsepower but light weight.

VI. CONCLUSION

The thesis has defined a new rule set, namely the informative rule set, that presents prediction sequences equal to those presented by the association rule set using the confidence priority. The informative rule set is significantly smaller than the association rule set, especially when the minimum support is small.

The proposed method has some merit in extracting information from huge data sets by pruning the initial information (to bring it down to manageable levels) and then finding the association rules among the attributes. Further, the approach used to predict the relationships among the silencer, auto power, weight, model year, etc. could be extended to unknown functions.

The proposed scheme has characterized the relationships between the informative rule set and the non-redundant association rule set, and revealed that the informative rule set is a subset of the non-redundant association rule set. The thesis considers the upward closure properties of the informative rule set for the omission of uninformative association rules, and presents a direct algorithm to efficiently generate the informative rule set without generating all frequent item sets. The informative rule set generated in this thesis is significantly smaller than both the association rule set and the non-redundant association rule set for a given dataset, and it can be generated more efficiently than the association rule set. The efficiency improvement results from the fact that the generation of the informative rule set needs fewer candidates and dataset accesses than that of the association rule set, rather than from large memory usage as in some other efficient algorithms.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, December 1993.
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD, Washington, D.C., pages 207-216, May 1993.
[3] R. Miller and Y. Yang. Association rules over interval data. In ACM SIGMOD Conference, Tucson, Arizona, pages 452-461, May 1997.
[4] J. Park, M. Chen, and P. Yu. An effective hash based algorithm for mining associative rules. In ACM SIGMOD Conference, San Jose, CA, pages 175-186, May 1995.
[5] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 1-12, Montreal, Canada, June 1996.
[6] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, December 1993.
[7] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, San Jose, CA, pages 307-328, 1996.
[8] R. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large, dense databases. In 15th International Conference on Data Engineering, Sydney, Australia, pages 188-197, March 1999.
[9] R. Bayardo and R. Agrawal. Mining the most interesting rules. In 5th ACM SIGKDD International Conference on Knowledge Discovery, San Diego, CA, pages 145-154, August 1999.
[10] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD, Washington, D.C., pages 207-216, May 1993.
[11] J. Park, M. Chen, and P. Yu. An effective hash based algorithm for mining associative rules. In ACM SIGMOD Conference, San Jose, CA, pages 175-186, May 1995.
[12] S. Prakash and R.M.S. Parvathi. "Scaling Apriori for Association Rule Mining Efficiency". Proceedings of the Fourth International Conference, Amrutvani College of Engineering, Sangamner, Maharashtra, pp. 29, March 2009.

AUTHORS PROFILE

Dr. R.M.S. Parvathi completed her Ph.D. degree in Computer Science and Engineering in 2005 at Bharathiar University, Tamilnadu, India. Currently she is Principal and Professor, Department of CSE, at Sengunthar College of Engineering for Women, Tamilnadu, India. She has completed 20 years of teaching service. She has published more than 28 articles in international/national journals. She has authored 3 books with reputed publishers. She is guiding 20 research scholars. Her research areas of interest are Software Engineering, Data Mining, Knowledge Engineering, and Object Oriented System Design.

Prof. S. Prakash completed his M.E. (Computer Science and Engineering) at K.S.R. College of Technology, Tamilnadu, India in 2006. He is now doing research in the field of association rule mining algorithms. Currently, he is working as Assistant Professor in the Department of Information Technology, Sasurie College of Engineering, Tamilnadu, India. He has completed 9 years of teaching service.
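As a closing illustration of the entropy identities used in Section II-C, namely I(X; Y) = H(X) - H(X|Y) and I(X; X) = H(X), the following sketch computes mutual information from observed value pairs via the equivalent form I(X; Y) = H(X) + H(Y) - H(X, Y). The function names and the discretized attribute data are invented for the example and are not part of the paper's software.

```python
from collections import Counter
from math import log2

def entropy(probs):
    """Shannon entropy H = -sum p * log2(p), ignoring zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(pairs):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) from observed (x, y) pairs.
    Algebraically identical to H(X) - H(X|Y)."""
    n = len(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    pxy = Counter(pairs)
    hx = entropy([c / n for c in px.values()])
    hy = entropy([c / n for c in py.values()])
    hxy = entropy([c / n for c in pxy.values()])
    return hx + hy - hxy

# hypothetical discretized attribute pairs (e.g. displacement, horsepower)
data = [("low", "low"), ("low", "low"), ("high", "high"), ("high", "high")]
print(mutual_information(data))  # perfectly informative: equals H(X) = 1.0 bit
print(mutual_information([(x, x) for x, _ in data]))  # I(X;X) = H(X) = 1.0
```

When the second attribute determines the first, as in this toy data, the mutual information equals the entropy of the first attribute, matching the perfectly informative cue case discussed in Section II-C; for independent attributes it drops to zero.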