Scaling Apriori for Association Rule Mining using Mutual Information Based Entropy

                                                   (IJCSIS) International Journal of Computer Science and Information Security,
                                                   Vol. 8, No. 6, September 2010

S. Prakash, Research Scholar, Sasurie College of Engineering, Vijayamangalam, Erode (DT), Tamilnadu, India. Ph. 09942650818
Dr. R. M. S. Parvathi, M.E.(CSE), Ph.D., Principal, Sengunthar College of Engg. for Women, Tiruchengode, Tamilnadu, India
Abstract - Extracting information from large datasets is a well-studied research problem. As larger and larger data sets become available (e.g., customer behavior data from organizations such as Wal-Mart), it is becoming essential to find better ways to extract relations (inferences) from them. This thesis proposes an improved Apriori algorithm that minimizes the number of candidate sets generated while producing association rules, by evaluating the quantitative information associated with each item occurring in a transaction; this information is usually discarded, as traditional association rules focus only on qualitative correlations. The proposed approach reduces not only the number of item sets generated but also the overall execution time of the algorithm. Any valued attribute is treated as quantitative and is used to derive quantitative association rules, which usually increases the rules' information content. Transaction reduction is achieved by discarding, in subsequent scans, the transactions that do not contain any frequent item set, which in turn reduces overall execution time. Dynamic item set counting is done by adding new candidate item sets only when all of their subsets are estimated to be frequent. The frequent item ranges are the basis for generating higher order item ranges using the Apriori algorithm. During each iteration of the algorithm, the frequent sets from the previous iteration are used to generate the candidate sets, whose support is then checked against the threshold. The set of candidate sets found is pruned by a strategy that discards sets containing infrequent subsets. The thesis evaluates the scalability of the algorithm by considering transaction time, the number of item sets used in the transaction, and memory utilization. Quantitative association rules can be used in several domains where the traditional approach is employed; the only requirement for such use is a semantic connection between the components of the item-value pairs. The proposal uses mutual information based on entropy to generate association rules from non-biological datasets.

Keywords - Apriori, Quantitative attribute, Entropy

                     I.   INTRODUCTION

          Data mining, also known as knowledge discovery in databases, has been recognized as a new area of dataset research. The problem of discovering association rules was introduced later. Given a set of transactions, where each transaction is a set of items, an association rule is an expression of the form X => Y, where X and Y are sets of items. The problem is to find all association rules that satisfy user-specified minimum support and minimum confidence constraints.

          Conceptually, this problem can be viewed as finding associations between the "1" values in a relational table where all the attributes are Boolean. The table has an attribute corresponding to each item and a record corresponding to each transaction. The value of an attribute for a given record is "1" if the item corresponding to the attribute is present in the transaction corresponding to the record, and "0" otherwise.

          Relational tables in most business and scientific domains have richer attribute types. Attributes can be quantitative (e.g. age, income) or categorical (e.g. zip code, make of car). Boolean attributes can be considered a special case of categorical attributes. This thesis defines the problem of mining association rules over quantitative attributes in large relational tables and presents techniques for discovering such rules. This is referred to as the Quantitative Association Rules problem.

          The problem of mining association rules in categorical data present in customer transactions was introduced by Agrawal, Imielinski and Swami [1][2]. That work provided the basic idea for several investigation efforts [4], resulting in descriptions of how to extend the original concepts and how to increase the performance of the related algorithms.
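As an illustration of the definitions above (a minimal sketch with invented transactions, not data from this work), the support and confidence of a rule X => Y can be computed directly from a transaction list:

```python
# Minimal illustration of support and confidence for a rule X => Y.
# The transactions here are invented for the example.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """P(Y | X): support of X union Y divided by support of X."""
    return support(x | y, transactions) / support(x, transactions)

print(support({"bread", "milk"}, transactions))    # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 2/3
```

Mining then amounts to keeping exactly the rules whose support and confidence clear the user-specified thresholds.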

                                                                                              ISSN 1947-5500

          The original problem of mining association rules was formulated as how to find rules of the form set1 => set2. Such a rule denotes an affinity or correlation between two sets containing nominal or ordinal data items. More specifically, the association rule conveys the following meaning: customers that buy the products in set1 also buy the products in set2. The statistical basis is represented in the form of minimum support and confidence measures of these rules with respect to the set of customer transactions.

          The original problem as proposed by Agrawal et al. [2] was extended in several directions, such as adding to or replacing the confidence and support with other measures, filtering the rules during or after generation, or including quantitative attributes. Srikant and Agrawal describe a new approach where quantitative data can be treated as categorical. This is very important since otherwise part of the customer transaction information is discarded.

          Whenever an extension is proposed, it must be checked in terms of its performance. The efficiency of an algorithm is linked to the size of the dataset it can handle. Therefore it is crucial to have efficient algorithms that enable us to examine and extract valuable decision-making information from ever larger databases.

          This thesis presents an algorithm that can be used in the context of several of the extensions provided in the literature while at the same time preserving its performance. The approach of our algorithm is to explore multidimensional properties of the data (provided such properties are present), allowing this additional information to be combined in a very efficient pruning phase. The result is a very flexible and efficient algorithm that was used with success in several experiments on quantitative databases, with performance measured by the memory utilization during the transactional pruning of the record sets.

          Various proposals for mining association rules from transaction data have been presented in different contexts. Some of these proposals are constraint-based, in the sense that all rules must fulfill a predefined set of conditions, such as support and confidence [6,7,8]. A second class identifies just the most interesting (or optimal) rules according to some interestingness metric, including confidence, support, gain, chi-squared value, gini, entropy gain, laplace, lift, and conviction [9,6]. However, the main goal common to all of these algorithms is to reduce the number of generated rules.

A) Existing Scheme

          The thesis extends the first group of techniques, since it neither relaxes any set of conditions nor employs an interestingness criterion to sort the generated rules. In this context, many algorithms for efficient generation of frequent item sets have been proposed in the literature since the problem was first introduced in [10]. The DHP algorithm [11] uses a hash table in pass k to perform efficient pruning of (k+1)-item sets. The Partition algorithm minimizes I/O by scanning the dataset only twice: in the first pass it generates the set of all potentially frequent item sets, and in the second pass the support of all of these is measured. The above algorithms are all specialized techniques which do not use any database operations. Algorithms using only general purpose DBMS systems and relational algebra operations have also been proposed [9,10].

          Few other works try to solve this mining problem for quantitative attributes. In [5], the authors proposed an algorithm which is an adaptation of the Apriori algorithm for quantitative attributes. It partitions each quantitative attribute into consecutive intervals using equi-depth bins. Adjacent intervals may then be combined to form new intervals in a controlled manner. From these intervals, frequent item sets (cf. large item sets in the Apriori algorithm) are then identified, and association rules are generated accordingly. The problem with this approach is that the number of possible interval combinations grows exponentially as the number of quantitative attributes increases, so it is not easy to extend the algorithm to higher dimensional cases. Besides, the set of rules generated may contain redundant rules, for which the authors present a "greater-than-expected-value" interest measure to identify the interesting ones.

          Some other efforts exploit the quantitative information present in transactions for generating association rules [12]. In [5], the quantitative rules are generated by discretizing the occurrence values of an attribute into fixed-length intervals and applying the standard Apriori algorithm to generate association rules. However, although simple, the rules generated by this approach may not be intuitive, mainly when there are semantic intervals that do not match the partition employed.
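The equi-depth partitioning described above can be sketched as follows (a minimal illustration with invented values, not the implementation from [5]): each quantitative attribute is split into intervals holding roughly equal numbers of records, so interval boundaries follow the data density rather than fixed lengths.

```python
# Equi-depth (equal-frequency) binning: each bin holds roughly the
# same number of values; the ages below are invented example data.
def equi_depth_bins(values, n_bins):
    """Return n_bins lists of sorted values of (nearly) equal size."""
    ordered = sorted(values)
    size, rem = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        end = start + size + (1 if b < rem else 0)  # spread the remainder
        bins.append(ordered[start:end])
        start = end
    return bins

ages = [23, 25, 29, 31, 34, 38, 41, 45, 52, 60, 61, 70]
for b in equi_depth_bins(ages, 3):
    print(b)  # three bins of 4 values each
```

Each bin then acts as one "item" (e.g. age in [23, 31]) for the standard Apriori pass.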


          Other authors [5] proposed novel solutions that minimize this problem by considering the distance among item quantities for delimiting the intervals, that is, their "physical" placement, but not the frequency of occurrence as a relevance metric.

          To visualize the information in the massive tables of quantitative measurements, we plan to use clustering and mutual information based on entropy. Clustering is a long-studied technique used to extract this information from customer behavior data sets. This follows from the fact that related customers, who purchase through word of mouth, have similar patterns of customer behavior. Clustering groups records that are "similar" into the same group. It suffers from two major defects: it does not tell you how two customer buying behaviors (clusters) are exactly related, and it gives you a global picture, so any relation at a local level can be lost.

B) Proposed Scheme

          The proposed scheme comprises two phases. The first phase of the thesis concerns quantitative association rule mining with the enhancement of the Apriori algorithm. The second phase deals with the reduction of memory utilization during the pruning phase of the transactional execution.

          The algorithm for generating quantitative association rules starts by counting the item ranges in the data set, in order to determine the frequent ones. These frequent item ranges are the basis for generating higher order item ranges using an algorithm similar to Apriori. The size of a transaction is taken to be the number of items it comprises.

a) Define an m-item set as a set of items of size m
b) Denote the frequent (large) m-item sets by Fm
c) Denote the candidate (possibly frequent) m-item sets by Lm.

          An n-range set is a set of n item ranges, and each m-item set has an n-range set that stores the quantitative rules of the item set. During each iteration of the algorithm, the system uses the frequent sets from the previous iteration to generate the candidate sets and checks whether their support is above the threshold. The set of candidate sets found is pruned by a strategy that discards sets containing infrequent subsets. The algorithm ends when there are no more candidate sets to be verified.

          The enhancement of Apriori is achieved by increasing the efficiency of the candidate pruning phase, reducing the number of candidates that are generated for further verification. The proposed algorithm uses quantitative information to estimate more precisely the overlap in terms of transactions. The basic elements considered in the development of the algorithm are the number of transactions, the average size of a transaction, the average size of the maximal large item sets, the number of items, and the distribution of occurrences of large item sets.

          The second phase of the thesis claims an improvement over Apriori by considering memory consumption during data transactions. This part of the algorithm generates all candidates based on 2-frequent item sets on a sorted dataset, and generates early all frequent item sets that can no longer be supported by transactions still to be processed. Thus the new algorithm no longer has to maintain the covers of all past item sets in main memory. In this way, the proposed level-wise algorithm accesses the dataset less often than Apriori and requires less memory because of the utilization of additional upward closure properties.

C) Mutual Information based entropy

          The mutual information I(X; Y) measures how much (on average) the realization of random variable Y tells us about the realization of X, i.e., by how much the entropy of X is reduced if we know the realization of Y:

          I(X; Y) = H(X) - H(X|Y)

For example, the mutual information between a cue and the environment indicates how much, on average, the cue tells us about the environment. The mutual information between a spike train and a sensory input tells us how much the spike train tells us about the sensory input. If the cue is perfectly informative, i.e., it tells us everything about the environment and nothing extra, then the mutual information between cue and environment is simply the entropy of the environment:

          I(X; Y) = H(X) - H(X|Y) = H(X) - H(X|X) = H(X).

In other words, the mutual information between a random variable and itself is simply its entropy: I(X; X) = H(X). Notably, mutual information is symmetric: X tells us exactly as much about Y as Y tells us about X.


     III. QUANTITATIVE ASSOCIATION RULE MINING – MUTUAL INFORMATION

          This work uses mutual information based on entropy for generating quantitative association rules. Apart from the usual positive correlations between the customers, this criterion also discovers association rules with negative correlations in the data sets. It is expected to find results of the form attrib1 / attrib2 -> ^attrib3, which can be interpreted as follows: attrib1 and attrib2 are co-expressed and have a silencing effect on attrib3. The results from our experiments are then compared to those obtained from clustering.

          First, various parameters (such as support, support fraction, and significance level) of the auto performance dataset are tuned. This is necessary because even with binary data, 2468 attributes may lead to 2^2468 relations (which the software was not designed to handle). Note that for the problem under consideration, the autos are the attributes, since we need to find relationships among them. To overcome this problem, another approach was used, in which data attributes were already known to be related, using the results obtained from clustering. This decreases the number of attributes to manageable levels for the program. The proposed work used this approach to find the relationships (positive and negative) among the attributes.

Algorithm Steps
          a. Find all frequent item sets (i.e., those satisfying minimum support)
          b. Generate strong association rules from the frequent item sets (each rule must satisfy minimum support and minimum confidence)
          c. Identify the quantitative elements
          d. Sort the item sets based on the frequency and quantitative elements
          e. Merge the more strongly associated rules of item pairs
          f. Discard the infrequent item-value pairs
          g. Iterate steps c to f until the required mining results are achieved

          Let I = {i1, i2, ..., im} be a set of items, and T a set of transactions, each a subset of I. An association rule is an implication of the form A => B, where A and B are non-intersecting item sets. The support of A => B is the percentage of the transactions that contain both A and B. The confidence of A => B is the percentage of transactions containing A that also contain B (interpreted as P(B|A)). The occurrence frequency of an item set is the number of transactions that contain the item set.

     IV. IMPLEMENTATION OF QUANTITATIVE APRIORI

          The function op is an associative and commutative function, so the iterations of the foreach loop can be performed in any order. The data structure Reduc is referred to as the reduction object. The main correctness challenge in parallelizing a loop like this on a shared memory machine arises from possible race conditions when multiple processors update the same element of the reduction object.

          The element of the reduction object that is updated in a loop iteration is determined only as a result of the processing. In the a priori association mining algorithm, each data item read needs to be matched against all candidates to determine the set of candidates whose counts will be incremented. The major factors that make these loops challenging to execute efficiently and correctly are as follows:

          It is not possible to statically partition the reduction object so that different processors update disjoint portions of the collection. Thus, race conditions must be avoided at runtime.

          The execution time of the function process can be a significant part of the execution time of an iteration of the loop. Thus, runtime preprocessing or scheduling techniques cannot be applied.

          The updates to the reduction object are fine grained. The reduction object comprises a large number of elements that take only a few bytes, and the foreach loop comprises a large number of iterations, each of which may take only a small number of cycles.

          {* Outer Sequential Loop *}
          While() {
              {* Reduction Loop *}
              Foreach(element e) {
                  (i, val) = process(e);
                  Reduc(i) = Reduc(i) op val;
              }
          }
                    Fig 1: Pseudo code

          The consumer behavior auto databases were obtained from the UCI Machine Learning Repository. The data obtained concerned CPU performance and automobile mileage, and was discretized into binary values.
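The reduction structure of Fig 1 can be made concrete for candidate counting. The sketch below is our illustration, not the thesis implementation: the names `process` and `reduc` follow the pseudocode, the transactions are invented, and a lock guards each fine-grained update to avoid the race conditions described above.

```python
import threading

# Candidate 2-item sets whose supports we want to count; invented data.
candidates = [frozenset(p) for p in
              [("bread", "milk"), ("bread", "butter"), ("milk", "butter")]]
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}] * 100

reduc = {c: 0 for c in candidates}   # the reduction object
lock = threading.Lock()              # guards the shared counts

def process(transaction):
    """Match one data item against all candidates (the '(i, val)' step)."""
    return [c for c in candidates if c <= transaction]

def worker(chunk):
    # The reduction loop: which element of reduc is updated is known
    # only after processing, so the updates cannot be pre-partitioned.
    for t in chunk:
        for c in process(t):
            with lock:               # avoid races on shared counts
                reduc[c] += 1        # 'op' here is addition

threads = [threading.Thread(target=worker, args=(transactions[k::4],))
           for k in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()

print(reduc[frozenset({"bread", "milk"})])  # 200
```

Because the addition is associative and commutative, the final counts are identical regardless of how the scheduler interleaves the four threads.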

          For these data sets, the discretization was done in accordance with the interpretation required, automatically, using software written for the purpose; this software also formatted the data into the format required by the program. A finer level of discretization (or supporting the real values) would have been more appropriate, but the approach used also gave many useful results.

          The process of executing quantitative association rule mining for the auto manufacturer power evaluation data is given below:

a) Get the data (file: 9 attributes, 398 samples).

b) Remove unique attributes (IDs). Here, the car_name attribute has been removed.

c) Remove those samples (5 in total) that contain "?" (missing data) as a value for some of their attributes (so we are left with 8 attributes and 393 samples).

d) Discretize real-valued attributes based on their average values, computed as (maximum attribute value + minimum attribute value) / 2.

e) Run the program to generate association rules using the mutual information based on entropy metric.

          The experiment focused on evaluating all quantitative a priori techniques. Since we were interested in seeing the best performance, we used banking data set samples. We used a 1 GB dataset. A confidence of 90% and a support of 0.5 were used.

          Execution times using 1, 2, 3, and 4 threads are presented for the processor. With 1 thread, Apriori does not have any significant overhead compared to the sequential version; therefore, this version is used for reporting all speedups. Though the performance of quantitative a priori is considerably lower than that of a priori, it is promising for cases when sufficient memory for supporting full replication may not be available. We consider four support levels: 0.1%, 0.05%, 0.03%, and 0.02%. The execution time efficiency is improved for the quantitative a priori on frequent item set evaluation with the support count (Graph 1).

Graph 1: Support vs Time for quantitative and qualitative Apriori

          The thread execution of the quantitative a priori and the qualitative a priori is evaluated for the same data set (Graph 2). Here the initial thread requires more time; however, subsequent threads show the better scalable performance of quantitative Apriori.

Graph 2: Thread vs Time for a priori execution on the rule set
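Step d) above can be sketched as follows (a minimal illustration with invented values; the thesis software is not available, and the tie rule for values exactly at the midpoint is our assumption):

```python
# Discretize a real-valued column around (max + min) / 2, as in step d).
# The horsepower values below are invented; values at or above the
# midpoint map to 1 (a choice -- the thesis does not state the tie rule).
def discretize_midpoint(column):
    """Map each value to 1 if it is >= the range midpoint, else 0."""
    mid = (max(column) + min(column)) / 2
    return [1 if v >= mid else 0 for v in column]

horsepower = [46.0, 88.0, 95.0, 130.0, 165.0, 230.0]
print(discretize_midpoint(horsepower))  # [0, 0, 0, 0, 1, 1]
```

Applying this to each of the 8 remaining attributes yields the binary table over which the Apriori pass and the mutual information computation operate.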


By observing the output from the program it is seen             efficiency improvement results from that the
that a few relationships between the attributes had             generation of the informative rule set needs fewer
high values of mutual information. Namely, the                  candidates and dataset accesses than that of the
highest MI-values were obtained for:                            association rule set rather than large memory usage
                                                                like some other efficient algorithms.
a) displacement and horsepower. Further, by
observing the entropy values we may notice that
there are very few cars that have small displacement                                    REFERENCES
and high horsepower.
b) displacement and weight. Further, by observing               [1] R. Agrawal, T. Imielinski, and A. Swami. Dataset mining: A
                                                                performance perspective. In IEEE Transactions on Knowlegde and
the entropy values we may notice that there are very
                                                                Data     Engineering, December 1993.
few cars that have large displacement and light                 [2] R. Agrawal, T. Imielinski, and A. Swami. Mining association
weight.                                                         rules      between sets of items in large databases. In Proc. of the
c) cylinders and weight. Further, by observing the              ACM SIGMOD Washington, D.C., pages 207-216, May 1993.
                                                                 [3] R. Miller and Y. Yang. Association rules over interval data. In
entropy values we may notice that there are very few
                                                                ACM SIGMOD Conference, Tucson, Arizona, pages 452 - 461,
cars that have small number of cylinders and heavy              May 1997.
weight.                                                         [4] J. Park, M. Chen, and P. Yu. An effective hash based algorithm
d) horsepower and weight. Further, by observing the             for     mining associative rules. In ACM SIGMOD Conference, San
                                                                Jose, CA,     pages 175 - 186, May 1995.
entropy values we may notice that there are very few
                                                                [5] R. Srikant and R. Agrawal. Mining quantitative association
cars that have large horsepower but heavy weight.               rules in        large relational tables. In Proceedings of the ACM
                                                                SIGMOD           Conference on Management of Data, pages 1–12,
                 VI. CONCLUSION                                 Montreal, Canada, June, 1996.
                                                                [6] R. Agrawal, T. Imielinski, and A. Swami. Dataset mining: A
                                                                performance perspective. In IEEE Transactions on Knowlegde and
         The thesis have defined a new a rule set               Data      Engineering, December 1993.
namely the informative rule set that presents                   [7] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and
prediction sequences equal to those presented by the            A.Verkamo. Fast       discovery of association rules. In Advances in
                                                                Knowledge Discovery and           Data Mining, San Jose, CA, pages
association rule set using the confidence priority. The
                                                                307-328, 1996.
informative rule set is significantly smaller than the          [8] R. Bayardo, R. Agrawal, and D. Gunopulos. Constraint- based
association rule set, be especially when the minimum            rule        mining in large, dense databases. In 15th International
support is small.                                               Conference on       Data Engineering, Sydney, Australia, pages 188

         The proposed method has merit in
extracting information from huge data sets: it prunes
the initial information down to manageable levels and
then finds the association rules among the attributes.
Further, the approach used to predict the
relationships among attributes such as silencer auto
power, weight, model, and year could be extended to
unknown functions.
         The proposed scheme has characterized the
relationships between the informative rule set and the
non-redundant association rule set, and has shown
that the informative rule set is a subset of the non-
redundant association rule set. The thesis exploits the
upward closure properties of the informative rule set
to omit uninformative association rules, and presents
a direct algorithm that generates the informative rule
set efficiently, without generating all frequent item
sets.

          The informative rule set generated in this
thesis is significantly smaller than both the
association rule set and the non-redundant association
rule set for a given dataset, and it can be generated
more efficiently than the association rule set.
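To make concrete why the informative rule set is smaller: a rule is uninformative when some more general rule (a strictly smaller antecedent, same consequent) reaches at least the same confidence. The following naive sketch applies that criterion over a precomputed support table; the names and toy numbers are our own assumptions, and unlike the thesis's direct algorithm it assumes all frequent item sets are already available.

```python
def single_consequent_rules(frequent, min_conf):
    """Enumerate rules A -> c (one-item consequent) from a support table
    mapping frozenset -> support, keeping those at or above min_conf."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for c in itemset:
            antecedent = itemset - {c}
            confidence = sup / frequent[antecedent]
            if confidence >= min_conf:
                rules.append((antecedent, c, confidence))
    return rules

def prune_to_informative(rules):
    """Drop a rule if a strictly more general rule predicts the same
    consequent with at least the same confidence."""
    return [(a, c, cf) for (a, c, cf) in rules
            if not any(a2 < a and c2 == c and cf2 >= cf
                       for (a2, c2, cf2) in rules)]

# Toy support table (downward closed): here every rule with a two-item
# antecedent is made redundant by a more general single-item rule.
supports = {frozenset("a"): 0.8, frozenset("b"): 0.5, frozenset("c"): 0.6,
            frozenset("ab"): 0.4, frozenset("ac"): 0.4,
            frozenset("bc"): 0.3, frozenset("abc"): 0.2}
all_rules = single_consequent_rules(supports, 0.5)
informative = prune_to_informative(all_rules)
```

On this toy table nine rules clear the confidence threshold, but the three rules with two-item antecedents are pruned, leaving six single-antecedent rules; the gap widens on real datasets, which is the size advantage the conclusion claims.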

                                                                                               ISSN 1947-5500

                AUTHORS PROFILE

Prof. S. Prakash completed his M.E. (Computer Science and
Engineering) at K.S.R. College of Technology, Tamilnadu,
India, in 2006. He is currently doing research in the field of
association rule mining algorithms and working as an
Assistant Professor in the Department of Information
Technology, Sasurie College of Engineering, Tamilnadu,
India. He has completed 9 years of teaching service.

Dr. R.M.S. Parvathi completed her Ph.D. degree in Computer
Science and Engineering in 2005 at Bharathiar University,
Tamilnadu, India. She is currently Principal and Professor,
Department of CSE, Sengunthar College of Engineering for
Women, Tamilnadu, India. She has completed 20 years of
teaching service, published more than 28 articles in
international and national journals, and authored 3 books
with reputed publishers. She is guiding 20 research scholars.
Her research areas of interest are Software Engineering, Data
Mining, Knowledge Engineering, and Object-Oriented
System Design.

