Seminar Report On


                           Submitted by

                       BENCY CLEETUS

       In partial fulfillment of the requirements for the degree of

Master of Technology (M-Tech) in Computer & Information Science.






I thank GOD almighty for guiding me throughout the seminar. I would like to thank all
those who have contributed to the completion of the seminar and helped me with valuable
suggestions for improvement.

I am extremely grateful to Prof. Dr. K Poulose Jacob, Director, Department of
Computer Science, for providing me with the best facilities and atmosphere for creative
work, and for his guidance and encouragement.

I would like to thank my coordinator, Mr. G. Santhosh Kumar, Lecturer, Department of
Computer Science, for all the help and support extended to me. I thank all staff members of my
college and my friends for extending their cooperation during my seminar.

Above all I would like to thank my parents, without whose blessings I would not have
been able to accomplish my goal.

Department of Computer Science, CUSAT                                                 1

        Association rules are one of the most researched areas of data mining and have
recently received much attention from the database community. They have proven to be
quite useful in the marketing and retail communities as well as in other, more diverse fields.
The association mining task is to discover a set of attributes shared among a large number of
objects in a given database. There are many potential application areas for association
rule technology, including catalog design, store layout, customer segmentation,
telecommunication alarm diagnosis, and so on. One of the important problems in data
mining is discovering association rules from databases of transactions in which each
transaction consists of a set of items. The most time-consuming operation in this
discovery process is the computation of the frequency of occurrence of interesting
subsets of items in the database of transactions.


                            TABLE OF CONTENTS
1.     INTRODUCTION                                   4
2.     OVERVIEW OF ASSOCIATION RULES                  6
       2.1     Association rule problem               7
       2.2     Association rules Generation           8
       2.3     Basic association rules                9
3.     APRIORI ITEMSET GENERATION                     9
               4.1 Generalized Association Rules      15
               4.2 Multiple-Level Association Rules   16
               4.3 Quantitative Association Rules     17
               4.4 Using Multiple Minimum Supports    18
5.     MEASURING THE QUALITY OF RULES                 19
6.     CONCLUSION                                     21
7.     REFERENCES                                     22


                                  1. INTRODUCTION

         Data mining is the discovery of hidden information found in databases and can be
viewed as a step in the knowledge discovery process. Data mining functions include
clustering, classification, prediction, and link analysis (associations). One of the most
important data mining applications is that of mining association rules. Association rules
are used to identify relationships among a set of items in a database. These relationships
are not based on inherent properties of the data themselves (as with functional
dependencies), but rather on the co-occurrence of the data items.

         Association rule mining, one of the most important and well-researched
techniques of data mining, was first introduced by Agrawal, Imielinski, and Swami (1993)
in "Mining association rules between sets of items in large databases." Association (rule)
mining is the task of finding correlations between items in a dataset.
Initial research was largely motivated by the analysis of market basket data, the results of
which allowed companies to more fully understand purchasing behavior and, as a result,
better target market audiences. For example, consider the sales database of a bookstore,
where the objects represent customers and the attributes represent books. The discovered
patterns are the set of books most frequently bought together by the customers. The store
can use this knowledge for promotions, shelf placement, etc. Association rule mining aims
to extract interesting correlations, frequent patterns, associations, or causal structures
among sets of items in transaction databases or other data repositories. Association rules
are widely used in
various areas such as telecommunication networks, market and risk management,
inventory control etc.

   One of the reasons behind maintaining any database is to enable the user to find
interesting patterns and trends in the data. For example, an insurance company, by finding
a strong correlation between two policies A and B, of the form A => B, indicating that
customers that held policy A were also likely to hold policy B, could more efficiently
target the marketing of policy B through marketing to those clients that held policy A but
not B.


    In effect, the rule represents knowledge about purchasing behavior. Association
mining has since been applied to many different domains, including market basket and
risk analysis in commercial environments, epidemiology, clinical medicine, fluid
dynamics, astrophysics, crime prevention, and counter-terrorism, all areas in which the
relationship between objects can provide useful knowledge. Association mining is
user-centric: the objective is the elicitation of useful rules from which new knowledge
can be derived. Association mining analysis is a two-part process: first, the identification
of sets of items (itemsets) within the dataset; second, the subsequent derivation of
inferences from these itemsets.


          Data mining is often defined as finding hidden information in a database. It has
been called exploratory data analysis, data driven discovery, and deductive learning. Data
mining access of a database differs from traditional access in several ways: the query
might not be well formed; the data accessed is usually a different version from that of
the original database; and the output of the data mining query probably is not a subset of the
database. Data mining algorithms can be characterized according to model, preference,
and search. Model means the algorithm fits a model to the data. Preference means some
criterion is used to prefer one model over another. All algorithms require some technique to
search the data.
          The model can be either predictive or descriptive in nature. A predictive model
makes a prediction about values of data using known results found from different data.
Predictive modeling may be based on the use of other historical data. For
example, a credit card use might be refused not because of the user's own credit history,
but because of the history of other, similar users. Predictive model data mining
tasks include classification, regression, time series analysis, and prediction. A descriptive
model identifies patterns or relationships in data. It serves as a way to explore the
properties of the data examined. Clustering, summarization, association rules, and
sequence discovery are usually descriptive in nature.

                                  Data mining tasks

       Predictive:  classification, regression, time series analysis, prediction
       Descriptive: clustering, association rules, sequence discovery


       Association rule mining finds the association rules that satisfy predefined
minimum support and confidence in a given database. The problem is usually
decomposed into two subproblems. One is to find those itemsets whose occurrences
exceed a predefined threshold in the database; those itemsets are called frequent or large
itemsets. The second is to generate association rules from those large itemsets under the
constraint of minimum confidence.

  Let I = {I1, I2, … , Im} be a set of m distinct attributes, T be a transaction that contains a
set of items such that T ⊆ I, and D be a database of different transaction records T. An
association rule is an implication of the form X => Y, where X, Y ⊆ I are sets of items

called itemsets, and X ∩ Y = ∅. X is called the antecedent and Y the consequent; the

rule means X implies Y.

       There are two important basic measures for association rules: support and
confidence. Since the database is large and users are concerned only with frequently
purchased items, thresholds of support and confidence are usually predefined by users to
drop those rules that are not interesting or useful. The two thresholds are called
minimal support and minimal confidence respectively. The support of an association rule is

defined as the percentage/fraction of records that contain X ∪ Y out of the total number of

records in the database. If the support of an item is 0.1%, it means that only 0.1 percent
of the transactions contain a purchase of this item.

       The confidence of an association rule is defined as the percentage/fraction of the
number of transactions that contain X ∪ Y out of the total number of records that contain X.
Confidence is a measure of the strength of an association rule: if the confidence of
the association rule X => Y is 80%, it means that 80% of the transactions that contain X
also contain Y.
Constraint-based association mining is the process of weeding out uninteresting rules
using constraints provided by the user:
   •   knowledge type constraints: specify what is to be mined, e.g., association rules,
       classification, etc.
   •   data constraints: specify the set of task-relevant data, e.g., “Find product pairs
       sold together in the North region in Q3’02"
   •   dimension/level constraints: specify the dimension of the data or levels of concept
   •   rule constraints: specify the form of rules, e.g., metarules, max./min. number of
       predicates, etc.
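The support and confidence measures defined above can be computed directly from a transaction list. A minimal sketch in Python, using a small made-up basket database (item names are illustrative assumptions):

```python
# Minimal sketch: support and confidence for a candidate rule X => Y
# over a toy transaction database; itemsets are modeled as frozensets.

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X): how often Y occurs when X does."""
    return support(x | y, transactions) / support(x, transactions)

transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk"}),
]

x, y = frozenset({"bread"}), frozenset({"butter"})
print(support(x | y, transactions))    # 0.5  (2 of 4 baskets)
print(confidence(x, y, transactions))  # 0.5 / 0.75 ≈ 0.667
```

With a minimal support of, say, 50% and minimal confidence of 60%, the rule bread => butter would survive both thresholds here.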

2.1 Association rule problem

A formal statement of the association rule problem is:
Definition 1: Let I = {I1, I2, … , Im} be a set of m distinct attributes, also called literals.
Let D be a database, where each record (tuple) T has a unique identifier and contains a
set of items such that T ⊆ I. An association rule is an implication of the form X => Y,

where X, Y ⊆ I are sets of items called itemsets, and X ∩ Y = ∅. Here, X is called the
antecedent, and Y the consequent.
Two important measures for association rules, support (s) and confidence (α), can be
defined as follows.
Definition 2: The support (s) of an association rule is the ratio (in percent) of the records
that contain X ∪ Y to the total number of records in the database. Therefore, if we say that
the support of a rule is 5%, it means that 5% of the total records contain X ∪ Y.
Support is the statistical significance of an association rule. Grocery store managers
probably would not be concerned about how peanut butter and bread are related if less
than 5% of store transactions have this combination of purchases. While a high support is
often desirable for association rules, this is not always the case.


        For example, if we were using association rules to predict the failure of
telecommunications switching nodes based on what set of events occurs prior to failure,
association rules showing this relationship would still be important even if these events
do not occur very frequently.

Definition 3: For a given number of records, confidence (α) is the ratio (in percent) of
the number of records that contain X ∪ Y to the number of records that contain X.
        Thus, if we say that a rule has a confidence of 85%, it means that 85% of the
records containing X also contain Y. The confidence of a rule indicates the degree of
correlation in the dataset between X and Y. Confidence is a measure of a rule’s strength.
Often a large confidence is required for association rules. If a set of events occur a small
percentage of the time before a switch failure or if a product is purchased only very rarely
with peanut butter, these relationships may not be of much use for management.
        Mining of association rules from a database consists of finding all rules that meet
the user-specified threshold support and confidence. The problem of mining association
rules can be decomposed into two subproblems:
    •   Generate all itemsets whose support exceeds the threshold. These itemsets are
        called large (frequent) itemsets; note that “large” here refers to high support, not
        to the number of items in the set.
    •   For each large itemset, generate all rules that have at least the minimum
        confidence.

2.2 Association rules Generation
    The number of association rules that can be generated from d items is 3^d - 2^(d+1) + 1.
For example, 6 items will yield 3^6 - 2^7 + 1 = 729 - 128 + 1 = 602 rules. Generating all
association rules for large d is intractable.
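The counting formula above can be sketched and checked in a few lines of Python:

```python
def num_association_rules(d):
    # Each rule assigns every one of the d items to one of three roles
    # (antecedent, consequent, absent): 3^d assignments. Subtract the
    # 2^d cases with an empty antecedent and the 2^d with an empty
    # consequent, then add back the doubly-subtracted all-absent case.
    return 3**d - 2**(d + 1) + 1

print(num_association_rules(6))  # 602, matching the worked example
```

For d = 2 the formula gives 9 - 8 + 1 = 2, i.e. the two rules a => b and b => a, which is a quick sanity check of the inclusion-exclusion argument.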

    Frequent itemset mining came from efforts to discover useful patterns in customers'
transaction databases. A transaction database is a sequence of transactions
T = (t1, t2, …, tn), where each transaction ti is an itemset (ti ⊆ I). An itemset with k
elements is called a k-itemset. The support of an itemset X in T, denoted support(X), is
the number of transactions that contain X, i.e. support(X) = |{ti : X ⊆ ti}|.


     An itemset is frequent if its support is greater than a support threshold, originally
denoted by min_supp. The frequent itemset mining problem is to find all frequent itemsets
in a given transaction database. The algorithms were judged on three main tasks: all
frequent itemset mining, closed frequent itemset mining, and maximal frequent itemset
mining. A frequent itemset is called closed if there is no superset that has the same
support (i.e., is contained in the same number of transactions).

    Closed itemsets capture all information about the frequent itemsets, because from
them the support of any frequent itemset can be determined. A frequent itemset is called
maximal if there is no superset that is frequent. Maximal itemsets define the boundary
between frequent and infrequent sets in the subset lattice. An arbitrary frequent itemset is
often called a free itemset to distinguish it from closed and maximal ones.
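The distinction between frequent, closed, and maximal itemsets can be illustrated by brute force on a toy database (fine for three transactions; real miners like Apriori avoid this enumeration):

```python
from itertools import combinations

# Toy database: {a,c}, {a,b,c}, {b,c}; absolute support threshold of 2.
transactions = [frozenset("ac"), frozenset("abc"), frozenset("bc")]
items = sorted(set().union(*transactions))
min_supp = 2

def supp(s):
    """Absolute support: number of transactions containing s."""
    return sum(s <= t for t in transactions)

# Enumerate every nonempty itemset and keep the frequent ones.
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if supp(frozenset(c)) >= min_supp]

# Closed: no proper superset has the same support.
closed = [f for f in frequent
          if not any(f < g and supp(g) == supp(f) for g in frequent)]

# Maximal: no proper superset is frequent at all.
maximal = [f for f in frequent if not any(f < g for g in frequent)]
```

Here {a} is frequent but not closed (its superset {a,c} has the same support of 2), while {a,c} and {b,c} are both closed and maximal, marking the frequent/infrequent boundary in the lattice.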

2.3 Basic Association Rules

       Basic association rules are a minimal set of association rules, analogous to minimal
functional dependencies; they come with a set of inference rules, based on restricted
conditional probability distributions, that parallel Armstrong’s axioms.

       The GenBR algorithm generates basic association rules, which are nonredundant
single-consequent, or canonical, rules. Theoretical analysis shows that the search space
of the algorithm can be translated to an n-cube graph. The set of classes of basic
association rules generated by GenBR is easy for users to understand and interpret.

                    3. APRIORI ITEMSET GENERATION

       Apriori is a classic algorithm for learning association rules. Apriori is designed to
operate on databases containing transactions. Apriori uses a "bottom up" approach, where
frequent subsets are extended one item at a time and groups of candidates are tested
against the data. Apriori uses breadth-first search and a hash tree structure to count
candidate item sets efficiently. It generates candidate item sets of length k from item sets
of length k − 1. Then it prunes the candidates which have an infrequent sub-pattern.
According to the downward closure lemma, the candidate set contains all frequent
k-length item sets. After that, it scans the transaction database to determine frequent item
sets among the candidates. To determine frequent items quickly, the algorithm uses a
hash tree to store candidate itemsets. This hash tree has item sets at the leaves and hash
tables at internal nodes. Note that this is not the same kind of hash tree used in, for
instance, peer-to-peer systems.

     Apriori Algorithm:
Pass 1

     1. Generate the candidate itemsets in C1
     2. Save the frequent itemsets in L1

Pass k

     1. Generate the candidate itemsets in Ck from the frequent

         itemsets in Lk-1

             1. Join Lk-1 p with Lk-1 q, as follows:

                   insert into Ck
                   select p.item1, p.item2, . . . , p.itemk-1, q.itemk-1
                   from Lk-1 p, Lk-1 q
                   where p.item1 = q.item1, . . . , p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

            2. Generate all (k-1)-subsets from the candidate itemsets in Ck
            3. Prune all candidate itemsets from Ck where some (k-1)-subset of the
               candidate itemset is not in the frequent itemset Lk-1
     2. Scan the transaction database to determine the support for each candidate itemset
        in Ck
     3. Save the frequent itemsets in Lk
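The passes above can be sketched in Python. This is a minimal illustration (using frozensets and an absolute-count threshold, without the hash-tree optimization the text describes), not a full implementation:

```python
from itertools import combinations

def apriori(transactions, min_supp):
    """Return {itemset: support count} for all frequent itemsets,
    following the join/prune/scan passes sketched above."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count individual items and save the frequent 1-itemsets in L1.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_supp}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: the union of two (k-1)-itemsets has size k exactly
        # when they share their first k-2 items (here: any k-2 items).
        prev = list(current)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    # Prune step: every (k-1)-subset must be in Lk-1.
                    if all(frozenset(sub) in current
                           for sub in combinations(union, k - 1)):
                        candidates.add(union)
        # Scan the database to determine the support of each candidate.
        current = {}
        for c in candidates:
            s = sum(c <= t for t in transactions)
            if s >= min_supp:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent
```

Running this on the Example 1 database below with an absolute threshold of 2 (i.e. 50% of 4 transactions) yields exactly the frequent pairs {A,C} and {B,D} shown in the L2 table.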

Example 1: Assume the user-specified minimum support is 50%

     •   Given: The transaction database shown below

TID            A               B             C             D             E              F
T1             1               0             1             1             0              0
T2             0               1             0             1             0              0
T3             1               1             1             0             1              0
T4             0               1             0             1             0              1


   •     The candidate itemsets in C2 are shown below

Itemset X            supp(X)
{A,B}                25%
{A,C}                50%
{A,D}                25%
{B,C}                25%
{B,D}                50%
{C,D}                25%

   •     The frequent itemsets in L2 are shown below

Itemset X             supp(X)
{A,C}                 50%
{B,D}                 50%

Example 2: Assume the user-specified minimum support is 40%; generate all frequent itemsets.

Given: The transaction database shown below

TID             A               B             C            D              E
T1              1               1             1            0              0
T2              1               1             1            1              1
T3              1               0             1            1              0
T4              1               0             1            1              1
T5              1               1             1            1              0

Pass 1

        C1                                 L1
Itemset X        supp(X)          Itemset X        supp(X)
A                ?                A                100%
B                ?                B                60%
C                ?                C                100%
D                ?                D                80%
E                ?                E                40%

Pass 2
Itemset X            supp(X)
A,B                  ?
A,C                  ?
A,D                  ?
A,E                  ?
B,C                  ?
B,D                  ?
B,E                  ?
C,D                  ?
C,E                  ?
D,E                  ?

Nothing is pruned since all subsets of these itemsets are frequent

        C2 with supports counted           L2 after saving only the frequent itemsets
Itemset X                  supp(X)                    Itemset X            supp(X)
A,B                        60%                        A,B                  60%
A,C                        100%                       A,C                  100%
A,D                        80%                        A,D                  80%
A,E                        40%                        A,E                  40%
B,C                        60%                        B,C                  60%
B,D                        40%                        B,D                  40%
B,E                        20%                        C,D                  80%
C,D                        80%                        C,E                  40%
C,E                        40%                        D,E                  40%
D,E                        40%

Pass 3
To create C3 only look at itemsets that share the same first item (in pass k, the first k - 2
items must match)

        C3 (join step)                       C3 after pruning
join AB with AC  ->  A,B,C                   Itemset     supp(X)
join AB with AD  ->  A,B,D                   A,B,C       ?
join AB with AE  ->  A,B,E                   A,B,D       ?
join AC with AD  ->  A,C,D                   A,C,D       ?
join AC with AE  ->  A,C,E                   A,C,E       ?
join AD with AE  ->  A,D,E                   A,D,E       ?
join BC with BD  ->  B,C,D                   B,C,D       ?
join CD with CE  ->  C,D,E                   C,D,E       ?

   •     Pruning eliminates ABE since BE is not frequent
   •     Scan transactions in the database


Itemset X              supp(X)
A,B,C                  60%
A,B,D                  40%
A,C,D                  80%
A,C,E                  40%
A,D,E                  40%
B,C,D                  40%
C,D,E                  40%

Pass 4
       First k - 2 = 2 items must match in pass k = 4

                                   Itemset X        supp(X)
combine ABC with ABD    ->         A,B,C,D          ?
combine ACD with ACE    ->         A,C,D,E          ?
   •     Pruning:
            o For ABCD we check whether ABC, ABD, ACD, BCD are frequent. They
                are in all cases, so we do not prune ABCD.
            o For ACDE we check whether ACD, ACE, ADE, CDE are frequent. Yes,
                in all cases, so we do not prune ACDE


Itemset X             supp(X)
A,B,C,D               40%
A,C,D,E               40%

Both are frequent.
Pass 5: For pass 5 we can't form any candidates because there aren't two frequent 4-
itemsets beginning with the same 3 items.
The Apriori algorithm assumes that the database is memory-resident. The maximum
number of database scans is one more than the cardinality of the largest large itemset.
Given an itemset I = {a,b,c,d,e}: if an itemset is frequent, then all of its subsets must also
be frequent; conversely, if an itemset is infrequent, then all of its supersets are infrequent.

If {c,d,e} is frequent then all its subsets must also be frequent.

If {a,b} is infrequent, then all its supersets are infrequent.
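The results of the worked example above can be re-checked with a brute-force scan, which also illustrates the downward closure property at level 4:

```python
from itertools import combinations

# The five transactions of Example 2 (min support 40% of 5 transactions,
# i.e. an absolute count of 2), written as item strings.
rows = ["ABC", "ABCDE", "ACD", "ACDE", "ABCD"]
transactions = [frozenset(r) for r in rows]

def supp(s):
    """Number of transactions containing itemset s."""
    return sum(s <= t for t in transactions)

# All frequent 4-itemsets, found exhaustively rather than via Apriori.
frequent4 = [frozenset(c) for c in combinations("ABCDE", 4)
             if supp(frozenset(c)) >= 2]
print(sorted("".join(sorted(f)) for f in frequent4))  # ['ABCD', 'ACDE']
```

The brute-force result agrees with Pass 4 of the example: only ABCD and ACDE are frequent, and every subset of each (ABC, ABD, ACD, BCD, ACE, ADE, CDE, …) appears in the earlier passes, as downward closure requires.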



4.1 Generalized Association Rules
          We introduce the problem of mining generalized association rules. Given a large
database of transactions, where each transaction consists of a set of items, and a taxonomy (is-a
hierarchy) on the items, we find associations between items at any level of the taxonomy. For
example, given a taxonomy that says that jackets is-a outerwear is-a clothes, we may infer a rule
that “people who buy outerwear tend to buy shoes”. This rule may hold even if rules that “people
who buy jackets tend to buy shoes” and “people who buy clothes tend to buy shoes” do not hold.
An obvious solution to the problem is to add all ancestors of each item in a transaction to the
transaction, and then run any of the algorithms for mining association rules on these “extended
transactions”. However, this “Basic” algorithm is not very fast; we present two algorithms,
Cumulate and EstMerge, which run 2 to 5 times faster than Basic (and more than 100
times faster on one real-life dataset). We also present a new interest measure for rules which
uses the information in the taxonomy. Given a user-specified “minimum-interest-level”, this
measure prunes a large number of redundant rules; 40% to 60% of all the rules were pruned on
two real-life datasets.

          Example of a taxonomy:
              Clothes  -> Outerwear, Shirts;  Outerwear -> Jackets, Ski Pants
              Footwear -> Shoes, Hiking Boots
          Earlier work on association rules did not consider the presence of taxonomies and
restricted the items in association rules to the leaf-level items in the taxonomy. However,
finding rules across different levels of the taxonomy is valuable since:
     •     Rules at lower levels may not have minimum support. Few people may buy
           Jackets with Hiking Boots, but many people may buy Outerwear with Hiking
           Boots. Thus many significant associations may not be discovered if we restrict
           rules to items at the leaves of the taxonomy. Since department stores or
           supermarkets typically have hundreds of thousands of items, the support for
           rules involving only leaf items (typically UPC or SKU codes) tends to be
           extremely small.

     •    Taxonomies can be used to prune uninteresting or redundant rules.
         Multiple taxonomies may be present. For example, there could be a taxonomy for
     the price of items (cheap, expensive, etc.), and another for the category. Multiple
     taxonomies may be modeled as a single taxonomy which is a DAG (directed acyclic
     graph). A common application that uses multiple taxonomies is loss-leader analysis.
     In addition to the usual taxonomy which classifies items into brands, categories,
     product groups, etc., there is a second taxonomy where items which are on sale are
     considered to be children of an “items-on-sale” category, and users look for rules
     containing the “items-on-sale” item.

4.2 Multiple-Level Association Rules

         Previous studies on mining association rules find rules at single concept level;
however, mining association rules at multiple concept levels may lead to the discovery of
more specific and concrete knowledge from data. A top-down progressive deepening
method is developed for mining multiple-level association rules from large transaction
databases by extension of some existing association rule mining techniques.
         Multiple-level association rules use a hierarchy-information-encoded
transaction table, instead of the original transaction table, in iterative data mining. This is
based on the following considerations. First, a data mining query is usually relevant
to only a portion of the transaction database, such as food, instead of all the items. It is
beneficial to first collect the relevant set of data and then work repeatedly on the task-
relevant set. Second, encoding can be performed during the collection of task-relevant
data, so no extra “encoding pass” is required. Third, an encoded string, which
represents a position in a hierarchy, requires fewer bits than the corresponding object
identifier or bar-code. Moreover, encoding allows more items to be merged due to
identical encodings, which further reduces the size of the encoded transaction table.

         For example, encoded as a sequence of digits in the transaction table, the item
‘2% Foremost milk’ is encoded as ‘112’, in which the first digit, ‘1’, represents ‘milk’ at
level 1; the second, ‘1’, ‘2% (milk)’ at level 2; and the third, ‘2’, the brand ‘Foremost’ at
level 3.
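The encoding scheme above has the convenient property that every proper prefix of an item's code is one of its ancestors in the hierarchy. A minimal sketch (the code-to-name mapping is an illustrative assumption based on the milk example):

```python
# Hypothetical fragment of a hierarchy-encoded item table: each code is a
# digit string whose positions name concept levels.
hierarchy = {"1": "milk", "11": "2% milk", "112": "2% Foremost milk"}

def ancestors(code):
    """All proper prefixes of an encoded item are its ancestors."""
    return [code[:i] for i in range(1, len(code))]

print([hierarchy[a] for a in ancestors("112")])  # ['milk', '2% milk']
```

This is why the encoded table supports multi-level mining cheaply: raising an item to a higher concept level is just truncating its code, with no extra lookup structure needed.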


4.3 Quantitative Association Rules

        We introduce the problem of mining association rules in large relational tables
containing both quantitative and categorical attributes. An example of such an association
might be “10% of married people between ages 50 and 60 have at least 2 cars”. We deal
with quantitative attributes by finely partitioning the values of the attribute and then
combining adjacent partitions as necessary. We introduce measures of partial
completeness which quantify the information lost due to partitioning. A direct application
of this technique can generate too many similar rules. We tackle this problem by using a
“greater-than-expected-value” interest measure to identify the interesting rules in the
output. We give an algorithm for mining such quantitative association rules. Finally, we
describe the results of using this approach on a real-life dataset.

4.4 Using Multiple Minimum Supports

        Since a single threshold support is used for the whole database, it assumes that all
items in the data are of the same nature and/or have similar frequencies. In reality, some
items may be very frequent while others may rarely appear. However, the latter may be
more informative and more interesting than the former. For example, besides finding a
rule bread => cheese with a support of 8%, it might be more informative to show that
wheatBread => swissCheese has a support of 3%. Another simple example is items in a
supermarket, such as food processors and cooking pans, which are sold less frequently
but are more profitable. Therefore, it might be very interesting to discover a useful
rule foodProcessor => cookingPan with a support of 2%.

        If the threshold support is set too high, rules involving rare items will not be
found. To obtain rules involving both frequent and rare items, the threshold support has
to be set very low. Unfortunately, this may cause combinatorial explosion, producing too
many rules, because the frequent items will be associated with one another in all possible
ways and many of the resulting rules are meaningless. This dilemma is called the “rare
item problem”. To overcome this problem, one of the following strategies may be
followed: (a) split the data into a few blocks according to the supports of the items and
then discover association rules in each block with a different threshold support, or (b)
group a number of related rare items together into an abstract item so that this abstract
item is more frequent, and then apply an algorithm for finding association rules in
numerical interval data.

         It is evident that both approaches are ad hoc and approximate. Rules associated
with items across different blocks are difficult to find using the first approach. The
second approach cannot discover rules involving individual rare items and the more
frequent items. Therefore, a single threshold support for the entire database is inadequate
to discover important association rules because it cannot capture the inherent natures
and/or frequency differences of items in the database. The existing association rule
model was therefore extended to allow the user to specify multiple threshold supports;
the extended algorithm is named MISapriori. In this method, the threshold support is
expressed in terms of minimum item supports (MIS) of the items that appear in the rule.
The main feature of this technique is that the user can specify a different threshold item
support for each item. Therefore, this technique can discover rare-item rules without
causing frequent items to generate too many unnecessary rules.

         Like conventional algorithms, MISapriori generates all large itemsets by making
multiple passes over the data. In the first pass, it counts the supports of individual items
and determines which of them are large. In each subsequent pass, it uses the large
itemsets of the previous pass to generate candidate itemsets; by computing the actual
supports of these candidates, MISapriori determines which of them are actually large at
the end of the pass. However, the generation of large itemsets in the second pass differs
from other algorithms. A key operation in MISapriori is the sorting of the items I in
ascending order of their MIS values. This ordering is used in the subsequent operations
of the algorithm.
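A simplified sketch of the first pass, with toy transactions and illustrative MIS values; the full MISapriori candidate generation is more involved, and this only shows the two ideas named above, the per-item threshold and the MIS-ascending ordering:

```python
def ms_apriori_first_pass(transactions, mis):
    """First pass of a multiple-minimum-support Apriori (simplified):
    count each item's support and keep the items that meet their own
    MIS value, after sorting the items in ascending order of MIS."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    ordered = sorted(mis, key=mis.get)  # ascending MIS order
    return [i for i in ordered if counts.get(i, 0) / n >= mis[i]]

transactions = [
    {"bread", "milk"}, {"bread", "milk"}, {"bread", "milk"},
    {"bread", "caviar"}, {"milk"},
]
# The rare but interesting item gets a much lower MIS than the staples,
# so it survives the first pass without lowering the staples' thresholds.
mis = {"bread": 0.6, "milk": 0.6, "caviar": 0.15}
large_1 = ms_apriori_first_pass(transactions, mis)
```

With a single minsup of 0.6, caviar (support 0.2) would be pruned; with its own MIS of 0.15 it is kept, while bread and milk are still held to the higher threshold.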

        The extended model was tested and evaluated using synthetic as well as real-life
data sets. In the experimental study with synthetic data, three very low LS values, 0.1%,
0.2%, and 0.3%, were used. It has been reported that the number of large itemsets is
significantly reduced by the MISapriori method when β (the parameter relating each
item's MIS value to its actual frequency) is not too large. The number of large itemsets
found by this approach approaches that of the single-minsup method as β becomes
larger, because more and more items' MIS values then reach LS. It has also been argued
that the execution time is reduced as well.


4.5 Correlation Rules

        A correlation rule is defined as a set of itemsets that are correlated. The
motivation for developing correlation rules is that negative correlations may also be
useful. Correlation satisfies upward closure in the itemset lattice: if a set is correlated,
so is every superset of it. The correlation of a rule A => B is defined as

correlation(A => B) = P(A, B) / [ P(A) P(B) ]

     If this correlation value is less than 1, it indicates a negative correlation between A
and B; a value greater than 1 indicates a positive correlation.
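The correlation value can be computed directly from a transaction list. A minimal sketch, with toy data and illustrative item names:

```python
def correlation(transactions, A, B):
    """correlation(A => B) = P(A, B) / (P(A) * P(B)).
    Values below 1 indicate a negative correlation between A and B."""
    n = len(transactions)
    p_a = sum(1 for t in transactions if A <= t) / n
    p_b = sum(1 for t in transactions if B <= t) / n
    p_ab = sum(1 for t in transactions if (A | B) <= t) / n
    return p_ab / (p_a * p_b)

transactions = [
    {"coffee", "sugar"}, {"coffee", "sugar"},
    {"coffee"}, {"tea", "sugar"},
]
neg = correlation(transactions, {"coffee"}, {"tea"})  # never co-occur
pos = correlation(transactions, {"tea"}, {"sugar"})   # tea implies sugar
```

Here `neg` comes out below 1 (in fact 0, since coffee and tea never appear together) and `pos` above 1, matching the interpretation in the text.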

                   5. MEASURING THE QUALITY OF RULES

                Support and confidence are the usual measures of the quality of an
    association rule:
                                         s(A => B) = P(A, B) and α(A => B) = P(B | A)
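These two measures can be computed directly from a transaction list. A minimal sketch, with toy data and illustrative item names:

```python
def support(transactions, itemset):
    """s(X): fraction of transactions containing every item in X."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, A, B):
    """alpha(A => B) = P(B | A) = s(A u B) / s(A)."""
    return support(transactions, A | B) / support(transactions, A)

transactions = [
    {"bread", "butter"}, {"bread", "butter", "jam"},
    {"bread"}, {"butter"},
]
s = support(transactions, {"bread", "butter"})       # 2 of 4 = 0.5
c = confidence(transactions, {"bread"}, {"butter"})  # 2 of 3
```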
        Another technique, measuring the significance of rules with the chi-squared test
    for independence, has been proposed for use with correlation rules. Unlike the support
    or confidence measures, the chi-squared significance test takes into account both the
    presence and the absence of items in sets. Here it is used to measure how much an
    itemset's count differs from what is expected.

        The chi squared statistic can be calculated in the following manner. Suppose the
    set of items is I = {I1, I2, ..., Im}. A transaction tj can be viewed as an element of
    the Cartesian product

                tj ∈ {I1, ¬I1} x {I2, ¬I2} x ... x {Im, ¬Im}

    where ¬Ii denotes the absence of item Ii. Any possible itemset X is likewise viewed
    as a subset of this Cartesian product. The chi squared statistic for X is then

                        χ² = Σ_x [O(x) - E(x)]² / E(x)

    Here O(x) is the count of transactions that contain the items in x. For a single item
    Ii, the expected value is E(Ii) = O(Ii), the count of transactions that contain Ii, and
    E(¬Ii) = n - O(Ii). The expected value E(x) is calculated assuming independence and
    is thus defined as

                        E(x) = n × Π (i=1..m) E(xi) / n

    where n is the number of transactions.
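The statistic above can be sketched in Python by summing over every presence/absence pattern of the itemset; the transactions and item names are illustrative:

```python
from itertools import product

def chi_squared(transactions, items):
    """Chi-squared statistic for an itemset: sum (O - E)^2 / E over all
    presence/absence combinations of the items, with E computed under
    independence as n * prod(E(xi) / n)."""
    n = len(transactions)
    # Single-item counts: E(Ii) = O(Ii) and E(not Ii) = n - O(Ii).
    count = {i: sum(1 for t in transactions if i in t) for i in items}
    chi2 = 0.0
    for pattern in product([True, False], repeat=len(items)):
        # Observed count of transactions matching this pattern exactly.
        o = sum(1 for t in transactions
                if all((i in t) == present
                       for i, present in zip(items, pattern)))
        # Expected count under independence.
        e = float(n)
        for i, present in zip(items, pattern):
            e *= (count[i] if present else n - count[i]) / n
        if e > 0:
            chi2 += (o - e) ** 2 / e
    return chi2

transactions = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}, set()]
x2 = chi_squared(transactions, ["a", "b"])
```

For two items this reduces to the familiar 2x2 contingency-table test; a value near 0, as here, means the observed counts are close to what independence predicts.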


                     6. APPLICATIONS OF ASSOCIATION RULES

       The development of a new generation of databases, called "inductive databases"
(IDBs), was suggested by Imielinski and Mannila. This kind of database integrates raw
data with knowledge extracted from the raw data, materialized in the form of patterns,
into a common framework that supports the knowledge discovery process. In this way,
the KDD process essentially becomes a querying process, enabled by an ad hoc,
powerful, and universal query language that can deal with either raw data or patterns
and that can be used throughout the whole KDD process across many different
applications. We are still far from an understanding of the fundamental primitives for
such query languages when considering the various kinds of knowledge discovery
processes. Association rule mining provides an interesting context for studying the
inductive database framework.

       Association rule mining (ARM) has been applied to discover heuristic rules for
power system restoration (PSR) that guide a fast restoration process. In order to employ
the popular ARM algorithms, the PSR process is represented as a series of actions drawn
from a finite action set. The interesting attributes of each action are mapped to items and
the actions are mapped to transactions. Fuzzy sets and clustering methods are adopted to
evaluate the performance of individual actions.

       Association rule mining has also been applied to Named Entity Recognition
(NER) and co-reference resolution. The method combines several morphological and
lexical features, such as Pronoun Class (PC), Name Class (NC), String Similarity (SP),
and Position (P) in the text, into a vector of attributes. Applied to a corpus of newspaper
text in the Indonesian language, the method outperforms the state-of-the-art maximum
entropy method in named entity recognition and is comparable with a state-of-the-art
machine learning method, the decision tree, for co-reference resolution.

       An association rule mining method has also been proposed that considers users'
differential emphasis on each item through fuzzy regions. This is more realistic and
practical than prior association rule methods, and the discovered rules are expressed in
natural language that is more understandable to humans.

        Another line of work addresses the discovery of spatial association rules, that is,
association rules involving spatial relations among (spatial) objects; spatial association
rule mining is an extension of transactional association rule mining. The method is based
on a multi-relational data mining approach and takes advantage of the representation and
reasoning techniques developed in the field of inductive logic programming (ILP). In
particular, the expressive power of predicate logic is profitably used to represent spatial
relations and background knowledge (such as spatial hierarchies and rules for spatial
qualitative reasoning) in a very elegant and natural way. The integration of computational
logic with efficient spatial database indexing and querying procedures permits
applications that cannot be tackled by traditional statistical techniques in spatial data
analysis. The proposed method has been implemented in the ILP system SPADA (Spatial
Pattern Discovery Algorithm), and preliminary results of its application to Stockport
census data have been reported.


                                  7. CONCLUSION

        Association mining has become a mature field of research with diverse branches
of specialization. The fundamentals of association mining are now well established, with
some important exceptions. The task of finding correlations between items in a dataset
has received considerable attention, yet there appears to be little current research on
improving general itemset identification or rule generation. A promising direction is the
modeling of specific association patterns that are both statistically grounded in support
and confidence and semantically related to an objective that the user wants to achieve or
is interested in.


                                  8. REFERENCES

[1] ACM Computing Surveys, Vol. 38, No. 2, Article 5, July 2006.

[2] Margaret H. Dunham, Data Mining: Introductory and Advanced Topics,
    Southern Methodist University.
