IJETTCS-2013-06-25-158 by editorijettcs


									    International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
       Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 2, Issue 3, May – June 2013                                             ISSN 2278-6856

      Balancing Security overhead and Performance
      Metrics using a novel Multi-Objective Genetic
      Wasim Khalil Shalish 1, A. Z. Ghalwash 2, H. M. El-Deeb 3 and K. Badran 4
      1 Syrian Armed Forces
      2 Helwan University
      3 Modern University
      4 Military Technical College

Abstract: With the rapid progress of networks, data mining and information sharing techniques, protecting the privacy of sensitive information in a database has become a vital issue. The task of association rule mining is to discover hidden relationships between items in a database and to reveal frequent item sets and strong association rules. Some rules or frequent item sets are called sensitive because they contain critical information that is vital or private to its owner.
In recent GA research, users have tried to combine (or aggregate) multiple objectives into a single scalar function, either by using different weights for each objective or by adding penalty functions for specific objectives. These methods add more adjustable parameters, which require profound domain knowledge that is usually not available. In addition, the solutions generated are usually very sensitive to small changes in these weights or penalty functions. In this paper, we propose a method that addresses those limitations by using multi-objective fitness functions, leaving the user free to minimize or maximize several objectives depending on his or her problem. We investigate the problem using a Multi-Objective Genetic Algorithm to find the optimum state of modification. Finally, we conduct experiments and test our approach on datasets. The experimental results show that the number of sensitive rules in the sanitized data set (hiding failure) is equal to zero. The number of non-sensitive patterns discovered from the original database D differs from the number discovered from the sanitized database. Since we hide most of the patterns considered sensitive from the original data set, the miss cost (MC) is equal to 36%. The percentage of the discovered patterns that are artifacts (AP) is 27%. The percentage of dissimilarity (DISS) between the original and the sanitized datasets is 26%. The number of non-sensitive association rules removed as an effect of the sanitization process is four.

Keywords: MOPP, MOGA, DBMS, MST, MCT and GA.

1-INTRODUCTION
Nowadays, data mining techniques have been applied successfully in many areas that benefit commercial, social and human activities. Along with this success, these techniques pose a threat to privacy: one can easily disclose another's sensitive information or knowledge by using them. So, before a database is released, sensitive information or knowledge must be hidden from unauthorized access. To solve this privacy problem, Privacy-Preserving Data Mining (PPDM) has become a hotspot in the data mining and database security fields.
Recent advances in data mining algorithms have increased the risk of information leakage and the associated confidentiality issues. Because of this progress, a parallel research area has been started to overcome the information leakage risks and to immunize the mining environment. Privacy preservation against mining algorithms is a new research area that investigates the side effects of data mining methods that arise from the diffusion of private information about persons and organizations. Mining these effects can be considered as an optimization problem.

Optimization Technique
Optimization techniques are used for problems in which one needs to minimize or maximize a real function by methodically choosing the values of real or integer variables from within a particular set. Optimization means finding the "best available" values of some objective function over a defined domain, and it covers a variety of different types of objective functions and domains. Many types of optimization techniques and algorithms are used in various approaches. In this paper we use the genetic algorithm for minimizing the cost function.

Genetic Algorithm
The genetic algorithm (GA) is an optimization and search technique based on the principles of genetics and natural selection. GA allows a population composed of many individuals to evolve under particular selection rules to a state that maximizes the "fitness" (i.e., minimizes the cost function).
In GA, a population consists of a group of individuals called chromosomes, each of which represents a complete solution to a certain problem. Each chromosome is a sequence of 0s and 1s. The initial population is a randomly generated set of individuals. A new population is generated by one of two methods: the steady-state Genetic Algorithm or the generational Genetic Algorithm. The steady-state Genetic Algorithm replaces one or two members of the population, whereas the generational Genetic Algorithm replaces all of them at each generation. In this work a steady-state Genetic Algorithm is adopted as
the population replacement method. This method tries to keep a certain number of the best individuals from each generation and copies them to the new generation.
Each transaction is represented as a chromosome: the occurrence of the ith item in a transaction is shown by 1, and its non-occurrence by 0, in the ith bit of the transaction. The fitness of a chromosome is determined by several methods and different strategies. Each population consists of several chromosomes, and the best chromosome is used to generate the next population. For the initial population, a large number of random transactions is preferred. Based on survival fitness, the population passes over into the next generation.

Fitness function
The fitness function is defined over the genetic representation and measures the quality of the represented solution. The fitness function is always problem dependent. Once the genetic representation and the fitness function are defined, GA initializes a population of solutions randomly and then improves it through repeated application of the mutation, crossover, inversion and selection operators.

Selection
In the selection process, the individuals that will produce offspring are chosen. The selection step is preceded by the fitness assignment, which is based on the objective value. This fitness is used for the actual selection process.

Crossover
The main function of the crossover operation in genetic algorithms is to blend two chromosomes to generate novel offspring (children) [1]. Crossover occurs only with some probability (the crossover probability). Chromosomes that are not subjected to crossover remain unmodified. The idea behind crossover is the exploration of new solutions and the exploitation of old solutions. Chromosomes with better fitness have a greater chance of being selected than inferior ones, so good solutions survive to the next generation. Different crossover operators have been developed for various purposes. Single-point and multi-point crossover are the best-known operators. In this paper single-point crossover has been applied to produce new offspring.

Mutation
Mutation is a genetic operator that alters one or more gene values in a chromosome from its initial state. This can result in entirely new gene values being added to the gene pool. With these new gene values, the genetic algorithm may be able to arrive at a better solution than was previously possible.
First, the sensitive items and the number of modifications required for each sensitive item are initialized. Next, the fitness function is evaluated for each transaction. Based on these fitness values, the selection of transactions is carried out in the third step. After the selection process, frequent items are updated through the crossover operation. Crossover is the main process of the genetic algorithm, so in this step most of the frequent items become infrequent. The remaining items are modified in the mutation process. Once the condition is met, i.e. all the sensitive items have been modified, the process is complete and execution terminates. Finally, the Apriori algorithm is applied to the modified database to find the frequent item sets and generate the sensitive rules. At this point we have to ensure that all the sensitive rules are hidden, that no false rules are generated from the dataset, and that the non-sensitive items are not affected.
The rest of the paper is organized as follows: Section 2 gives a summary of the state-of-the-art methodologies and related works for privacy preserving in data mining and association rule hiding with dataset sanitization. Section 3 describes the problem formulation and explains the major concepts upon which we base the proposal for the new privacy preserving framework. Section 4 introduces our proposed solution for dataset sanitization against association rule mining. Section 5 presents the experiments we performed on large-scale datasets to demonstrate our approach and to prove the effectiveness of our method. Finally, the conclusion is given in Section 6.

2- RELATED WORK
Researchers have proposed several approaches for knowledge hiding in the context of association rule hiding. Chirag et al. in [2] introduced two heuristic blocking-based algorithms named ISARC (Increase Support of Common Antecedent of Rule Clusters) and DSCRC (Decrease Support of Common Consequent of Rule Clusters) to preserve privacy for sensitive association rules. The proposed algorithms cluster the sensitive rules based on some criteria and hide them in fewer selected transactions by using unknowns ("?"). They preserve a certain level of privacy for sensitive rules in the database while maintaining knowledge discovery.
A new multi-objective method for hiding sensitive association rules, based on the concept of genetic algorithms, was introduced in [3]. The main purpose of this method is to fully support the security of the database while keeping the utility and certainty of the mined rules at the highest level. In their work, the authors used four sanitization strategies: confidence, support, hybrid and max-min. They introduced the idea of both rule and item set sanitization, which complements the old idea behind data sanitization.
In [4], two algorithms were proposed: ISL (Increase Support of LHS) and DSR (Decrease Support of RHS). Predicting items are given as input to both algorithms to automatically hide sensitive association rules without pre-mining and selection of hidden rules.
In [5], two algorithms, DCIS (Decrease Confidence by Increase Support) and DCDS (Decrease Confidence by Decrease Support), were proposed to automatically hide
collaborative recommendation association rules without pre-mining and selection of hidden rules. The ISL and DCIS algorithms try to increase the support of the left-hand side of the rule, whereas the DSR and DCDS algorithms try to decrease the support of the right-hand side of the rule. It is observed that ISL requires more running time than DSR. The two algorithms also exhibit contrasting side effects. The DSR algorithm shows no hiding failure (0%), few new rules (5%) and some lost rules (11%). The ISL algorithm shows some hiding failure (12.9%), many new rules (33%) and no lost rules (0%). The DCIS algorithm requires more running time than DCDS. DCIS and DCDS also exhibit contrasting side effects, similar to the ISL and DSR algorithms. The DCDS algorithm shows no hiding failure (0%), few new rules (1%) and some lost rules (4%). The DCIS algorithm shows no hiding failure (0%), many new rules (75%) and no lost rules (0%).
In [6], an algorithm DSC (Decrease Support and Confidence) was proposed in which a pattern-inversion tree is used to store related information so that only one scan of the database is required. The proposed algorithm can automatically sanitize informative rule sets without pre-mining and selection of a class of rules under one database scan. With the DSC algorithm about 4% new rules are generated and about 9% of rules are lost, and it also shows hiding failure for two predicting items.
The border-based approach was presented in [7-9]. It hides sensitive association rules by modifying the borders in the lattice of the frequent and the infrequent itemsets of the original database. The itemsets that lie on the borderline separating the frequent and infrequent itemsets form the borders.
In [10, 11], the exact approach was provided. This approach contains non-heuristic algorithms which formulate the hiding process as a constraint satisfaction problem or an optimization problem that is solved by integer programming. These algorithms can provide an optimal hiding solution with ideally no side effects.
The related works described above use different performance metrics; most of them use hiding failure, dissimilarity, miss cost, artificial patterns and side effect.

Performance evaluation measures for the association

The efficiency of the association rule mechanisms can be characterized by the following measures:
Dissimilarity quantifies the difference between the original and the sanitized datasets by comparing their histograms, where the horizontal axis contains the items in the dataset and the vertical axis corresponds to their frequencies. It is calculated as follows:

$$ Diss(D, D') = \frac{1}{\sum_{i=1}^{n} f_D(i)} \times \sum_{i=1}^{n} \left[ f_D(i) - f_{D'}(i) \right] \tag{1} $$

where $f_D(i)$ and $f_{D'}(i)$ represent the frequency of the ith item in the datasets D and D' respectively, and n is the number of distinct items in the original dataset D.
Miss Cost (MC) quantifies the percentage of the non-restrictive patterns that are hidden as side effects of the sanitization process. It is computed as follows:

$$ MC = \frac{|{\sim}R_P(D)| - |{\sim}R_P(D')|}{|{\sim}R_P(D)|} \tag{2} $$

where $\sim R_P(D)$ is the set of all non-sensitive rules in the original database D and $\sim R_P(D')$ is the set of all non-sensitive rules in the sanitized database D'. Note that there is a compromise between the miss cost and the hiding failure: the more sensitive association rules need to be hidden, the more legitimate association rules are expected to be missed.
Similar to the miss cost, the Side-Effect Factor (SEF) is used to quantify the amount of non-sensitive association rules that are removed as an effect of the sanitization process. It is defined as follows:

$$ SEF = \frac{|P| - \left( |P'| + |R_P(D)| \right)}{|P| - |R_P(D)|} \tag{3} $$

Artificial patterns (AP) quantify the percentage of the discovered patterns that are artifacts. It is computed as follows:

$$ AP = \frac{|P'| - |P \cap P'|}{|P'|} \tag{4} $$

where P is the set of association rules discovered in the original dataset D and P' is the set of association rules discovered in D'.
Hiding Failure (HF) quantifies the percentage of the sensitive patterns that remain exposed in the sanitized dataset. It is defined as the number of restrictive association rules that appear in the sanitized database divided by the number that appeared in the original dataset, formally:

$$ HF = \frac{|R_P(D')|}{|R_P(D)|} \tag{5} $$

where $R_P(D')$ corresponds to the sensitive rules discovered in the sanitized dataset D' and $R_P(D)$ to the sensitive rules appearing in the original dataset D. Ideally, the hiding failure should be 0%. The performance metrics for privacy preserving association rule mining algorithms are given in [12].

3. PROBLEM FORMULATION
A sample transaction database D taken from [13] is shown in Table 1. TID is the unique transaction number. Each binary-valued item shows whether an item is present or absent in that transaction. Suppose MST and MCT are selected to be 50% and 70% respectively. Table 2 shows the sensitive rules satisfying MST, generated from the sample database D.
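
As a minimal sketch of the mining step described here, the following Python fragment enumerates the rules of the five-transaction sample database that meet MST = 50% and MCT = 70%. It is illustrative only: the function names (`support`, `frequent_itemsets`, `strong_rules`) are hypothetical, the thresholds are treated as inclusive (an assumption), and brute-force enumeration stands in for Apriori, which would return the same itemsets on a database this small.

```python
from itertools import combinations

# Sample database D, one set of item ids per transaction (TID 0..4).
TRANSACTIONS = [
    {0, 1, 3},
    {1},
    {0, 2, 3},
    {0, 1},
    {0, 1, 3},
]

MST = 0.5  # minimum support threshold from the text
MCT = 0.7  # minimum confidence threshold from the text


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, mst):
    """Brute-force enumeration of all itemsets with support >= mst.

    Apriori prunes candidates level by level; exhaustive search is an
    assumption made for brevity and is only viable for tiny examples.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= mst:
                result[frozenset(candidate)] = s
    return result


def strong_rules(transactions, mst, mct):
    """Return sorted (antecedent, consequent, support, confidence) tuples."""
    freq = frequent_itemsets(transactions, mst)
    rules = []
    for itemset, supp in freq.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so freq[lhs] is always present.
                conf = supp / freq[lhs]
                if conf >= mct:
                    rhs = itemset - lhs
                    rules.append(
                        (tuple(sorted(lhs)), tuple(sorted(rhs)), supp, conf)
                    )
    return sorted(rules)


if __name__ == "__main__":
    for lhs, rhs, supp, conf in strong_rules(TRANSACTIONS, MST, MCT):
        print(f"{lhs} -> {rhs}  support={supp:.0%}  confidence={conf:.0%}")
```

On these five transactions the sketch reports four rules, each with 60% support. Any subset of them could then be designated as sensitive and passed to the sanitization step.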

So, the possible association rules satisfying MST and MCT, generated by the Apriori algorithm [14], are given:        ,         ,         ,        . Suppose the rules          and          are specified as sensitive and should be hidden in the sanitized database.
The problem of privacy preserving in association rule mining (so-called association rule hiding) that this paper focuses on can be formulated as follows:
Given a transaction database D, a minimum support threshold (MST), a minimum confidence threshold (MCT), a set of significant association rules R mined from D and a set of sensitive rules to be hidden, generate a new database D' such that the non-sensitive rules in R can still be mined from D' under the same MST and MCT, no normal rules are falsely hidden (lost rules), and no extra spurious rules (ghost rules) are mistakenly mined after the rule hiding process.

                   Table 1: Sample database D
  TID              Item           Item (Binary Form)
   0               0, 1, 3                1101
   1               1                      0100
   2               0, 2, 3                1011
   3               0, 1                   1100
   4               0, 1, 3                1101

                     Table 2: Sensitive rules

4. PROPOSED SOLUTION

4.1 Security and Association rule Mining Trade
The association rule hiding problem can be considered a variation of the well-known database inference control problem in statistical and multilevel databases. The primary goal in database inference control is to guard access to sensitive information that can be obtained through non-sensitive data and inference rules. In association rule hiding, it is not the data itself but rather the sensitive association rules that cause a breach of privacy.
For simplicity of presentation and without loss of generality, we make the following assumptions in this implementation:
We want to extract all association rules which satisfy the minimum support threshold (MST) and the minimum confidence threshold (MCT). Support is a measure of the frequency of a rule. Confidence is a measure of the strength of the relation between the item sets of a rule. Association rule mining algorithms scan the database of transactions and calculate the support and confidence of the candidate rules to determine whether they are significant. A rule is significant if its support and confidence are higher than the user-specified minimum support and minimum confidence thresholds. In this way, algorithms do not retrieve all possible association rules that can be derived from a dataset, but only a small subset that satisfies the minimum support and minimum confidence requirements set by the users.
The Apriori association rule mining algorithm works as follows. It finds all the sets of items that appear frequently enough to be considered relevant and then derives from them the association rules that are strong enough to be considered interesting. The major goal here is to prevent some of these rules, which we refer to as "sensitive rules", from being revealed. We want to hide association rules in the best way by using a multi-objective genetic algorithm. We are also interested in investigating the performance of association rule hiding in terms of hiding failure (HF), dissimilarity (DIS), artificial patterns (AP), side effect (SEF) and miss cost (MC).
Figure 1 presents the basic architecture of a database system with the association rule mechanism.

Figure 1 Architecture of a database application with the association rule procedure

4.2 Security and Association Rule Mining Trade using Optimization
In this paper we study the privacy breaches incurred by certain types of association rules. In doing so, we suppose that a certain subset of the association rules extracted from specific datasets is considered sensitive/critical. Our major goal is then to modify the original data source in such a way that it is impossible for an adversary to mine the sensitive rules from the modified data set, while our minor goal is that all the remaining non-sensitive information and/or knowledge remains as close as possible to that of the original set.
The method developed in this paper takes a binary transactional dataset as input and modifies the original dataset, based on the concept of genetic algorithms for privacy preserving of association rules, to find the best solution for sanitizing the original dataset through multi-objective optimization, in such a way that all of the sensitive rules become hidden and minimum modification is performed
in the original dataset. The best-known style of transaction modification is distortion of the original database (i.e., replacing 1's by 0's and vice versa). We select this style of modification in our method. Modification of the dataset causes many side-effect problems.
The modification process can affect the original set of rules that can be mined from the original database, either by hiding rules which are not sensitive (lost rules) or by introducing rules in the mining of the modified database which were not supported by the original database (ghost rules). We have tried to minimize these unwanted results by minimal and suitable modification of the original dataset. The steps of our work are explained in Figure 2.

  Figure 2 Multi objectives privacy preserving (MOPP)

The following steps illustrate the methodology of the proposed solution:
Step 1: Consider a transactional database with a set of items and transactions.
Step 2: Write two external files, one for the original data set and one for the sensitive rules.
Step 3: Convert every chromosome to a double value and store it in the population, then convert that value to binary.
Step 4: Create a file for the Apriori algorithm.
Step 5: Use the Apriori algorithm to find the frequent item sets based on the minimum support threshold.
Step 6: From the frequent item sets, generate the set of association rules based on the minimum support and confidence thresholds.
Step 7: Select the sensitive rules from the set of association rules.
Step 8: Read the association rules from the output file and put them in a structure for comparison with the sensitive rules.
Step 9: Compare the association rules with the sensitive rules to calculate Fitness Vector (1).
Step 10: Compare the chromosome with the original dataset to calculate Fitness Vector (2).
Step 11: Use the genetic algorithm to modify the items based on the fitness function.
Step 12: Repeat steps 5, 6 and 7 for the modified data set.
Step 13: Verify that (i) all the sensitive rules are hidden, (ii) no non-sensitive rules are hidden and (iii) there are no false rules.

To express these activities mathematically, the multi-objective optimization problem can be formulated as follows:
Find the vector $y = [y_1, y_2, \ldots, y_n]^T$ which satisfies the m inequality constraints and the p equality constraints:

$$ g_i(x) \ge 0, \quad i = 1, 2, \ldots, m \tag{6} $$

$$ h_i(x) = 0, \quad i = 1, 2, \ldots, p \tag{7} $$

and optimizes (here we assume minimization) the vector function:

$$ f(x) = [f_1(x), f_2(x), \ldots, f_k(x)]^T \tag{8} $$

where $x = [x_1, x_2, \ldots, x_n]^T$ is the vector of decision variables, and the constraints given by equations (6) and (7) define the feasible region F.

Traditional Technique
Convert the multi-objective optimization problem into a single-objective problem, i.e. find one optimal solution by combining the objectives through weighting:

$$ F = w_1 f_1(x) + w_2 f_2(x) + \cdots + w_k f_k(x) $$

where $w_1 + w_2 + \cdots + w_k = 1$.

Proposed technique
Keep the problem as a multi-objective optimization problem, i.e. find the Pareto optimal solutions.
We say that a vector of decision variables $y \in F$ is Pareto optimal if there is no other $x \in F$ such that $f_i(x) \le f_i(y)$ for all i = 1, ..., k and $f_j(x) < f_j(y)$ for at least one j.
A vector $u = (u_1, u_2, \ldots, u_k)$ is said to dominate $v = (v_1, v_2, \ldots, v_k)$ (denoted by $u \preceq v$) if and only if u is partially less than v, i.e.

$$ \forall i \in \{1, 2, \ldots, k\}: u_i \le v_i \;\wedge\; \exists i \in \{1, 2, \ldots, k\}: u_i < v_i $$

Our fitness vector consists of two elements:

$$ f_1 = \text{Hiding Failure} = \frac{|R_{Sen}(D')|}{|R_{Sen}(D)|} \tag{9} $$

$$ f_2 = \text{Dissimilarity} = \frac{1}{\sum_{i=1}^{n} f_D(i)} \times \sum_{i=1}^{n} \left[ f_D(i) - f_{D'}(i) \right] \tag{10} $$

where $f_D(i)$ and $f_{D'}(i)$ represent the frequency of the ith item in the datasets D and D' respectively, and n is the number of distinct items in the original dataset D.
Further, we can choose from a menu of optimal solutions according to our problem.
The main contributions are focused on three points: first, a new proposed algorithm for hiding sensitive association rules using a multi-objective genetic algorithm and a modification of the old mathematical model; the second contribution is achieving balance between security and performance in

Volume 2, Issue 3 May – June 2013                                                                                                 Page 411
   International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
       Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 2, Issue 3, May – June 2013                                             ISSN 2278-6856

database, the last point of the contribution is evaluation of                                        (hiding failure, dissimilarity, artificial pattern, side effect,
hiding performance in our work.                                                                      miss cost).
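The dominance relation and the contrast between the weighted-sum and Pareto techniques can be illustrated with a short Python sketch. It is not the authors' implementation: the candidate fitness vectors (hiding failure, dissimilarity) and the weights are made-up values, and both objectives are assumed to be minimised.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """u dominates v (minimisation): u_i <= v_i for all i and u_i < v_i for some i."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def pareto_front(points: List[Vector]) -> List[Vector]:
    """Keep every point that is not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


def weighted_sum_best(points: List[Vector], weights: Sequence[float]) -> Vector:
    """Traditional technique: collapse the objectives into one weighted score."""
    return min(points, key=lambda p: sum(w * f for w, f in zip(weights, p)))


# Hypothetical (hiding failure, dissimilarity) pairs for five sanitized databases.
candidates = [(0.0, 0.30), (0.5, 0.10), (0.5, 0.20), (1.0, 0.05), (1.0, 0.40)]

front = pareto_front(candidates)
best_equal = weighted_sum_best(candidates, (0.5, 0.5))
best_skewed = weighted_sum_best(candidates, (0.1, 0.9))

print("Pareto front:", front)
print("Weighted sum, w = (0.5, 0.5):", best_equal)
print("Weighted sum, w = (0.1, 0.9):", best_skewed)
```

In this toy data the weighted sum returns a single candidate whose identity changes with the chosen weights, whereas the Pareto filter keeps all three non-dominated candidates for the decision maker to choose from.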

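Steps 5 to 7 and step 13 of the methodology depend on the support and confidence of a rule relative to the minimum support threshold (MST) and the minimum confidence threshold (MCT). The following minimal sketch shows how modifying one transaction can push a rule below the thresholds. The item names and the four transactions are hypothetical toy data, not taken from the voting data set, and the function names are chosen only for illustration.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions: List[Transaction],
               antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    s_ante = support(transactions, antecedent)
    if s_ante == 0:
        return 0.0
    return support(transactions, antecedent | consequent) / s_ante


def is_strong(transactions: List[Transaction],
              antecedent: FrozenSet[str],
              consequent: FrozenSet[str],
              mst: float, mct: float) -> bool:
    """A rule is reported only if it meets both thresholds."""
    return (support(transactions, antecedent | consequent) >= mst
            and confidence(transactions, antecedent, consequent) >= mct)


a, b = frozenset({"a"}), frozenset({"b"})

# Toy original database: the rule a -> b has support 0.5 and confidence 2/3.
original = [frozenset({"a", "b"}), frozenset({"a", "b", "c"}),
            frozenset({"a"}), frozenset({"b", "c"})]

# Toy sanitized database: item "b" was removed from the first transaction.
sanitized = [frozenset({"a"}), frozenset({"a", "b", "c"}),
             frozenset({"a"}), frozenset({"b", "c"})]

MST, MCT = 0.25, 0.58  # thresholds used in the experimental setup
print(is_strong(original, a, b, MST, MCT))   # rule is mined from the original data
print(is_strong(sanitized, a, b, MST, MCT))  # rule is hidden after the modification
```

Removing a single item lowers the confidence of the rule from 2/3 to 1/3, which is below MCT, so the rule no longer appears when the mining step is repeated on the modified data.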
5. DISCUSSION AND EXPERIMENTAL RESULTS

  5.1 Experimental Setup
The Congress Voting data set [15] includes the votes of each U.S. House of Representatives Congressman on the 16 key votes identified by the CQA. The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea); voted against, paired against, and announced against (these three simplified to nay); and voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition). Number of instances: 435 (267 democrats, 168 republicans). Number of attributes: 16 + class name = 17 (all Boolean valued).
A sample transaction database D taken from [15] is shown in Table (3), where TID is the unique transaction number. Suppose MST and MCT are set to 25% and 58% respectively.

                         Table 3: Sample data set
  TID    Class Name
  1      republican     N    Y    N    y    Y    Y
  2      republican     N    Y    N    y    Y    Y
  3      democrat       ?    Y    Y    ?    Y    Y
  4      democrat       N    Y    Y    n    ?    Y
  5      democrat       Y    Y    Y    n    Y    Y
  6      democrat       N    Y    Y    n    Y    Y
  7      democrat       N    Y    N    y    Y    Y
  8      republican     N    Y    N    y    Y    Y
  9      republican     N    Y    N    y    Y    Y
  10     democrat       Y    Y    Y    n    N    N

  5.2 Association Rules Mining Methodology using optimization
Table (4) shows the best rules satisfying MST and MCT generated from the sample database D. The number of association rules satisfying MST and MCT generated by the Apriori algorithm is 20. Suppose the rule (el-salvador-aid=y 212 → religious-groups-in-schools=y 197) is specified as sensitive and should be hidden in the sanitized database. The transactions that contain the sensitive items form the population, and the fitness function is applied to the chromosomes of this population. After the crossover and mutation operations are applied, the sensitive items of the original database are modified, based on the fitness function, to preserve the privacy of the database. After the modification, the Apriori algorithm is applied again with the same support and confidence thresholds to verify that all the sensitive rules are hidden. Then we evaluated the performance and security metrics (hiding failure, dissimilarity, artificial pattern, side effect, miss cost).

  Table 4: Best rules extracted from the original dataset with MCT = 0.58 and MST = 0.25
  TID   Rules
  1     adoption-of-the-budget-resolution=y physician-fee-freeze=n 219 → Class Name=democrat
  2     adoption-of-the-budget-resolution=y physician-fee-freeze=n aid-to-nicaraguan-contras=y 198 → Class Name=democrat
  3     physician-fee-freeze=n aid-to-nicaraguan-contras=y 211 → Class Name=democrat 210
  4     physician-fee-freeze=n education-spending=n 202 → Class Name=democrat 201
  5     physician-fee-freeze=n 247 → Class Name=democrat 245
  6     Class Name=democrat el-salvador-aid=n 200 → aid-to-nicaraguan-contras=y 197
  7     el-salvador-aid=n 208 → aid-to-nicaraguan-contras=y 204
  8     el-salvador-aid=y 212 → religious-groups-in-schools=y 197

  Table 5: Association rule hiding evaluation results
  Parameter   Result
  HF          0%
  MC          36%
  AP          27%
  DISS        26%
  SEF         4

As shown in Table (5) and figure (3.a), the number of sensitive rules in the sanitized data set is zero. Most of the developed privacy preserving algorithms are designed with the goal of obtaining zero hiding failure; accordingly, we hide all the patterns considered sensitive in the original data set. The number of non-sensitive patterns discovered from the original database D differs from the number discovered from the sanitized database, since the sanitization that hides the sensitive patterns also affects non-sensitive ones; thus the MC is equal to 36%, as shown in figure (3.b). The percentage of the discovered patterns that are artifacts (AP) is 27%, as shown in figure (3.c). The dissimilarity (DISS) between the original and the sanitized datasets is 26%, as shown in figure (3.d). The number of non-sensitive association rules removed as an effect of the sanitization process (SEF) is four, as shown in figure (3.e).
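To make the evaluation metrics concrete, the sketch below computes them from sets of rules mined before and after sanitization. Hiding failure and dissimilarity follow equations (9) and (10). The formulas used for miss cost, artificial patterns, and side effect are the definitions commonly used in the privacy-preserving data mining literature and are an assumption here, since this section does not restate them. The rule labels and item frequencies are hypothetical.

```python
from typing import Dict, Set


def hiding_failure(sensitive: Set[str], rules_after: Set[str]) -> float:
    """Eq. (9): share of sensitive rules still mined from the sanitized data."""
    return len(sensitive & rules_after) / len(sensitive)


def miss_cost(rules_before: Set[str], rules_after: Set[str], sensitive: Set[str]) -> float:
    """Assumed definition: share of non-sensitive rules lost by sanitization."""
    non_sensitive = rules_before - sensitive
    return len(non_sensitive - rules_after) / len(non_sensitive)


def artificial_patterns(rules_before: Set[str], rules_after: Set[str]) -> float:
    """Assumed definition: share of rules in the sanitized data that are new artifacts."""
    return len(rules_after - rules_before) / len(rules_after)


def side_effect(rules_before: Set[str], rules_after: Set[str], sensitive: Set[str]) -> int:
    """Number of non-sensitive rules removed by the sanitization process."""
    return len((rules_before - sensitive) - rules_after)


def dissimilarity(freq_before: Dict[str, int], freq_after: Dict[str, int]) -> float:
    """Eq. (10): summed absolute item-frequency change over total original frequency."""
    total = sum(freq_before.values())
    changed = sum(abs(freq_before[i] - freq_after.get(i, 0)) for i in freq_before)
    return changed / total


# Hypothetical rule sets: r1 is sensitive, r6 is an artifact created by sanitization.
rules_before = {"r1", "r2", "r3", "r4", "r5"}
sensitive = {"r1"}
rules_after = {"r2", "r3", "r6"}

# Hypothetical item frequencies in the original and sanitized databases.
freq_before = {"a": 10, "b": 6, "c": 4}
freq_after = {"a": 10, "b": 4, "c": 4}

hf = hiding_failure(sensitive, rules_after)
mc = miss_cost(rules_before, rules_after, sensitive)
ap = artificial_patterns(rules_before, rules_after)
sef = side_effect(rules_before, rules_after, sensitive)
diss = dissimilarity(freq_before, freq_after)
print(hf, mc, ap, sef, diss)
```

The functions assume non-empty rule sets and a non-zero total frequency; a production version would need to guard those divisions.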


6. Conclusion
The drawback of the traditional techniques in [2, 3, 4, 5 and 6] is that the weight values are unknown and therefore have to be assumed in advance. Also, there is no guarantee of achieving a low hiding failure together with high performance. Moreover, these methods add more adjustable parameters, which require profound domain knowledge that is usually not available. In addition, the solutions generated in [2, 3, 4, 5 and 6] are usually very sensitive to small changes in these weights or penalty functions.
The proposed approach addresses the problem of balancing security and performance metrics in a generic way, since we optimize between hiding failure as the security overhead and (AP), (Diss), (SEF) and (MC) as database performance metrics.
The approach generates multiple solutions rather than a single biased solution.
This approach could be used in a tailored fashion, especially in military applications or in civilian applications with dynamic policies concentrating on either the security or the performance or both.

References
[1] D. Whitley, “A genetic algorithm tutorial”, Colorado State University, 1994.
[2] M. Chirag, et al., “An Efficient Solution for Privacy Preserving Association Rule Mining”, (IJCNS) International Journal of Computer and Network Security, Vol. 2, No. 5, May 2010.
[3] M. Dehkordi, “A Novel Method for Privacy Preserving in Association Rule Mining Based on Genetic Algorithms”, Journal of Software (JSW), Vol. 4, No. 6, 2009.
[4] S. Wang, B. Parikh, A. Jafari, “Hiding informative association rule sets”, Elsevier, Expert Systems with Applications, pp. 316–323, 2006.
[5] S. Wang, D. Patel, et al., “Hiding collaborative recommendation association rules”, Springer Science Business Media, LLC, 2007.
[6] S. Wang, R. Maskey, et al., “Efficient sanitization of informative association rules”, ACM, Expert Systems with Applications: An International Journal, Vol. 35, Issue 1-2, July 2008.
[7] G. Moustakides, V. Verykios, “A maxmin approach for hiding frequent itemsets”, Data and Knowledge Engineering, pp. 75–89, 2008.
[8] G. Moustakides, V. S. Verykios, “A max–min approach for hiding frequent itemsets”, In Workshops Proceedings of the 6th IEEE International Conference on Data Mining (ICDM), pp. 502–506, 2006.
[9] X. Sun, P. Yu, “Hiding sensitive frequent itemsets by a border–based approach”, Computing Science and Engineering, pp. 74–94, 2007.
[10] A. Divanis, V. Verykios, “An Integer Programming Approach for Frequent Itemset Hiding”, In Proc. ACM Conf. Information and Knowledge Management (CIKM ’06), Nov. 2006.
[11] A. Divanis, V. Verykios, “Exact Knowledge Hiding through Database Extension”, IEEE Transactions on Knowledge and Data Engineering, Vol. 21(5), pp. 699–713, May 2009.
[12] C. Aggarwal, P. Yu, “Privacy-Preserving Data Mining: Models and Algorithms”, Springer, Heidelberg, pp. 267–286, 2008.
[13] K. Duraiswamy, D. Manjula, “Advanced Approach in Sensitive Rule Hiding”, Modern Applied Science, Vol. 3, No. 2, 2009.
[14] C. Clifton, M. Kantarcioglu, J. Vaidya, “Defining Privacy for Data Mining”, In Proceedings US Nat'l Science Foundation Workshop on Next Generation Data Mining, pp. 126–133, 2002.
[15] J. Schlimmer, “Concept acquisition through representational adjustment”, Doctoral dissertation, Department of Information and Computer Science, University of California, Irvine, CA, 1987.
