Privacy Preserving Classification of heterogeneous Partition Data through ID3 Technique

					   International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
       Web Site: Email:,
Volume 1, Issue 4, November – December 2012                                    ISSN 2278-6856

                Privacy Preserving Classification of
            heterogeneous Partition Data through ID3
                                                     Saurabh Karsoliya
                                             B.Tech. (CSE) MANIT, Bhopal, M.P., INDIA

Abstract: The goal of data mining is to extract knowledge from large amounts of data, and several data mining classification techniques are used for this purpose. The ID3 algorithm is a widely used technique in this area. Here, ID3 classifies data by creating a decision tree over heterogeneously partitioned data. In this paper we consider vertically partitioned microarray data and preserve privacy through privacy-preserving methods, i.e. secure multi-party computation. Such data is often collected by several different sites, and privacy, legal and commercial concerns restrict centralized access to it. Privacy-preserving methods enable the secure mining of knowledge under these restrictions. We focus on the problem of decision tree learning with the popular ID3 algorithm. We assume that the database is vertically partitioned into two pieces. The database considered is microarray data that is heterogeneously partitioned.
Keywords: Privacy Preserving, ID3, Decision tree, Classification, Microarray Data.

1. INTRODUCTION

In data mining, knowledge is extracted through different techniques such as classification, clustering and association. The ID3 algorithm is a standard, popular, and simple method for data classification and decision tree creation. It was developed by J. R. (Ross) Quinlan [3].
Since privacy-preserving data mining should be taken into consideration, several secure multi-party computation protocols have been presented based on this technique [2].
In this paper all extracted knowledge is expressed as a decision tree, and the input for decision tree creation is the microarray data. A decision tree is a rooted tree containing nodes and edges. Each internal node is a test node and corresponds to an attribute; the edges leaving a node correspond to the possible values taken on by that attribute. For example, the attribute “Home-Owner” would have two edges leaving it, one for “Yes” and one for “No”. Finally, the leaves of the tree contain the expected class value for transactions matching the path from the root to that leaf [3]. The basic building block of the ID3 algorithm is the computation of entropy or the Gini index, which drives the creation of the tree [3, 4].
There are two main operations during tree building to obtain the information gain:

Step 1: Evaluation of splits for each attribute and selection of the best split.
Step 2: Creation of partitions using the best split. Having determined the overall best split, partitions can be created by a simple application of the splitting criterion to the data.

Entropy and the Gini index are two measures used to compute the information gain at each step when producing a decision tree. The Gini index, however, has been less studied in privacy-preserving data mining for classifying microarray data.

The formulas used for calculating entropy and the Gini index are as follows:

$$\mathrm{Entropy}(S) = -\sum_{j} P_j \log_2 P_j$$

$$\mathrm{Gini}(S) = 1 - \sum_{j} P_j^2$$

where P_j is the relative frequency of class j in S. Based on the entropy or the Gini index, we can compute the information gain if attribute A is used to partition the data set S:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in A} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where v ranges over the possible values of attribute A; S_v is the subset of S for which attribute A has value v; |S_v| is the number of elements in S_v; and |S| is the number of elements in S. When the Gini index is used, Gini(S) and Gini(S_v) take the place of Entropy(S) and Entropy(S_v).
With the Gini index, splits tend to be made such that the largest class goes into one pure node while the other classes go into the other node. Entropy normally tries to create a balanced tree. In this paper, we propose how Gini can be used in privacy-preserving classification of DNA microarray data with the ID3 algorithm to create a decision tree. ID3 works iteratively, using a top-down approach in which all training cases initially belong to a single root node, which is then successively split to form a tree.

Building of decision tree with ID3 algorithm
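Before the tree-building steps, the split measures defined above can be made concrete with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation used in this paper: the function names, the dictionary-based row format and the toy gene-expression records are all hypothetical, and the computation is centralized with no privacy protection. The small recursive `id3` function follows the usual pattern of picking the attribute with the highest gain, splitting on its values, and recursing on each subset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S) = -sum_j P_j * log2(P_j) over the class frequencies."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini(S) = 1 - sum_j P_j^2 over the class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, attr, target, impurity=entropy):
    """Impurity of S minus the size-weighted impurity of each subset S_v."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr], []).append(row[target])
    weighted = sum(len(s) / len(rows) * impurity(s) for s in subsets.values())
    return impurity([row[target] for row in rows]) - weighted


def id3(rows, attrs, target, impurity=entropy):
    """Return a nested dict tree; leaves are class labels."""
    labels = [row[target] for row in rows]
    # Stop when the node is pure or no attributes remain (majority class).
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, a, target, impurity))
    remaining = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, remaining, target, impurity)
    return tree


# Hypothetical toy data: discretized expression levels for two genes.
rows = [
    {"gene_a": "high", "gene_b": "low", "cls": "tumor"},
    {"gene_a": "high", "gene_b": "high", "cls": "tumor"},
    {"gene_a": "low", "gene_b": "low", "cls": "normal"},
    {"gene_a": "low", "gene_b": "high", "cls": "normal"},
]
tree = id3(rows, ["gene_a", "gene_b"], "cls")
```

Passing `impurity=gini` to `information_gain` or `id3` switches the split measure from entropy to the Gini index without changing the rest of the procedure.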

Step 1: Select the attribute with the most information gain.
Step 2: Create a subset for each value of the attribute.
Step 3: For each subset, if not all the elements of the subset belong to the same class, repeat steps 1-3 for the subset.

Empirical evidence suggests that a correct decision tree is usually found more quickly by this iterative method than by forming a tree directly from the entire training set. ID3 was designed for the situation where there are many attributes and the training set contains many objects, but where a reasonably good decision tree is required without much computation. DNA microarray data fits this situation. In a DNA microarray, a typical glass slide is used on which DNA molecules are fixed in an orderly manner at specific locations called spots (or features). A microarray may contain thousands of spots, and each spot may contain a few million copies of identical DNA molecules that uniquely correspond to a gene [5].

The DNA in a spot may either be genomic DNA or a short stretch of oligonucleotide strands that correspond to a gene. The spots are printed onto the glass slide by a robot or are synthesized by the process of photolithography [5, 6]. Microarrays may be used to measure gene expression in many ways, but one of the most popular applications is to compare the expression of a set of genes from a cell maintained in a particular condition to the same set of genes from a reference cell. ID3 belongs to the family of algorithms for top-down induction of decision trees.

The DNA microarray data classification is done in such a way that the involved parties can jointly compute the gain value of each normal attribute without revealing their own private information to each other, while the database is vertically partitioned over two or more parties.

Microarrays have opened the possibility of creating data sets of molecular information to represent many systems of biological or clinical interest. Gene expression profiles can be used as inputs to large-scale data analysis, for example, to serve as fingerprints to build more accurate molecular classification, to discover hidden taxonomies or to increase our understanding of normal and disease states.

The main types of data analysis needed for biomedical applications include:
     • Gene Selection: in data mining terms this is a process of attribute selection, which finds the genes most strongly related to a particular class. This method is also used on DNA microarray data.
     • Classification: classifying diseases or predicting outcomes based on gene expression patterns, and perhaps even identifying the best treatment for a given genetic signature.
     • Clustering: finding new biological classes or refining existing ones.

Classification involves finding rules that partition the data into disjoint groups. The input for classification is the training data set, whose class labels are already known. Classification analyzes the training data set and constructs a model based on the class label. It is a kind of supervised learning because the class field is known. Real-life examples of classification include: the diagnosis of a medical condition from symptoms, in which the classes could be either the various disease states or the possible therapies; determining the game-theoretic value of a chess position, with the classes won for white, lost for white, and drawn; and deciding from atmospheric observations whether a severe thunderstorm is unlikely, possible or probable.

Because a microarray dataset has many more features than records, common statistical and machine learning procedures such as classification can lead to false discoveries due to random chance.

Reference [2] highlights common errors in identifying informative features and developing accurate classifiers, and shows the correct approach. Reference [3] presents a review of methods available in microarray classification, which cover the full spectrum of microarray data analysis, including data preprocessing, experimental design, quality control, gene selection and differential expression analysis, classification, and clustering. One would expect that different datasets representing the same biological system will display some amount of “invariant” biological characteristics independent of the idiosyncrasies or details of the sample sources, the preparation procedures and the technological platforms used to obtain the data.

These invariant biological characteristics, when properly captured and exposed, can provide the basis to build more robust, general and accurate classification models. An approach to classifying heterogeneous samples based on impact factors (IFs) addresses this problem. The IFs provide a way to measure the variations between individual classes in train and test samples and can be integrated into standard classifiers such as Weighted Voting or k-NN, resulting in a significant improvement in the accuracy of classifying heterogeneous samples.

2. RELATED WORK

In data mining, knowledge is extracted through different techniques such as classification, clustering and association. Early work in the field of privacy-preserving data mining proposed a solution to the privacy
preserving classification problem using the oblivious transfer protocol, a powerful tool developed in secure multi-party computation studies [4]. The solution, however, deals only with horizontally partitioned data and targets only the ID3 algorithm (because it only emulates the computation of the ID3 algorithm).

Another approach for solving the privacy preserving classification problem was proposed and studied in [4, 6]. In this approach, each individual data item is perturbed and the distribution of all the data is reconstructed at an aggregate level. The technique works for those data mining algorithms that use probability distributions rather than individual records. An example of a classification algorithm which uses such aggregate information is also discussed in [7].

There has also been research on preserving privacy for other types of data mining. For instance, a solution to the privacy preserving distributed association mining problem is discussed in [6].

Secure Multi-party Computation. The problem we are studying is actually a special case of a more general problem, the Secure Multi-party Computation (SMC) problem. Briefly, an SMC problem deals with computing any function on any input, in a distributed network where each participant holds one of the inputs, while ensuring that no more information is revealed to a participant in the computation than can be inferred from that participant’s input and output [8]. The SMC problem literature is extensive, having been introduced by [7] and expanded in [6, 9]. It has been proved that for any function, there is a secure multi-party computation solution [4].
   The approach used is as follows: the function F to be computed is first represented as a combinatorial circuit, and then the parties run a short protocol for every gate in the circuit. Every participant gets corresponding shares of the input wires and the output wires for every gate. This approach, though appealing in its generality and simplicity, means that the size of the protocol depends on the size of the circuit, which depends on the size of the input. This is highly inefficient for large inputs, as in data mining [8]. It is well accepted that for special cases of computations, special solutions should be developed for efficiency reasons. Existing work considers either horizontal or vertical partitioning; we propose to consider the vertical partitioning of DNA microarray data for ID3 classification while also preserving privacy.

In horizontal partitioning (a.k.a. homogeneous distribution), different sites collect the same set of information, but about different entities. An example of that would be grocery shopping data collected by different supermarkets (also known as market-basket data in the data mining literature) [11]. The figure below illustrates horizontal partitioning and shows the credit card databases of two different (local) credit unions. Taken together, one may find that fraudulent customers often have similar transaction histories, etc. Horizontally partitioned data is data which is homogeneously distributed, meaning that all data tuples are defined over the same item or feature set. Essentially this boils down to different data sites collecting the same kind of information over different individuals. For horizontally partitioned data, the database scheme looks like Figure 3.1, shown below.

                              Figure 3.1

Example: Consider for instance a supermarket chain which gathers information on the buying behavior of its customers. Typically, such a company has different branches, implying that the data is horizontally distributed.

Horizontal partitioning involves putting different rows into different tables. Perhaps customers with ZIP codes less than 50000 are stored in Customers-East, while customers with ZIP codes greater than or equal to 50000 are stored in Customers-West. The two partition tables are then Customers-East and Customers-West, while a view with a union might be created over both of them to provide a complete view of all customers.

In this paper we consider heterogeneously distributed data, also known as vertically partitioned data. In a database system, a database can be partitioned in different ways: horizontal partitioning, vertical partitioning, and grid partitioning, which is the combination of both horizontal and vertical partitioning.
For vertically partitioned data, the database scheme looks like Figure 3.2, shown below.
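The difference between the two partitioning schemes, and the kind of joint quantity a privacy-preserving ID3 needs under vertical partitioning, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the records, column names and helper functions are hypothetical, and the joint count is computed in the clear. In a privacy-preserving setting the final dot product would have to be replaced by a secure protocol (for example a secure scalar product), which this sketch does not implement.

```python
# Hypothetical toy records; names and values are illustrative only.
records = [
    {"id": 1, "gene_a": "high", "gene_b": "low", "cls": "tumor"},
    {"id": 2, "gene_a": "high", "gene_b": "high", "cls": "normal"},
    {"id": 3, "gene_a": "low", "gene_b": "low", "cls": "normal"},
    {"id": 4, "gene_a": "high", "gene_b": "low", "cls": "tumor"},
]


def horizontal_partition(rows, predicate):
    """Split by rows: both sites keep every column, for different entities."""
    first = [r for r in rows if predicate(r)]
    second = [r for r in rows if not predicate(r)]
    return first, second


def vertical_partition(rows, key, cols_a, cols_b):
    """Split by columns: both sites keep every entity (joined on key)."""
    site_a = [{key: r[key], **{c: r[c] for c in cols_a}} for r in rows]
    site_b = [{key: r[key], **{c: r[c] for c in cols_b}} for r in rows]
    return site_a, site_b


def indicator(rows, col, value):
    """0/1 vector marking the records where column col equals value."""
    return [1 if r[col] == value else 0 for r in rows]


east, west = horizontal_partition(records, lambda r: r["id"] <= 2)
site_a, site_b = vertical_partition(records, "id", ["gene_a"], ["gene_b", "cls"])

# Site A knows gene_a, site B knows the class label. The count of records
# with gene_a == "high" AND cls == "tumor" (needed for the information gain)
# is the dot product of two locally computed indicator vectors.
# NOTE: computed in the clear here, for illustration only.
x = indicator(site_a, "gene_a", "high")
y = indicator(site_b, "cls", "tumor")
joint_count = sum(a * b for a, b in zip(x, y))
```

Neither site can compute `joint_count` alone, because each holds only one of the two indicator vectors; this is why the gain computation under vertical partitioning requires cooperation between the parties.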

Volume 1, Issue 4 November - December 2012                                                                        Page 137
   International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
       Web Site: Email:,
Volume 1, Issue 4, November – December 2012                                    ISSN 2278-6856

                             Figure 3.2

Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns. Database normalization also involves this splitting of columns across tables, but vertical partitioning goes beyond that and partitions columns even when already normalized. Different physical storage might be used to realize vertical partitioning as well; storing infrequently used or very wide columns on a different device, for example, is a method of vertical partitioning. Done explicitly or implicitly, this type of partitioning is called "row splitting" (the row is split by its columns). A common form of vertical partitioning is to split (slow to find) dynamic data from (fast to find) static data in a table where the dynamic data is not used as often as the static data. Creating a view across the two newly created tables restores the original table with a performance penalty; however, performance will increase when accessing the static data, e.g. for statistical analysis.

Vertically distributed data is data which is heterogeneously distributed. Basically this means that data is collected by different sites or parties on the same individuals but with differing item or feature sets. Consider for instance financial institutions such as banks and credit card companies: they both collect data on customers having a credit card, but with differing item sets. Vertical partitioning is also known as heterogeneous distribution of data, which implies that though different sites gather information about the same set of entities, they collect different feature sets.

4. IMPLEMENTATION

To check the performance of the proposed algorithm, four different datasets are used to see how much communication overhead is caused by the proposed algorithm and by the algorithms of [1, 2].

5. EXPERIMENTAL SETUP

For testing the proposed algorithm four different datasets were used, including the DNA dataset taken from the UCI Machine Learning Repository [11]. The DNA dataset consists of 150 entities, 3 classes and 4 attributes for each entity. The results of the comparison on the DNA dataset are shown in Table 5.1 and in the graph of Figure 5.1 below.

                              Table 5.1

  No. of DNA pairs   Horizontal   Vertical   Proposed heterogeneous-partition-based method
  6                  0.11         0.034      0.107
  8                  0.22         0.103      0.334
  10                 0.33         0.206      0.452
  12                 0.52         0.343      0.674

                              Figure 5.1

6. CONCLUSION

Microarrays are a revolutionary new technology with great potential to provide accurate medical diagnostics, help find the right treatment and cure for many diseases, and provide a detailed genome-wide molecular portrait of cellular states. By considering the vertical partitioning of the data, a good decision tree can be created using the ID3 classification algorithm, so that accurate medical decisions and diagnostics can be made to provide better cures for diseases by creating a decision tree on the basis of the genes.
Finding new insights into the molecular basis of biological processes and searching for new drugs and treatments is a problem of high complexity, and one where the techniques of molecular biology have been applied for many decades. The process is analogous to a large search for a few molecular entities, connections or relationships in a large sea of possibilities.
We hope that this special issue on Microarray Data Mining will make more researchers interested in the field and its challenges and will be a contribution towards realizing the potential of microarrays for biology and medicine.

REFERENCES

[1] M.C. Doganay, T.B. Pedersen, Y. Saygin, E. Savas and A. Levi, "Distributed privacy preserving k-means clustering with additive secret sharing," Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society, PAIS '08, pp. 6003-6011, Mar 2008.
[2] J. Vaidya and C. Clifton, "Privacy-preserving k-means clustering over vertically partitioned data," Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, USA '03, pp. 206-215, Dec 2003.
[3] R. Agrawal and R. Srikant, "Privacy-preserving data mining," Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, USA, pp. 439-450, Mar 2000.
[4] Margaret H. Dunham, Data Mining - Introductory and Advanced Concepts, Pearson Education, 2006.
[5] H. Kargupta, S. Datta, Q. Wang and K. Sivakumar, "Random-data perturbation techniques and privacy-preserving data mining," IEEE conference on Knowledge and Information system on data mining, London, pp. 387-414, Sep 2004.
[6] S.V. Kaya, T.B. Pedersen, E. Savas and Y. Saygin, "Efficient privacy-preserving distributed clustering based on secret sharing," in PAKDD 2007 International Workshops: Emerging Technologies in Knowledge Discovery and Data Mining, Springer, pp. 280-291, Mar 2007.
[7] Random-permutation: /Random Permutation.
[8] Pascal Paillier, "Public key cryptosystem based on composite degree residuosity class," Advances in Cryptology, EUROCRYPT 99, International Conference on Theory and Application of Cryptographic Techniques, pp. 223-238, May 1999.
[9] J. Vaidya and C. Clifton, "Privacy-preserving association rules in vertically partitioned data," in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, CANADA '02, pp. 639-644, July
[10] Secure-multiparty-computation: multiparty
[11] C.J. Merz and P.M. Murphy, "UCI Repository of Machine Learning Database," Available mlearn/.
