INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976-6375(Online), Volume 4, Issue 3, May-June (2013), © IAEME


ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online)                                                 IJCET
Volume 4, Issue 3, May-June (2013), pp. 449-454
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
                                                                       ©IAEME
www.jifactor.com




      PRIVACY PRESERVING CLUSTERING ON CENTRALIZED DATA
               THROUGH SCALING TRANSFORMATION

              Khatri Nishant P.                           Ms. Preeti Gupta
               M. Tech. (CSE)                                  CSE Dept.
          Amity School of Engg. & Tech.               Amity School of Engg & Tech
           Amity University Rajasthan,                 Amity University Rajasthan,
                Jaipur, India                                Jaipur, India

                                        Tusal Patel
                                       M. Tech. (CSE)
                                Amity School of Engg. & Tech.
                                 Amity University Rajasthan,
                                        Jaipur, India



  ABSTRACT

          Data sharing among organizations is useful because it offers mutual benefits for
  effective decision making and business growth. Data mining techniques can be applied to
  this shared data to extract meaningful, useful, previously unknown and ultimately
  comprehensible information from large databases. This leads to knowledge discovery, and
  the mined knowledge can be used profitably by both parties. However, information is an
  important asset to business organizations, and sharing it raises the issue of privacy
  breach. This paper introduces privacy preserving clustering for centralized data through a
  scaling based transformation.

  Keywords: Data mining, Clustering, Privacy Preservation, Scaling

  I.   INTRODUCTION

          The information age has enabled many organizations to gather large volumes of data.
  However, the usefulness of this data is negligible if “meaningful information” or
  “knowledge” cannot be extracted from it and put to use to increase effectiveness. Data
  mining, otherwise known as knowledge discovery, is the technique used by analysts to find
  hidden and unknown patterns in a collection of data, which can then be used to identify
  promising opportunities. In contrast to standard statistical methods, data mining
  techniques search the data for interesting information. Many techniques, such as
  classification, clustering and association rule mining, can be applied to mine knowledge
  from large databases.

    Confidentiality Issues in Data Mining: There are situations where sharing data among
organizations leads to mutual gain, but a key issue in any kind of data sharing is
confidentiality. The need for privacy is sometimes imposed by law (e.g., for medical
databases) and sometimes motivated by business interests. This poses a challenge for
researchers: finding techniques that preserve the privacy of data shared among the
communicating parties.
   Most privacy preserving data mining methods apply some form of transformation to the
data. Typically, such methods reduce the granularity of representation to preserve privacy.
   This paper presents a privacy preserving clustering technique in which an irreversible
scaling transformation, applied to centralized data stored in a data matrix, preserves
confidentiality without changing the nature of the data or the relationships between
the data objects.

II.   RELATED WORK

        [1] suggests a method for privacy preserving computation of cluster means, using
two protocols (one based on oblivious polynomial evaluation and the other on
homomorphic encryption). In [2], the k-means technique is used to preserve the privacy of
vertically partitioned data, meaning that the complete attribute set of the database is
divided into two or more sets and each set serves as an individual database. [3]
suggests a decision tree technique for privacy preservation over vertically partitioned data.
[4] suggests privacy preserving clustering by the Rotation Based Technique (RBT), an
effective method based mainly on isometric transformation. [5] presents an algorithm for
privacy preserving Support Vector Machine (SVM) based classification using local and
global models. The local models are private to each party and are not disclosed while the
global model is generated jointly. The global model is the same for every party and is
then used for classifying new data objects. [6] presents a modified k-means algorithm for
privacy preservation, in which a privacy preserving protocol for k-clustering is used on
horizontally partitioned databases. Further privacy preservation techniques are
presented in [6] for Naive Bayes and Decision Tree classification. [7] presents various
techniques for privacy preservation in different data mining procedures, including an
algorithm for preserving privacy in association rule mining and a subroutine for securely
finding the closest cluster in k-means clustering. [8] presents various cryptographic
techniques for privacy preservation. [9] presents theoretical and experimental results
demonstrating that random data distortion in most cases preserves little data privacy.





III. PRIVACY PRESERVING CLUSTERING BY DATA MATRIX TRANSFORMATION

A. Terms Used
a. Data Matrix

    Objects (e.g. individuals, patterns, events) are usually represented as points (vectors) in a
multidimensional space. Each dimension represents a distinct attribute describing the object.
Thus, a set of m objects is represented as an m x n matrix D, with m rows, one for each
object, and n columns, one for each attribute. This matrix is referred to as a data matrix,
represented as follows:

\[
D = \begin{bmatrix}
a_{11} & \cdots & a_{1k} & \cdots & a_{1n} \\
a_{21} & \cdots & a_{2k} & \cdots & a_{2n} \\
\vdots &        & \vdots &        & \vdots \\
a_{m1} & \cdots & a_{mk} & \cdots & a_{mn}
\end{bmatrix}
\]


B. Assumption

    1) This paper secures attributes with numeric values, on the assumption that numeric
data (e.g. salary, age, phone number) is the most sensitive data that needs to be secured.

C. General Approach

Scaling Based Transformation (SBT) method:

      Let Dmxn be a data matrix, where each row represents an object and each object contains
values for each of n numerical attributes. The SBT method of dimension n is an ordered pair,
defined as SBT = (D, fs), where:
1. D ∈ R^{m×n} is a normalized data matrix of objects to be clustered
2. fs is a scaling based transformation function

     This procedure applies a scaling operation to the data matrix, treated as a 2D
transformation on a pair of attributes, so a scaling factor must be chosen. Here the factor
is kept the same in both the x and y directions, which shifts each point to a higher scale.
Using the same factor in both directions is the key to maintaining the cluster distribution
before and after the transformation: even though the points are distorted compared to the
original data points, the cluster distribution remains the same. Thus the procedure
preserves privacy without distorting the data mining results obtained before transformation.
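
The effect of using one factor s in both directions can be written out explicitly. As a sketch, assume p and q denote the values of two objects on a selected attribute pair (Ai, Aj), and c_1, c_2 denote two cluster centroids on the same pair; these symbols are introduced here for illustration only.

\[
S = \begin{bmatrix} s & 0 \\ 0 & s \end{bmatrix}, \qquad
\lVert Sp - Sq \rVert = \lVert s\,(p - q) \rVert = |s| \, \lVert p - q \rVert
\]

Every pairwise Euclidean distance is multiplied by the same constant |s|, and the mean of the scaled points is the scaled mean. Hence, for s ≠ 0,

\[
\lVert p - c_1 \rVert < \lVert p - c_2 \rVert
\iff
\lVert Sp - Sc_1 \rVert < \lVert Sp - Sc_2 \rVert
\]

so a distance based method such as k-means assigns each point to the same cluster before and after scaling.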





D. Proposed Algorithm

SBT_Algorithm

Input : Dmxn // Dmxn is the normalized data matrix
Output: D'mxn // D'mxn is the transformed data matrix
1. k ← n/2
2. Pk ← k pairs (Ai, Aj) in D such that 1 ≤ i, j ≤ n and i ≠ j
3. Decide scaling factor s.
4. For each selected pair (Ai, Aj) in Pk do
   a. V(A'i, A'j) ← S × V(Ai, Aj) // S is the 2×2 scaling matrix with s as scaling factor
   End for
End
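
The steps of SBT_Algorithm can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `make_pairs` and `sbt_transform` are hypothetical, the rule of pairing adjacent columns is an assumption (the algorithm only requires i ≠ j), and with an odd number of attributes the last column is left unpaired.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def make_pairs(n: int) -> List[Tuple[int, int]]:
    """Pair attribute columns as (0, 1), (2, 3), ... giving n // 2 pairs.

    Pairing adjacent columns is an assumption of this sketch; with an odd n
    the last column stays unpaired.
    """
    return [(i, i + 1) for i in range(0, n - 1, 2)]


def sbt_transform(data: Sequence[Sequence[float]], s: float) -> Matrix:
    """Return a copy of `data` with every selected attribute pair scaled by s."""
    if s == 0:
        raise ValueError("scaling factor must be non-zero")
    if not data:
        return []
    n = len(data[0])
    # 2x2 scaling matrix S with the same factor on both axes.
    S = [[s, 0.0], [0.0, s]]
    out = [[float(v) for v in row] for row in data]
    for i, j in make_pairs(n):
        for row in out:
            vi, vj = row[i], row[j]
            # V(A'i, A'j) = S x V(Ai, Aj)
            row[i] = S[0][0] * vi + S[0][1] * vj
            row[j] = S[1][0] * vi + S[1][1] * vj
    return out


# Toy normalized matrix with m = 2 objects and n = 4 attributes (made-up values).
normalized = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.0, -0.75, 1.0]]
transformed = sbt_transform(normalized, s=2.0)
print(transformed)
```

The input matrix is copied rather than modified in place, so the normalized data and the released data can be kept separately.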

E. Results

       The proposed procedure was evaluated on the iris2D dataset, which contains 150
records. Clustering was performed in Weka 3.6 using the simple k-Means clustering
algorithm.

1) Cluster distribution before transformation.

                     Figure 1- Cluster Distribution before transformation




    This output shows that 100 records belong to the first cluster (cluster 0) and the
remaining 50 records belong to the second cluster (cluster 1).

    The transformed dataset is then supplied to Weka for k-Means clustering, and the
visualized output is shown below.
2) Cluster distribution after transformation.





                     Figure 2- Cluster Distribution after transformation




        Comparing Figure 1 and Figure 2, it is clear that the cluster distribution before and
after transformation remains the same. Hence the procedure keeps the clustering result
intact while the confidential numeric values themselves are released only in transformed
form.
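
The comparison can also be reproduced on a small scale without Weka. The sketch below is an illustration under stated assumptions: it uses a hand-written Lloyd's k-means (not Weka's implementation), seeds the centroids with the first k points, and runs on six made-up 2D points rather than the iris2D dataset. The names `kmeans_labels` and `sq_dist` are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(points: List[Point], k: int, iters: int = 50) -> List[int]:
    """Plain Lloyd's k-means; the first k points seed the centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels: List[int] = []
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up 2D points forming two well separated groups.
original = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
s = 4.0  # same scaling factor in both directions
scaled = [(s * x, s * y) for x, y in original]

labels_before = kmeans_labels(original, k=2)
labels_after = kmeans_labels(scaled, k=2)
print(labels_before == labels_after)
```

Because every distance is multiplied by the same factor, each assignment step picks the same centroid index for every point, and the two runs produce the same partition.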

F. Security

   The procedure stated above provides security for the numeric data: even if the
standard deviation and mean of the numeric dataset are published, the original
numeric data before transformation cannot be interpreted correctly. This is
accomplished mainly in two steps:
1) Data Camouflage: First the raw data is concealed by normalization. Normalization alone
is not secure, but it is beneficial in two ways: a) it gives equal weight to all attributes,
and b) it makes re-identification of objects against other datasets more difficult.
2) Attribute Distortion: Attribute distortion is achieved by scaling two attribute values at
a time.
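
The paper does not state which normalization is used in the Data Camouflage step. As a sketch, assuming z-score normalization (suggested by the reference to mean and standard deviation above, but an assumption nonetheless), the step could look like this; the function name `zscore_columns` and the sample values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def zscore_columns(data: Sequence[Sequence[float]]) -> List[List[float]]:
    """Normalize each attribute (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*data))
    mus = [mean(col) for col in columns]
    # Population standard deviation; a constant column falls back to 1.0
    # so the division below is always defined.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [
        [(v - mu) / sd for v, mu, sd in zip(row, mus, sigmas)]
        for row in data
    ]


# Made-up records with two numeric attributes on very different scales
# (for example salary and age).
raw = [[30000.0, 25.0], [50000.0, 35.0], [70000.0, 45.0]]
normalized = zscore_columns(raw)
print(normalized)
```

After normalization both columns lie on the same scale, which is the "equal weight" property mentioned in step 1; the attribute pairs of this normalized matrix are then scaled in step 2.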

IV. CONCLUSION

        In this paper, a scaling based transformation method has been introduced for Privacy
Preserving Clustering on Centralized Data. The proposed method is designed to preserve
privacy only for numeric confidential data. The procedure also ensures similar cluster
distributions before and after transformation, and it is independent of the clustering
algorithm. Moreover, an attempt to recover the original data from the normalized data was
unsuccessful, which supports the security of the data after transformation without
changes in cluster distribution.
        Nowadays only the data required at a particular site is stored locally, so the
complete dataset is stored in a distributed manner. This maintains the availability of
data and also reduces the load on the data server.
        Hence, as future work, this procedure can be adapted to distributed data by
making some changes for preserving privacy. This would lead to a better method for
maintaining the confidentiality of distributed (horizontally or vertically partitioned) data.



V.   REFERENCES

[1] S. Jha, L. Kruger, P. McDaniel, "Privacy Preserving Clustering".
[2] Jaideep Vaidya, Chris Clifton, "Privacy Preserving K-Means Clustering over Vertically
     Partitioned Data", SIGKDD 2003.
[3] Jaideep Vaidya, Chris Clifton, Murat Kantarcioglu, A. Scott Patterson, "Privacy
     Preserving Decision Trees over Vertically Partitioned Data", ACM Transactions on
     Knowledge Discovery from Data, Vol. 2, No. 3, Article 14, October 2008.
[4] Ali Inan, Yucel Saygin, "Privacy Preserving Spatio-Temporal Clustering on Horizontally
     Partitioned Data".
[5] Hwanjo Yu, Jaideep Vaidya, Xiaoqian Jiang, "Privacy Preserving SVM Classification on
     Vertically Partitioned Data".
[6] Geetha Jagannathan, Krishnan Pillaipakkamnatt, Rebecca N. Wright, Daryl Umano,
     "Communication Efficient Privacy-Preserving Clustering".
[7] Jaideep Shrikant Vaidya, "Privacy Preserving Data Mining Over Vertically Partitioned
     Data", thesis.
[8] Benny Pinkas, "Cryptographic Techniques for Privacy-Preserving Data Mining".
[9] Hillol Kargupta, Souptik Datta, Qi Wang, Krishnamoorthy Sivakumar, "Random Data
     Perturbation Techniques and Privacy Preserving Data Mining".
[10] Deepika Khurana, Dr. M.P.S Bhatia, "Dynamic Approach To K-Means Clustering
     Algorithm", International Journal of Computer Engineering & Technology (IJCET),
     Volume 4, Issue 3, 2013, pp. 204-219, ISSN Print: 0976-6367, ISSN Online:
     0976-6375.



