2050-2311/Copyright © 2012 IE Enterprises Ltd. All rights reserved.
Jour. of Comp. Sci. and Eng., Vol. 1, Num. 1, 0028–0031, 2012

       Study on the K-anonymity Algorithm for Data Publishing
                    Xuesong Zhang*, Yong Xu, Zecheng Wang
   School of Computer Science & Engineering, Anhui University of Finance & Economics, Bengbu 233030, China


Anonymous data publishing technology is used to preserve personal privacy by avoiding information leakage while keeping
the published data as useful as possible. This paper first introduces the basic concepts of data mining and k-anonymity,
and then compares two anonymization algorithms based on k-means and kNN.

Keywords: K-anonymity; K-means; KNN

1. Introduction
   With the development of computer and information technology, people can access and use shared information
more easily and conveniently than before. However, because database systems are now so widely used, the
databases themselves have faced increasing threats in recent years, and leaks of personal or private data occur
repeatedly. As a result, people all over the world are increasingly aware of the problem of database privacy
protection.
   How to share information effectively while protecting private or sensitive information from being leaked has
become an active research direction in the field of information security. Anonymous data release technology is a
positive attempt at solving this problem.

   * Corresponding author. Xuesong Zhang
   E-mail address:
   September 2012
                    Xuesong Zhang et al / Journal of Computational Science and Engineering 1:1 (2012) 0028–0031

2. Basic Concept
   In 1980 Cox was one of the first to propose and implement anonymization as a means of privacy protection,
and in 1986 Dalenius applied anonymization to the protection of census records. Since the concept was first
proposed, many scholars in China and abroad have conducted a wide range of technical research on it.

   Definition 1: Data Mining [1]. A means of knowledge discovery: the extraction of previously unknown and
potentially valuable information, such as decision-making rules, from large amounts of stored data.
   Definition 2: Clustering [2]. The grouping of a large number of data samples (n) into k classes (k < n)
according to their degree of similarity, so that samples with high similarity fall into the same class and samples
with low similarity fall into different classes.
   Definition 3: Anonymity Technology [3]. A privacy protection technology which ensures that, even if the data
are mined, the identity of the data owner cannot be deduced.
   Definition 4: K-anonymity Model [5]. K-anonymity was put forward in 1998 by Samarati and Sweeney [4]. It is
a simple and effective privacy protection model. It requires that every released record be indistinguishable from
at least k-1 other records with respect to the quasi-identifier attributes, so that each combination of
quasi-identifier values is shared by at least k records. An attacker therefore cannot tell which private information
belongs to a specific individual, and personal privacy is protected. The parameter k lets the user specify the level
of information leakage risk that is acceptable. K-anonymity protects the privacy of individuals to a certain extent
but at the same time reduces the availability of the data. Research on k-anonymity therefore focuses on protecting
private information while improving data availability.
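
   Definition 4 can be illustrated with a short check. The following Python sketch is not from the paper: the function name `is_k_anonymous`, the attribute names and the sample table are hypothetical. It groups records by their quasi-identifier values and tests whether every group contains at least k records.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical, already generalized table; "disease" is the sensitive attribute.
table = [
    {"age": "20-29", "zip": "2330**", "disease": "flu"},
    {"age": "20-29", "zip": "2330**", "disease": "asthma"},
    {"age": "30-39", "zip": "2331**", "disease": "flu"},
    {"age": "30-39", "zip": "2331**", "disease": "diabetes"},
]
qi = ["age", "zip"]

print(is_k_anonymous(table, qi, 2))  # each (age, zip) group has 2 records
print(is_k_anonymous(table, qi, 3))  # no group reaches 3 records
```

   In this hypothetical table each quasi-identifier group holds two records, so the table satisfies 2-anonymity but not 3-anonymity.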
   Commonly used anonymization techniques are based on generalization over attribute hierarchies, such as
full-domain generalization [6][7], local generalization [8], single-level generalization schemes [6][7][9] and
multi-level generalization schemes [8][10]. Other techniques are based on clustering [11][12], on partitioning [13]
and on search over the power set [14].
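
   As a rough illustration of generalization, the sketch below replaces exact values with coarser ones. It is a simplified example, not one of the cited schemes: the helpers `generalize_age` and `generalize_zip`, the interval width and the sample values are assumptions made for illustration. Applying the same level to every record of an attribute corresponds to the full-domain style of generalization.

```python
def generalize_age(age, width):
    """Replace an exact age with the interval of the given width that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, masked_digits):
    """Mask the last `masked_digits` characters of a postal code with '*'."""
    if masked_digits == 0:
        return zip_code
    return zip_code[:-masked_digits] + "*" * masked_digits

# Hypothetical raw values; every record is generalized to the same level.
ages = [23, 27, 34, 38]
zips = ["233030", "233041", "233112", "233157"]

generalized = [
    (generalize_age(a, 10), generalize_zip(z, 2)) for a, z in zip(ages, zips)
]
print(generalized)
```

   Coarser levels (a wider interval, more masked digits) make more records share the same quasi-identifier values, at the cost of less precise data.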

3. Selected Algorithms

3.1. K-means Algorithm [15]

  The k-means algorithm is a typical distance-based clustering algorithm. It uses distance as the similarity
measure: the closer two objects are, the more similar they are considered to be. A cluster is made up of objects
that are close to one another, so the ultimate goal is to obtain compact and well-separated clusters.

3.1.1 Algorithm Idea

   The selection of the k initial cluster centres has a great influence on the clustering result, because it is the
first step of the algorithm. k objects are selected at random, and each of them initially represents the centre of
one cluster. In each iteration, every remaining object in the data set is assigned to the nearest cluster according
to its distance from each cluster centre.
   Once all data objects have been examined, one iteration is complete and the new cluster centres are calculated.
If the value of the criterion function does not change between two successive iterations, the algorithm has
converged.
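
   The paper does not state the criterion function. A common choice for k-means, assumed here for illustration, is the sum of squared errors, where C_i is the i-th cluster, m_i its centre and x a data object:

```latex
E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^{2},
\qquad
m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```

   Under this assumption, convergence means that E takes the same value after two successive iterations.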

3.1.2 Algorithm Process

   1)    Randomly select k of the n documents as the initial centroids.
   2)    For each remaining document, measure its distance to each centroid and assign it to the class of the
         nearest centroid.
   3)    Recalculate the centroid of each class.
   4)    Repeat steps 2 and 3 until the distance between the new centroids and the previous centroids is zero
         or below a specified threshold, at which point the algorithm ends.
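
   The four steps above can be sketched in Python as follows. This is a minimal illustration and not the authors' C++ program: the function name `kmeans`, the Euclidean distance, the default threshold, the iteration cap and the sample points are all assumptions.

```python
import math
import random

def kmeans(points, k, threshold=1e-9, max_iter=100, seed=0):
    """Cluster `points` (tuples of numbers) into k groups; requires Python 3.8+."""
    rng = random.Random(seed)
    # Step 1: pick k of the n points at random as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid. The initial
        # centroids are reassigned too; each simply lands in its own cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop when no centroid moved by more than the threshold.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= threshold:
            break
    return centroids, clusters

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (1, 0), (2, 0), (100, 0), (101, 0), (102, 0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

   Because the initial centroids are chosen at random, different seeds can lead to different results on less clearly separated data, which is the sensitivity to initialization noted in Section 3.1.1.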

3.2. kNN Algorithm [16]

   The kNN algorithm, also known as the k-Nearest Neighbour algorithm, is a classification method. It
identifies a new record by combining the k most similar historical records.

3.2.1 Algorithm Idea

   If most of the k samples that are most similar to a given sample in the feature space belong to one category,
the sample also belongs to that category. In kNN the neighbours are chosen from objects that have already been
classified correctly. The decision is based only on the categories of the nearest one or few samples. Although the
kNN method relies in principle on the limit theorem, only a very small number of adjacent samples take part in the
category decision, so the method can help to avoid the problem of sample imbalance. In addition, because kNN
determines the category mainly from the limited adjacent samples rather than by discriminating class domains, it
is more suitable than other methods when class domains cross or overlap heavily.

3.2.2 Algorithm Process

  1)    Calculate the distance between each sample and the data to be classified.
  2)    Select the k samples with the smallest distances to the data to be classified.
  3)    Count the classes of these k samples and find the class that most of them belong to.
  4)    Assign the data to be classified to that class.
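
   The four steps above can be sketched in Python as follows. This is a minimal illustration and not the authors' C++ program: the function name `knn_classify`, the Euclidean distance and the labelled sample points are assumptions.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Classify `query` from `training`, a list of (point, label) pairs."""
    # Step 1: distance from every training sample to the query (Python 3.8+).
    distances = [(math.dist(point, query), label) for point, label in training]
    # Step 2: keep the k samples with the smallest distances.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Step 3: count how many of the k neighbours fall into each class.
    votes = Counter(label for _, label in nearest)
    # Step 4: return the majority class (ties go to the class seen first).
    return votes.most_common(1)[0][0]

# Hypothetical training set with two classes.
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"),
]
print(knn_classify(training, (2, 2)))
print(knn_classify(training, (7, 9)))
```

   Unlike k-means, kNN needs labelled training data and does no iterative training; all the work happens when a query is classified.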

4. Algorithm Performance Comparison
    K-means and kNN were implemented in C++ and run in VC6.0, with k = 3 in both programs. In the kNN
program, 30 sets of data were selected as the training data set. The two programs were then run on 30, 100, 300
and 1000 data sets, and each group of data was tested five times in each program. The experimental results are
shown in Figure 1.
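
    A timing procedure of this kind can be sketched as follows. The sketch is in Python rather than the C++ used in the experiment, and the helper `time_runs` and the placeholder workload `stand_in_algorithm` are hypothetical. It only shows how five repeated runs per data size might be timed and reproduces none of the measurements behind Figure 1.

```python
import time

def time_runs(func, data, repeats=5):
    """Run func(data) `repeats` times and return the elapsed seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    return timings

def stand_in_algorithm(data):
    # Hypothetical placeholder workload; substitute the algorithm under test.
    return sorted(data)

# Data sizes follow the experiment description; the data itself is made up.
for n in (30, 100, 300, 1000):
    data = list(range(n, 0, -1))
    timings = time_runs(stand_in_algorithm, data)
    print(n, sum(timings) / len(timings))  # mean time; machine-dependent
```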

Fig. 1. Comparison of the running times of the two algorithms

5. Conclusion
  This article mainly introduced the k-means algorithm and the kNN algorithm. The theoretical study of the
two algorithms and the hands-on experiments with them lay the foundation for further research and for the
future use of these two algorithms.


  The work is supported by Anhui Provincial Natural Science Foundation of China under Grant No.
11040606M140; Humanity and Social Science foundation of Ministry of Education of China under Grant No.
12YJA630136; the Natural Science Foundation of Anhui Education Department of China under Grant No.
KJ2010ZD01; Anhui University of Finance & Economics student research innovation fund under Grant No.45.


[1] Chen MS, Han JW, Yu PS. Data mining: An overview from a database perspective. IEEE Trans. on Knowledge and Data
[2] Zhang MW, Liu Y, Zhang B, Zhu ZL. Concept-Based data clustering model. Journal of Software, 2009, 20(9):2387-2396. (In Chinese)
[3] MA Ting-huai, TANG Mei-li. Data Mining Based on Privacy Preserving. Computer Engineering 34(9):78-80. (In Chinese)
[4] Samarati P, Sweeney L. Generalizing data to provide anonymity when disclosing information (abstract) [A]. Proc of the 17th ACM-
   SIGMOD-SIGACT-SIGART Symposium on the Principles of Database Systems [C]. Seattle, WA, USA: IEEE press, 1998. 188.
[5] HAN Jian-min, CEN Ting-ting, YU Hui-qun. Research in Microaggregation Algorithms for k-Anonymity. ACTA ELECTRONICA
   SINICA 36(10):2022-2029. (In Chinese)
[6] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty,
   Fuzziness and Knowledge-based Systems, 10 (5), 2002; 571-588.
[7] K. LeFevre, D. DeWitt and R. Ramakrishnan, Incognito: Efficient full-domain k-anonymity, in: Proceedings of the ACM SIGMOD
   International Conference on Management of Data (SIGMOD), Baltimore, MD, USA, 2005.
[8] B.C.M. Fung, K. Wang and P.S. Yu, Top-down specialization for information and privacy preservation, in: Proceedings of the
   International Conference on Data Engineering (ICDE), Tokyo, Japan, 2005.
[9] P. Samarati, Protecting respondents' privacy in microdata release, IEEE Transactions on Knowledge and Data Engineering 13 (2001),
[10] V.S. Iyengar, Transforming data to satisfy privacy constraints, in: Proceedings of the ACM SIGKDD International Conference on
   Knowledge Discovery and Data Mining (KDD), Edmonton, AB, Canada, 2002.
[11] JW Byun, A Kamra, E Bertino, N Li. Efficient k-anonymization using clustering techniques. Proceedings of Database Systems for
   Advanced Applications (DASFAA 2007), LNCS 4443. Berlin: Springer-Verlag, 2007:188-200.
[12] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas and A. Zhu, Achieving anonymity via clustering, in:
   Proceedings of the ACM Symposium on Principles of Database Systems (PODS), Chicago, IL, USA, 2006.
[13] K. LeFevre, D. DeWitt and R. Ramakrishnan, Mondrian multidimensional k-anonymity, in: Proceedings of the International
   Conference on Data Engineering (ICDE), Atlanta, GA, USA, 2006.
[14] R.J. Bayardo and R. Agrawal, Data privacy through optimal k-anonymization, in: Proceedings of the International Conference on Data
   Engineering (ICDE), Tokyo, Japan, 2005.
[15] Han JW, Kamber M, Wrote; Fan M, Meng XF, Trans. Data Mining Concepts and Techniques. Beijing: China Machine Press,
   2001. 232-236. (In Chinese)
[16] ZHANG Ning, JIA Ziyan, SHI Zhongzhi. Text Categorization with KNN Algorithm. Computer Engineering 31(8):171-185. (In
   Chinese)


 Zhang Xue-song, born in 1989, is a 2009 undergraduate at the School of Computer Science & Engineering, Anhui
University of Finance & Economics.