Journal of Computational Science and Engineering
www.jcseuk.com
2050-2311/Copyright © 2012 IE Enterprises Ltd
Jour. of Comp. Sci. and Eng. All rights reserved
Vol. 1, Num. 1, 0028–0031, 2012

Study on the K-anonymity Algorithm for Data Publishing Scenario

Xuesong Zhang (a,*), Yong Xu (b), Zecheng Wang (c)
(a,b,c) School of Computer Science & Engineering, Anhui University of Finance & Economics, Bengbu 233030, China
* Corresponding author: Xuesong Zhang. E-mail address: firstname.lastname@example.org.
September 2012
Xuesong Zhang et al / Journal of Computational Science and Engineering 1:1 (2012) 0028–0031

Abstract
Anonymous data publishing technology preserves personal privacy by preventing information leakage while keeping the published data as useful as possible. This paper first introduces the basic concepts of data mining and k-anonymity, and then compares two anonymization algorithms, one based on k-means and one based on kNN.

Keywords: K-anonymity; K-means; KNN

1. Introduction
With the development of computer and information technology, people can access and use shared information more easily and conveniently. However, everything has two sides: because database systems are now so widely used, databases have faced growing threats in recent years, and leaks of personal and private data occur repeatedly. People all over the world are therefore increasingly aware of the issue of database privacy protection. How to share information effectively while protecting private and sensitive information from leakage has become an active research direction in the field of information security. Anonymous data publishing technology is a positive attempt at solving this problem.

2. Basic Concept
Cox, in 1980, was one of the first to propose and implement anonymization as a means of privacy protection; in 1986 Dalenius applied anonymization techniques to protect privacy in census records. Since the concept of anonymity was first proposed, many scholars in China and abroad have carried out wide-ranging research on the technology.

Definition 1: Data Mining. A means of knowledge discovery: the extraction of previously unknown, potentially valuable information and decision-making rules from large volumes of stored data.

Definition 2: Clustering. The grouping of a large number of data samples (n) into k classes (k < n) according to their degree of similarity, so that samples with high similarity fall into the same class and samples with low similarity fall into different classes.

Definition 3: Anonymity Technology. A privacy protection technique which ensures that, even if the data are mined, the identity of the data owner cannot be deduced.

Definition 4: K-anonymity Model. K-anonymity was put forward in 1998 by Samarati and Sweeney. The k-anonymity model is a simple and effective privacy protection model. It requires that every record in the released data be indistinguishable from at least k − 1 other records on the quasi-identifier attributes, so that each combination of quasi-identifier values is shared by at least k records. An attacker therefore cannot tell which private information belongs to a specific individual, and personal privacy is protected. The parameter k specifies the level of information leakage risk that the user can tolerate. K-anonymity protects the privacy of individuals to a certain extent, but at the same time it reduces the availability of the data. Research on k-anonymity therefore focuses on protecting private information while improving data availability.

Commonly used anonymization techniques include generalization based on attribute hierarchies.
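The requirement in Definition 4 can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation: the function name `is_k_anonymous`, the attribute names and the small generalized table are all hypothetical and serve only to illustrate the definition.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    # Group records by their values on the quasi-identifier attributes only.
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical table, already generalized (age ranges, truncated postcodes).
# "disease" plays the role of the sensitive attribute and is not a quasi-identifier.
table = [
    {"age": "20-29", "zip": "2330**", "disease": "flu"},
    {"age": "20-29", "zip": "2330**", "disease": "asthma"},
    {"age": "30-39", "zip": "2331**", "disease": "flu"},
    {"age": "30-39", "zip": "2331**", "disease": "diabetes"},
]

two_anonymous = is_k_anonymous(table, ["age", "zip"], 2)
three_anonymous = is_k_anonymous(table, ["age", "zip"], 3)
```

In this illustrative table each quasi-identifier combination is shared by exactly two records, so the table satisfies 2-anonymity but not 3-anonymity.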
Attribute-hierarchy generalization schemes include full-domain generalization, local generalization, single-level generalization and multi-level generalization. Other commonly used techniques are based on clustering, on partitioning, and on searching the power set.

3. Selected Algorithms

3.1. K-means Algorithm
The k-means algorithm is a typical distance-based clustering algorithm. It uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be. Clusters are formed from objects that are close to one another, so the ultimate goal is to obtain compact and well-separated clusters.

3.1.1 Algorithm Idea
The choice of the k initial cluster centres has a great influence on the clustering result, because the first step of the algorithm is to select k arbitrary objects at random as the initial cluster centres, each of which initially represents one cluster. In each iteration, every remaining object in the data set is assigned to the nearest cluster according to its distance to each cluster centre. Once all data objects have been examined, one iteration is complete and the new cluster centres are calculated. If the value of the criterion function does not change between two successive iterations, the algorithm has converged.

3.1.2 Algorithm Process
1) Randomly select k of the n documents as the initial centroids.
2) For each remaining document, measure its distance to each centroid and assign it to the class of the nearest centroid.
3) Recalculate the centroid of each class.
4) Repeat steps 2 and 3 until the new centroids equal the previous centroids or differ from them by less than a specified threshold; the algorithm then ends.

3.2. kNN Algorithm
The kNN algorithm is also known as the k-Nearest Neighbour algorithm.
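Before the details of kNN, the four-step k-means process of Section 3.1.2 can be illustrated with a minimal Python sketch. It is not the authors' C++ program: the function name `kmeans`, the parameters `threshold`, `max_iter` and `seed`, the use of Euclidean distance and the sample points are all assumptions made for illustration. For simplicity the sketch assigns every point, including the initial centroids, in step 2.

```python
import math
import random


def kmeans(points, k, threshold=1e-9, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k groups (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: randomly select k points as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each point to the class of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recalculate the centroid of each class as the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centroids.append(mean)
            else:
                # Assumption: an empty class keeps its previous centroid.
                new_centroids.append(centroids[i])
        # Step 4: stop once no centroid moved by more than the threshold.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= threshold:
            break
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, k=2)
```

Because the initial centroids are chosen at random, different seeds can lead to different results on less clearly separated data, which is the sensitivity to initialization noted in Section 3.1.1.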
It classifies a new record by means of its k nearest neighbours, that is, by combining the k most similar historical records.

3.2.1 Algorithm Idea
If most of the k samples that are most similar to a given sample in the feature space belong to a certain category, then the sample also belongs to that category. In the kNN algorithm, the selected neighbours are all objects that have already been correctly classified. The classification decision is based only on the categories of the one or few nearest samples. Although the kNN method relies in principle on the limit theorem, the category decision involves only a very small number of adjacent samples, so the method can help to avoid the problem of sample imbalance. In addition, because the kNN method determines the category mainly from a limited number of adjacent samples rather than by discriminating class domains, it is more suitable than other methods for sample sets whose class domains cross or overlap considerably.

3.2.2 Algorithm Process
1) Calculate the distance between each sample and the data to be classified.
2) Select the k samples with the smallest distances to the data to be classified.
3) Count which class is most common among these k samples.
4) Assign the data to be classified to that class.

4. Algorithm Performance Comparison
K-means and kNN were implemented in C++ and run in VC6.0, with k = 3 in both programs. In the kNN program, 30 data items were selected as the training data set. Both programs were run on data sets of 30, 100, 300 and 1000 data items, and each data set was tested five times in each program. The experimental results are shown in Figure 1.

Fig. 1. Comparison of the running times of the two algorithms

5. Conclusion
This article mainly introduced the k-means algorithm and the kNN algorithm.
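For reference, the four-step kNN process of Section 3.2.2 can be sketched in Python as follows. This is an illustration only, not the C++ program used in the experiments: the function name `knn_classify`, the Euclidean distance, the labels and the training points are assumptions. The default k = 3 matches the value used in Section 4.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    `training` is a list of (point, label) pairs; points are coordinate tuples.
    """
    # Step 1: compute the distance from every training sample to the query
    # (done here through the sort key).
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    # Step 2: keep the k samples with the smallest distances.
    nearest = by_distance[:k]
    # Step 3: count how often each class occurs among these k samples.
    votes = Counter(label for _, label in nearest)
    # Step 4: assign the query to the most frequent class.
    # Assumption: ties are not handled specially in this sketch.
    return votes.most_common(1)[0][0]


# Hypothetical, already-classified training data with two classes.
training = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((8.0, 8.0), "B"), ((8.0, 9.0), "B"), ((9.0, 8.0), "B"),
]
label_near_a = knn_classify(training, (1.5, 1.5))
label_near_b = knn_classify(training, (8.5, 8.5))
```

Unlike k-means, this sketch has no iterative training phase: all of the work happens at classification time, when the distances to the stored samples are computed.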
Learning and mastering the two algorithms, both in theory and through experiment, lays the foundation for further study and for their subsequent use.

Acknowledgments
This work is supported by the Anhui Provincial Natural Science Foundation of China under Grant No. 11040606M140; the Humanity and Social Science Foundation of the Ministry of Education of China under Grant No. 12YJA630136; the Natural Science Foundation of the Anhui Education Department of China under Grant No. KJ2010ZD01; and the Anhui University of Finance & Economics student research innovation fund under Grant No. 45.

References
[1] Chen MS, Han JW, Yu PS. Data mining: An overview from a database perspective. IEEE Trans. on Knowledge and Data Engineering, 1996, 8(6): 866-883.
[2] Zhang MW, Liu Y, Zhang B, Zhu ZL. Concept-based data clustering model. Journal of Software, 2009, 20(9): 2387-2396. (In Chinese)
[3] Ma Ting-huai, Tang Mei-li. Data mining based on privacy preserving. Computer Engineering, 34(9): 78-80. (In Chinese)
[4] Samarati P, Sweeney L. Generalizing data to provide anonymity when disclosing information (abstract). Proc. of the 17th ACM SIGMOD-SIGACT-SIGART Symposium on the Principles of Database Systems. Seattle, WA, USA: ACM Press, 1998. 188.
[5] Han Jian-min, Cen Ting-ting, Yu Hui-qun. Research in microaggregation algorithms for k-anonymity. Acta Electronica Sinica, 36(10): 2022-2029. (In Chinese)
[6] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 2002: 571-588.
[7] K. LeFevre, D. DeWitt and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), Baltimore, MD, USA, 2005.
[8] B.C.M. Fung, K. Wang and P.S. Yu. Top-down specialization for information and privacy preservation. In: Proceedings of the International Conference on Data Engineering (ICDE), Tokyo, Japan, 2005.
[9] P. Samarati. Protecting respondents' privacy in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13 (2001): 1010-1027.
[10] V.S. Iyengar. Transforming data to satisfy privacy constraints. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Edmonton, AB, Canada, 2002.
[11] JW Byun, A Kamra, E Bertino, N Li. Efficient k-anonymization using clustering techniques. Proceedings of Database Systems for Advanced Applications (DASFAA 2007), LNCS 4443. Berlin: Springer-Verlag, 2007: 188-200.
[12] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas and A. Zhu. Achieving anonymity via clustering. In: Proceedings of the ACM Symposium on Principles of Database Systems (PODS), Chicago, IL, USA, 2006.
[13] K. LeFevre, D. DeWitt and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In: Proceedings of the International Conference on Data Engineering (ICDE), Atlanta, GA, USA, 2006.
[14] R.J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In: Proceedings of the International Conference on Data Engineering (ICDE), Tokyo, Japan, 2005.
[15] Han JW, Kamber M, Wrote; Fan M, Meng XF, Trans. Data Mining: Concepts and Techniques. Beijing: China Machine Press, 2001. 232-236. (In Chinese)
[16] Zhang Ning, Jia Ziyan, Shi Zhongzhi. Text categorization with KNN algorithm. Computer Engineering, 31(8): 171-185. (In Chinese)

Biography
Zhang Xue-song, born in 1989, is a 2009 undergraduate of the School of Computer Science & Engineering, Anhui University of Finance & Economics.