
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 3, Issue 3, May-June 2014 ISSN 2278-6856

Anonymization of Social Networks by a Novel Clustering Algorithm

K. Santhi(1), N. Sai Lohitha(2)
(1),(2) Sree Vidyanikethan Engineering College, Tirupathi

Abstract— Nowadays people use social networks very widely. With the influence of social networks on society, people are becoming more and more sensitive to privacy issues in communication networks, so anonymization is essential when publishing the information gathered by such social networks (Twitter, MySpace, Facebook, and Orkut). The goal of the proposed work is to arrive at an anonymized view of a social network without revealing any information about nodes, or links between nodes, that is forbidden by the data holders. The main contributions of this paper are a sequential clustering algorithm for anonymizing a social network and an information loss measure that quantifies how much privacy-relevant utility the anonymization process sacrifices. The algorithm significantly outperforms the SaNGreeA algorithm, the leading clustering-based algorithm for achieving anonymity in social networks. SaNGreeA builds the clustering greedily, one cluster at a time, by selecting a seed node and then repeatedly adding the best next node to it. Its main disadvantage is that it has no mechanism for correcting bad clustering decisions made earlier, and part of its structural information loss can be evaluated only once the entire clustering is defined. The sequential clustering algorithm does not suffer from those problems, because at each stage of its execution it maintains a full clustering: it always makes decisions based on the real information loss, and it constantly allows the correction of previous clustering decisions.

Keywords- Social networks, clustering, privacy preserving data mining, information loss.

1. INTRODUCTION
Networks are structures that describe a set of entities and the relationships between them. Social networks, in particular, provide information about the individuals in some population and the links between them, which may describe relationships such as friendship, correspondence, or collaboration. Networks may be modeled by a graph in which the nodes represent the entities and the edges denote the relationships between them. Real social networks are more complex than that and contain additional information: edges may be labeled, and nodes may carry attributes that provide demographic information such as age, gender, location, or occupation. Such social networks are of interest to researchers in many fields, such as market research, sociology, psychology, and epidemiology.

The data in such social networks cannot be released as is, because it might contain sensitive information. Consequently, the data has to be anonymized prior to its publication in order to respect the privacy of the individuals whose sensitive information it includes. Data anonymization characteristically trades off with utility; therefore, a middle path must be found in which the released anonymized data still holds enough utility on the one hand and preserves privacy to some accepted degree on the other.

In this paper we propose a novel anonymization technique based on clustering the nodes into super-nodes, each of size at least k, where k is the required anonymity parameter. Studies of anonymizing social networks have so far considered centralized networks, i.e., networks held by a single player or data holder; in a distributed setting, by contrast, the network data is split between several players. This study deals with social networks in which the nodes may be accompanied by descriptive data, and it proposes a new anonymization method (namely, clustering the nodes) that produces anonymized views of the graph with considerably smaller information loss than the anonymizations issued by the algorithms of [2] and [3].

2. ANONYMIZATION BY CLUSTERING
We represent the social network as a simple undirected graph G = (V, E), where V = {v_1, ..., v_N} is the set of nodes and E is the set of edges (unordered pairs of nodes). Each node in the graph represents an individual, and an edge connecting two nodes describes a relationship between the respective individuals.
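To make this representation concrete, here is a minimal sketch in Python (our own illustration; the node names and records are invented, not taken from the paper) of a social network with quasi-identifier records attached to the nodes:

```python
# A social network as a simple undirected graph: edges are frozensets,
# so {u, v} and {v, u} are the same edge, and a quasi-identifier record
# is attached to each node. All data here is invented for illustration.
nodes = ["v1", "v2", "v3", "v4"]
edges = {frozenset({"v1", "v2"}), frozenset({"v2", "v3"}),
         frozenset({"v1", "v3"}), frozenset({"v3", "v4"})}
# Quasi-identifier records R_n over the attributes (age, gender, zip).
records = {
    "v1": {"age": 34, "gender": "M", "zip": "47824"},
    "v2": {"age": 36, "gender": "F", "zip": "47865"},
    "v3": {"age": 29, "gender": "F", "zip": "47843"},
    "v4": {"age": 41, "gender": "M", "zip": "47901"},
}

def neighbors(v):
    """Return the set of nodes adjacent to v."""
    return {w for e in edges if v in e for w in e if w != v}

print(sorted(neighbors("v3")))
```

Using frozensets for edges keeps the graph undirected, matching the simple undirected graph G = (V, E) of the text.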
Each node in the social network graph is described by a set of non-identifying attributes, such as zip code or age, which are called quasi-identifiers. Quasi-identifiers are not themselves unique identifiers, but a combination of such attributes can single an individual out by means of linking attacks [4]; therefore they should be generalized in order to block such attacks.

Let A_1, ..., A_I denote the quasi-identifier attributes as well as the sets of values that they may attain. For example, if A_1 is the gender attribute, then the set of values for A_1 is {M, F}. Each node v_n, 1 ≤ n ≤ N, is described by a quasi-identifier record R_n = (R_n(1), ..., R_n(I)) ∈ A_1 × ... × A_I.

Generalization means replacing quasi-identifier values with less specific and less informative, but semantically consistent, values. For example, the zip codes {47824, 47865, 47843} can be generalized to 478** by stripping the rightmost digits.
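The digit-stripping generalization of the zip-code example can be sketched as follows (a small illustration of ours; the function name is an assumption, and only the longest-common-prefix scheme described in the text is implemented):

```python
def generalize_zips(zips):
    """Replace a group of equal-length zip codes by their longest common
    prefix, masking the stripped rightmost digits with '*' (e.g. 478**)."""
    prefix = zips[0]
    for z in zips[1:]:
        while not z.startswith(prefix):
            prefix = prefix[:-1]  # shorten until it matches every code
    return prefix + "*" * (len(zips[0]) - len(prefix))

print(generalize_zips(["47824", "47865", "47843"]))  # -> 478**
```

The more digits the group shares, the less information the generalization destroys; this is exactly the kind of loss the paper's descriptive information loss measure quantifies.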
Various generalization strategies have been proposed in [5]; they allow values from different domain levels to be combined in order to generate the generalization. Generalization on graph data can be carried out using one of the five categories proposed in [8]. To summarize, an anonymized social network is defined as follows.

Definition 2.1: A social network SN = (V, E, R) is defined over a set of nodes V = {v_1, ..., v_N}, where E, a set of unordered pairs of nodes, is the structural data (edges) describing the relations between the individuals in V; A_1, ..., A_I is a set of quasi-identifier attributes; and R = {R_1, ..., R_N}, where R_n ∈ A_1 × ... × A_I, 1 ≤ n ≤ N, is the descriptive data of the individuals in V. Each quasi-identifier attribute A_i, 1 ≤ i ≤ I, is accompanied by a collection Ā_i of subsets of A_i that may be used for generalization.

As in [1], [2], [3], we consider anonymizing a given social network by clustering. Let C = {C_1, ..., C_T} be a partition of V into T disjoint clusters; i.e., V = C_1 ∪ ... ∪ C_T and C_t ∩ C_s = ∅ for all 1 ≤ t ≠ s ≤ T. The corresponding clustered graph G_C = (V_C, E_C) is the graph whose node set V_C equals C, and in which an edge connects clusters C_t and C_s iff E contains an edge from a node in C_t to a node in C_s. Each node C_t ∈ V_C is associated with two pieces of information: |C_t|, the number of original V-nodes that C_t contains, and |e_t|, the number of edges in E that connect nodes within C_t. In addition, each edge {C_t, C_s} ∈ E_C is labeled by a weight e_{t,s}, the number of edges in E that connect a node in C_t to a node in C_s.

Fig 1: A network and corresponding clustering

Definition 2.2: Consider a social network SN = (V, E, R), and let Ā_1, ..., Ā_I be generalization taxonomies for the quasi-identifier attributes A_1, ..., A_I. Given a clustering C = {C_1, ..., C_T} of V, the corresponding clustered social network is SN_C = (C, E_C, R'), where:
- E_C is the set of edges on V_C, in which {C_t, C_s} ∈ E_C iff there exist v_n ∈ C_t and v_n' ∈ C_s such that {v_n, v_n'} ∈ E;
- the clusters in V_C are labeled by their size and their number of intra-cluster edges, while the edges in E_C are labeled by the corresponding numbers of inter-cluster edges in E;
- R' = {R'_1, ..., R'_T}, where R'_t is the minimal record in Ā_1 × ... × Ā_I that generalizes all quasi-identifier records of the individuals in C_t, 1 ≤ t ≤ T.

Hence, in addition to the structural data, which is specified by E_C and the integer labels (|C_t|, |e_t|) of the nodes and e_{t,s} of the edges, such a graph carries descriptive data that is derived from R, the original descriptive data. Here we apply generalization of the quasi-identifiers, which is the common method for anonymizing tabular data.

3. SEQUENTIAL CLUSTERING ALGORITHM
The sequential clustering algorithm for k-anonymizing tables was presented in [7]. It was shown there to be very efficient both in terms of runtime and in terms of the utility of the output anonymization. We proceed to describe an adaptation of it for anonymizing social networks.

Fig 2: Sequential Clustering Algorithm

The algorithm begins with an arbitrary partition of the records into some number T of clusters. It then goes over the N records in a cyclic manner and, for each record, checks whether it may be moved from its present cluster to a different one while increasing the utility of the induced anonymization. This loop may be iterated until it reaches a local optimum (a point at which no single-record move improves the utility).
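As an illustration of how the labels of the clustered graph of Definition 2.2 are computed, the following sketch (our own code; the function name and toy data are assumptions) derives the cluster sizes |C_t|, the intra-cluster edge counts |e_t|, and the inter-cluster weights e_{t,s} from a given partition:

```python
def clustered_graph(edges, clusters):
    """Given edges (a set of frozenset node pairs) and clusters (a dict
    mapping cluster name -> set of nodes), return the cluster sizes,
    the intra-cluster edge counts |e_t|, and the inter-cluster edge
    weights e_{t,s} keyed by unordered cluster pairs."""
    cluster_of = {v: name for name, members in clusters.items() for v in members}
    sizes = {name: len(members) for name, members in clusters.items()}
    intra = {name: 0 for name in clusters}   # |e_t|
    inter = {}                               # e_{t,s}
    for e in edges:
        u, v = tuple(e)
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            intra[cu] += 1
        else:
            key = frozenset({cu, cv})
            inter[key] = inter.get(key, 0) + 1
    return sizes, intra, inter

# Toy network: v1-v2 and v3-v4 are intra-cluster edges, v2-v3 crosses.
edges = {frozenset({"v1", "v2"}), frozenset({"v2", "v3"}),
         frozenset({"v3", "v4"})}
clusters = {"C1": {"v1", "v2"}, "C2": {"v3", "v4"}}
sizes, intra, inter = clustered_graph(edges, clusters)
print(sizes, intra, inter)
```

These three quantities are exactly the labels that an adversary sees in the published clustered graph; the original edges inside and between clusters are hidden behind the counts.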
As there is no guarantee that such a process finds the global optimum, it may be repeated many times with different arbitrary partitions as the starting point, in order to find the best local optimum among those repeated searches. Sequential clustering is known to perform well both in terms of runtime and in terms of the quality of the output [6].

The initial clusters are chosen so that each of them is of size k_0 or k_0 + 1, where k_0 = αk is an integer and α is a parameter, to be determined, that sets the lower bound on the initial cluster sizes. During the main loop, we allow the size of the clusters to vary in the range [2, k_1], where k_1 = βk for some predetermined fixed parameter β. When a cluster becomes a singleton, we remove it and transfer the node that it contained to the cluster where it fits best in terms of the information loss measure (Step 2d). When, on the other hand, a cluster becomes too large (i.e., its size exceeds the upper bound k_1), we split it at random into two equally sized clusters.

The main loop of the algorithm is repeated until an entire pass over all nodes in the network finds no node that could be moved to another cluster in order to decrease the information loss. That stopping criterion may be replaced by another one, in order to avoid iterations with negligible improvements; a natural criterion of that sort is to stop the main loop and continue to the next stage if the improvement in the information loss during the last execution of the main loop was too small. We used the latter criterion in our experiments, with a threshold of 0.5 percent.

Agglomerative step: Some of the clusters formed may be large, in the sense that their size is at least k, while others may be small. If there exist small clusters, we apply an agglomerative procedure on them in the following manner (Step 5): we arbitrarily select one small cluster and find which of the other clusters (of any size) is closest to it, in the sense that joining them causes the smallest increase in the information loss; we then merge the two clusters, and we repeat this procedure until all clusters are of size at least k.

We use α and β as two separate parameters to control the sizes of the clusters and, consequently, the information loss of the final output. Our aim is to find a setting of α and β that yields low information losses. For example, higher values of β result in larger clusters in the output, which implies higher information losses; a too small β, on the other hand, leads to a greater number of small clusters at the end of the first phase. Those small clusters then need to be unified in the agglomerative phase (Step 5), and that too might result in higher information losses, since the agglomerative phase is cruder than the first stage: it unifies whole clusters instead of moving single nodes. (α has a much smaller effect on the information loss, since it is used only once, at the beginning.) Experimentation with various values of α and β revealed that, in all types and sizes of networks that we tested, α = 0.5 and β = 1.5 gave good or best results.

Sequential clustering achieves significantly better results than the SaNGreeA algorithm. One reason is that greedy algorithms such as SaNGreeA have no mechanism for correcting bad clustering decisions made at an earlier stage, whereas sequential clustering constantly allows the correction of previous clustering decisions. Another advantage of sequential clustering over SaNGreeA is that at each stage of its operation it may evaluate the actual measure of information loss, since at each stage it has a full clustering of all nodes.

The latter advantage in terms of utility translates into a disadvantage in terms of runtime. While SaNGreeA requires O(N^2) evaluations of the cost function, the number of cost function evaluations in sequential clustering depends on N^3. (The algorithm scans all N nodes and, for each one, considers O(N/k) alternative cluster allocations; the computation of the cost function for each such candidate reallocation requires updating the inter-cluster costs for all O(N/k) pairs of clusters that involve either the cluster of origin or the cluster of destination in that contemplated move.)
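The main loop described above can be sketched as follows. This is a simplified illustration of ours, not the authors' implementation: cost() is a generic stand-in for the information loss measure, moves that would shrink a cluster below size 2 or grow one beyond k_1 = βk are skipped, and singleton removal (Step 2d) and the agglomerative step are omitted:

```python
import random

def sequential_clustering(nodes, cost, k, alpha=0.5, beta=1.5, seed=0):
    """Simplified sketch of the sequential clustering main loop.
    cost(clusters) stands in for the information loss measure
    (lower is better)."""
    rng = random.Random(seed)
    k0, k1 = max(2, int(alpha * k)), int(beta * k)
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    # Arbitrary initial partition into clusters of size about k0.
    clusters = [shuffled[i:i + k0] for i in range(0, len(shuffled), k0)]
    improved = True
    while improved:                        # stop at a local optimum
        improved = False
        for v in nodes:                    # cyclic pass over all records
            src = next(c for c in clusters if v in c)
            if len(src) <= 2:              # keep sizes within [2, k1]
                continue
            for dst in clusters:
                if dst is src or len(dst) >= k1:
                    continue
                before = cost(clusters)
                src.remove(v)
                dst.append(v)
                if cost(clusters) < before:
                    improved = True        # keep the beneficial move
                    src = dst
                else:                      # undo the move
                    dst.remove(v)
                    src.append(v)
    return clusters

# Toy usage: the cost is the sum of within-cluster value ranges, so the
# loop gravitates toward clusters of numerically close nodes.
total_range = lambda cs: sum(max(c) - min(c) for c in cs)
result = sequential_clustering(list(range(12)), total_range, k=6)
print([sorted(c) for c in result])
```

Since every kept move strictly decreases the cost, the loop terminates at a local optimum; as the text notes, restarting from several random partitions and keeping the best result mitigates the lack of a global guarantee.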
Hence, we proceed to describe a relaxed variant of sequential clustering that requires only O(N^2) evaluations of the cost function.

4. MODIFIED MEASURE FOR STRUCTURAL INFORMATION LOSS
Let B be the N × N adjacency matrix of the graph G = (V, E), i.e., B(n, n') = 1 if {v_n, v_n'} ∈ E and B(n, n') = 0 otherwise. Then a Hamming-like distance is defined on V as follows:

    d(v_n, v_n') = |{1 ≤ m ≤ N : B(n, m) ≠ B(n', m)}| / N.

This distance induces the following measure of structural information loss per cluster,

    I_S(C_t) = (2 / (|C_t|(|C_t| - 1))) · Σ_{v_n, v_n' ∈ C_t, n < n'} d(v_n, v_n'),

and the corresponding overall structural information loss is

    I_S(C) = Σ_{t=1..T} (|C_t| / N) · I_S(C_t).

In other words, the structural information loss of a given cluster is the average distance between all pairs of nodes in that cluster, and the structural information loss of the whole clustering is the corresponding weighted average of the structural information losses over all clusters.

There are two information loss measures: the generalization information loss measure and the structural information loss measure. The first describes how much descriptive detail is lost during generalization; the second describes how much structural detail is lost. For the descriptive information loss we use the LM (Loss Measure) metric, which associates a loss of information I_D(C_t) with each node in the cluster C_t; the overall LM information loss is obtained by averaging those losses of information over all nodes in V.
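The Hamming-like distance and the structural information loss above, together with the weighted combination of descriptive and structural loss used by the paper, can be sketched as follows (our own code on an invented 5-node graph; nodes 1 and 2 have identical neighborhoods, so grouping them together is structurally cheap):

```python
from itertools import combinations

def hamming_distance(B, n, m):
    """Normalized Hamming-like distance between rows n and m of the
    adjacency matrix B (a list of equal-length 0/1 rows)."""
    N = len(B)
    return sum(1 for j in range(N) if B[n][j] != B[m][j]) / N

def structural_loss(B, clusters):
    """Weighted average, over clusters, of the mean pairwise distance."""
    N = len(B)
    total = 0.0
    for cluster in clusters:
        pairs = list(combinations(cluster, 2))
        if pairs:
            avg = sum(hamming_distance(B, n, m) for n, m in pairs) / len(pairs)
            total += (len(cluster) / N) * avg
    return total

def total_loss(descriptive, structural, w=0.5):
    """Combined measure I = w * I_D + (1 - w) * I_S, with w in [0, 1]."""
    return w * descriptive + (1 - w) * structural

# Toy 5-node graph with edges 0-1, 0-2, 3-4; rows 1 and 2 are identical.
B = [[0, 1, 1, 0, 0],
     [1, 0, 0, 0, 0],
     [1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 1, 0]]
print(structural_loss(B, [[1, 2], [0, 3, 4]]))  # groups the twins 1, 2
print(structural_loss(B, [[0, 1], [2, 3, 4]]))
```

Clusterings that group structurally similar nodes score lower, which is what drives sequential clustering toward anonymizations that distort the graph less.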
The total weighted measure of information loss is then

    I(C) = w · I_D(C) + (1 - w) · I_S(C),

where w ∈ [0, 1].

Whenever the sequential clustering algorithm implements one of its decisions, be it moving a node from one cluster to another, splitting a large cluster, or unifying two small clusters, all that is needed in order to update the modified measure is to update the intra-cluster information loss measures of the two clusters involved in that action; there is no need to also update inter-cluster information loss measures involving all the other clusters (as is the case when using the original measure). This is why the number of cost function evaluations that sequential clustering needs to perform is reduced from O(N^3) to O(N^2) when switching to the modified measure, similarly to the SaNGreeA algorithm.

5. EXPERIMENTAL RESULTS
We tested the proposed sequential clustering algorithm on three types of graphs: first, a random graph generated by the Watts-Strogatz (WS) model [9]; second, a random graph generated by the Barabási-Albert (BA) model [10]; and finally a subset of the DBLP coauthorship graph. Table 1 lists all of the graphs that we tested: their type, their number of nodes |V|, and their number of edges |E|. The descriptive data was extracted from the CENSUS data set, which consists of 7 attributes (age, gender, education level, marital status, race, work class, country).

We tested the two versions of sequential clustering: the original version (using the measure I_S) is denoted in the plots by Sq, and the modified one by SqM. We compared them against the SaNGreeA algorithm of Campan and Truta [2] (denoted SNG) and the cluster-edge algorithm of Zheleva and Getoor [3] (denoted ZhG).

TABLE 1: Graph Types and Sizes

6. INFORMATION LOSS MEASURE
The average information losses achieved by the algorithms are displayed in Fig 3: on the WS (upper left), BA (upper right), and DBLP (lower left) graphs with w = 0.5, and on the WS graph with w = 0 (lower right), where the average is over the results obtained on the graphs with 1,000, 2,000, and 4,000 nodes. As can be seen, both variants of sequential clustering always achieve considerably better results than SaNGreeA and the Zheleva-Getoor algorithm when w = 0.5. The sequential clustering algorithm still issues the best results when w = 0, but the modified version then issues results similar to those of SaNGreeA. (The Zheleva-Getoor algorithm is irrelevant for the case w = 0, since it totally ignores the structural data.)

Fig. 3: Average information losses in the WS, BA, and DBLP graphs.
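The efficiency argument behind the modified measure can be illustrated directly: moving one node changes only the terms of the two clusters involved, so a cached per-cluster loss suffices and no inter-cluster terms need refreshing. This is our own sketch, with cluster_loss() as a simplified stand-in for the per-cluster intra-cluster measure:

```python
def move_and_update(clusters, per_cluster_loss, cluster_loss, v, src, dst):
    """Move node v from clusters[src] to clusters[dst] and refresh the
    cached per-cluster losses; only the two touched entries change."""
    clusters[src].remove(v)
    clusters[dst].append(v)
    per_cluster_loss[src] = cluster_loss(clusters[src])
    per_cluster_loss[dst] = cluster_loss(clusters[dst])
    return sum(per_cluster_loss)

# Toy stand-in: per-cluster loss = spread of the values in the cluster.
spread = lambda c: max(c) - min(c) if c else 0
clusters = [[1, 2, 9], [4, 5, 6]]
cache = [spread(c) for c in clusters]
print(move_and_update(clusters, cache, spread, 9, 0, 1))  # -> 6
```

With the original measure, by contrast, every candidate move would also require refreshing inter-cluster terms against all other clusters, which is where the extra factor of O(N/k) evaluations comes from.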
7. CONCLUSION
A key concern in releasing an anonymized database for data mining is the possibility of inferring the private data from the public data. This paper described the sequential clustering algorithm for k-anonymization. The objective of the proposed work is to arrive at an anonymized view of the social network without revealing any information about nodes, or links between nodes, that is controlled by the data holders.

The sequential clustering algorithm begins with an arbitrary partition of the records into some number T of clusters. It then goes over the N records in a cyclic manner; for each record it checks whether it may be moved from its present cluster to another one while increasing the utility of the induced anonymization. This loop may be iterated until it reaches a local optimum (a stage at which no single-record move improves the utility). As there is no guarantee that such a procedure finds the global optimum, it may be repeated many times with distinct arbitrary partitions as starting points, in order to discover the best local optimum among those repeated searches. The sequential clustering algorithm generates anonymizations by means of clustering with better utility than those achieved by existing algorithms.

REFERENCES
[1] M. Hay, D. Jensen, D.F. Towsley, G. Miklau, and P. Weis, "Resisting Structural Re-Identification in Anonymized Social Networks," Proc. VLDB Endowment, vol. 1, 2008.
[2] A. Campan and T.M. Truta, "Data and Structural k-Anonymity in Social Networks," Proc. 2nd ACM SIGKDD Int'l Workshop Privacy, Security, and Trust in KDD (PinKDD), pp. 33-54, 2008.
[3] E. Zheleva and L. Getoor, "Preserving the Privacy of Sensitive Relationships in Graph Data," Proc. 1st ACM SIGKDD Int'l Workshop Privacy, Security, and Trust in KDD (PinKDD), pp. 153-171, 2007.
[4] L. Sweeney, "Uniqueness of Simple Demographics in the U.S. Population," Laboratory for Int'l Data Privacy (LIDAP-WP4), 2000.
[5] J. Benaloh, "Secret Sharing Homomorphisms: Keeping Shares of a Secret Secret," Proc. Advances in Cryptology (Crypto), 1986.
[6] A. Campan and T.M. Truta, "Data and Structural k-Anonymity in Social Networks," Proc. 2nd ACM SIGKDD Int'l Workshop Privacy, Security, and Trust in KDD (PinKDD), 2008.
[7] T. Tassa and J. Goldberger, "Efficient Anonymizations with Enhanced Utility," Trans. Data Privacy, vol. 3, pp. 149-175, 2010.
[8] A. Gionis, F. Bonchi, and T. Tassa, "Identity Obfuscation in Graphs Through the Information Theoretic Lens," Proc. IEEE 27th Int'l Conf. Data Eng. (ICDE), 2011.
[9] D. Watts and S. Strogatz, "Collective Dynamics of 'Small-World' Networks," Nature, vol. 393, pp. 440-442, 1998.
[10] A.-L. Barabási and R. Albert, "Emergence of Scaling in Random Networks," Science, vol. 286, pp. 509-512, 1999.

AUTHOR(s)
K. Santhi received the B.Tech degree in Computer Science and Engineering from JNTU Anantapur in 2009 and is pursuing the M.Tech degree (2012-2014) at Sree Vidyanikethan Engineering College, Tirupati.

N. Sai Lohitha received the M.Tech degree from SRM University, Chennai, in 2011. She is currently working as an Assistant Professor at Sree Vidyanikethan Engineering College, Tirupati.
