					    International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
       Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 3, Issue 3, May-June 2014                                            ISSN 2278-6856

Anonymization of Social Networks by a Novel Clustering Algorithm

K. Santhi 1, N. Sai Lohitha 2

1,2 Sree Vidyanikethan Engineering College, Tirupati

Abstract—Nowadays, social networks are very widely used. With the influence of social networks on society, people are becoming increasingly sensitive about privacy issues in communication networks. Anonymization of social networks is essential to preserve the privacy of information gathered by such networks (Twitter, MySpace, Facebook, and Orkut). The goal of the proposed work is to arrive at an anonymized view of a social network without revealing any information about the nodes, or the links between nodes, that is forbidden by the data holders. The main contributions of this paper are a sequential clustering algorithm to anonymize a social network, and an information loss measure that quantifies what the anonymization process gives up in order to preserve privacy. The algorithm significantly outperforms the SaNGreeA algorithm, the leading clustering-based algorithm for achieving anonymity in social networks. SaNGreeA builds the clustering greedily, one cluster at a time, by selecting a seed node and then repeatedly adding the next node to it. Its main disadvantages are that it has no mechanism to correct bad clustering decisions made earlier, and that it relies on a structural information loss measure that can be evaluated only once the entire clustering is defined. The sequential clustering algorithm does not suffer from those problems, because at each stage of its execution it maintains a full clustering: it always makes decisions based on the actual information loss, and it constantly allows the correction of previous clustering decisions.

Keywords— Social networks, clustering, privacy preserving data mining, information loss.

1. INTRODUCTION
Networks are structures that describe a set of entities and the relationships between them. In particular, social networks provide information about the individuals in some population and the links between them, which may describe relationships such as friendship, correspondence, or collaboration. A network may be modeled by a graph in which the nodes represent the entities and the edges denote the relationships between them. Real social networks are more complex and contain additional information: edges may be labeled, and nodes may be associated with attributes that provide demographic information such as age, gender, location, or occupation. Such social networks are of interest to researchers in many fields, including market research, sociology, psychology, and epidemiology.

The data in such social networks cannot be released as it is, because it may contain sensitive information. Consequently, the data must be anonymized prior to its publication, in order to respect the privacy of the individuals whose sensitive information it includes. Data anonymization characteristically trades off with utility. Therefore, it is necessary to find a middle path in which the released anonymized data still holds enough utility on the one hand, and preserves privacy to some accepted degree on the other hand.

In this paper we propose a novel anonymization technique based on clustering the nodes into super-nodes, each of size at least k, where k is the required anonymity parameter. Studies of anonymizing social networks have so far considered centralized networks, i.e., networks held by a single player or data holder; in distributed settings, by contrast, the network data is split between several players. This study deals with social networks in which the nodes may be accompanied by descriptive data, and it proposes a new anonymization method (namely, clustering the nodes) that produces anonymized views of the graph with considerably smaller information losses than the anonymizations issued by the algorithms of [2] and [3].

2. ANONYMIZATION BY CLUSTERING
We represent the social network as a simple undirected graph G = (V, E), where V = {v1,…,vN} is the set of nodes and E is the set of edges, each edge being an unordered pair of nodes. Each node in the graph represents an individual, and an edge connecting two nodes describes the relationship between the respective individuals.

Each node in the social network graph is further described by a set of non-identifying attributes, such as zip code or age, which are called quasi-identifiers. Quasi-identifiers are not themselves unique identifiers, but a combination of quasi-identifier values can be used for unique identification by means of linking attacks [4]; therefore, they should be generalized in order to block such attacks.

Let A1, . . . , AI denote the quasi-identifier attributes as well as the sets of values that they may attain. For example, if A1 is the gender attribute, then its set of values is {M, F}. Each node vn, 1 ≤ n ≤ N, is described by a quasi-identifier record Rn = (Rn(1),…, Rn(I)) ∈ A1 × . . . × AI. Generalization means replacing quasi-identifier values, such as ages or zip codes, with less specific and less informative, but semantically consistent, values. For example, the zip codes {47824, 47865, 47843} can be generalized to 478** by stripping


the rightmost digits. Various generalization strategies have been proposed in [5]; they allow values from different domain levels to be combined to generate the generalization. Generalization on graph data can be done using one of the five categories proposed in [8].

To summarize, an anonymized social network is defined as follows:

Definition 2.1: A social network SN = (V, E, R) is defined over a set of nodes V = {v1,…,vN}; E, the structural data, is a set of edges (unordered pairs of nodes) describing the relations between the individuals in V; A1, . . . , AI is a set of quasi-identifier attributes; and R = {R1,…,RN}, where Rn ∈ A1 × . . . × AI, 1 ≤ n ≤ N, is the descriptive data of the individuals in V.

As in [1], [2], [3], we consider anonymizing a given social network by means of clustering. Let C = {C1,…,CT} be a partition of V into disjoint subsets, or clusters; i.e., V = C1 ∪ . . . ∪ CT and Ct ∩ Cs = ∅ for all 1 ≤ t ≠ s ≤ T. The corresponding clustered graph Gc = (Vc, Ec) is the graph in which the set of nodes Vc equals C, and an edge connects clusters Ct and Cs in Ec iff E contains an edge from a node in cluster Ct to a node in cluster Cs. Each node Ct ∈ Vc is associated with two pieces of information: the first is |Ct|, the number of original V-nodes that cluster Ct contains, and the second is |et|, the number of edges in E that connect nodes within Ct. In addition, each edge {Ct, Cs} ∈ Ec is labeled by a weight et,s, which is the number of edges in E that connect a node in cluster Ct to a node in cluster Cs.

Fig 1: A network and corresponding clustering

Definition 2.2: Consider a social network SN = (V, E, R) and let Ā1,…, ĀI be generalization taxonomies for the quasi-identifier attributes A1, . . . , AI. Given a clustering C = {C1,…,CT} of V, the corresponding clustered social network is SNC = (C, Ec, RI), where:

• Ec is a set of edges on Vc, where {Ct, Cs} ∈ Ec iff there exist vn ∈ Ct and vn′ ∈ Cs such that (vn, vn′) ∈ E;
• the clusters in Vc are labeled by their size and the number of intra-cluster edges, while the edges in Ec are labeled by the corresponding number of inter-cluster edges in E;
• RI = {R1I, . . . , RTI}, where RtI is the minimal record in Ā1 × … × ĀI that generalizes all quasi-identifier records of the individuals in Ct, 1 ≤ t ≤ T.

Consider a clustered graph Gc = (Vc, Ec) that was derived from the graph G = (V, E) of some social network SN = (V, E, R). Then, in addition to the structural data, which is specified by Ec and the integer labels of the nodes, (|Ct|, |et|), and of the edges, et,s, such a graph is accompanied by descriptive data derived from R, the original descriptive data.

Here we apply generalization of the quasi-identifiers, which is the usual method for anonymizing tabular data. Each quasi-identifier Ai, 1 ≤ i ≤ I, is accompanied by a collection Āi of subsets of Ai that may be used for generalization.

3. SEQUENTIAL CLUSTERING ALGORITHM
The sequential clustering algorithm for k-anonymizing tables was presented in [7], where it was shown to be very efficient in terms of both runtime and the utility of the output anonymization. We proceed to describe an adaptation of it for anonymizing social networks.

Fig 2: Sequential Clustering Algorithm

The algorithm begins with an arbitrary partition of the records into some number T of clusters. It then goes over the N records in a cyclic manner and, for each record, checks whether it may be moved from its present cluster to a different cluster while increasing the utility of the induced anonymization. This loop may be iterated until it reaches a local optimum (a point at which no single-record move improves the utility). As there is no guarantee that such a process finds the global


optimum, it may be repeated many times with different arbitrary partitions as the starting point, in order to find the best local optimum among those repeated searches.

The initial clusters are chosen so that all of them are of size k0 or k0 + 1, where k0 = αk is an integer and α is a parameter, to be determined, that controls the lower bound on the cluster sizes. During the main loop, we allow the size of the clusters to vary in the range [2, k1], where k1 = βk for some predetermined fixed parameter β. When a cluster becomes a singleton, we remove it and transfer the node that was in it to the cluster where it fits best in terms of the information loss measure (Step 2d). On the other hand, when a cluster becomes too large (i.e., its size exceeds the upper bound k1), we split it at random into two equally sized clusters.

The main loop of the algorithm is repeated until an entire pass over all nodes in the network finds no node that can be moved to another cluster so as to decrease the information loss. That stopping criterion may be replaced by another one, in order to avoid iterations with negligible improvements. A natural criterion of that sort is to stop the main loop and continue to the next stage if the improvement in the information loss during the last execution of the main loop was too small. In our experiments we used the latter criterion, with a threshold of 0.5 percent.

Agglomerative step: Some of the resulting clusters may be large, in the sense that their size is at least k, while others may be small. If there exist small clusters, we apply an agglomerative procedure to them in the following manner (Step 5): we arbitrarily select one small cluster and find which of the other clusters (of any size) is closest to it, in the sense that joining them causes the smallest increase in the information loss; after finding that closest cluster, we merge the two clusters. This procedure is repeated until all clusters are of size at least k.

We use α and β as two separate parameters to control the sizes of the clusters and, consequently, the information loss of the final output. Our aim is to find a setting of α and β that yields low information losses. For example, higher values of β result in larger clusters at the output, which implies higher information losses; on the other hand, a too small β leads to a greater number of small clusters at the end of the first phase. Those small clusters then need to be unified in the agglomerative phase (Step 5), and that too may result in higher information losses, since the agglomerative phase is cruder than the first stage, as it unifies whole clusters instead of moving single nodes. (α has a much smaller effect on the information loss, since it is used only once, at the beginning.)

Experimentation with various values of α and β revealed that for all types and sizes of networks that we tested, β = 1.5 and α = 0.5 gave good or best results. Sequential clustering may also be repeated several times with different random partitions as the starting point, in order to find the best local minimum among those repeated searches; it is known to perform better both in terms of runtime and in the quality of the output [6].

Sequential clustering achieves significantly better results than the SaNGreeA algorithm. One reason is that greedy algorithms, such as SaNGreeA, have no mechanism for correcting bad clustering decisions made at an earlier stage; sequential clustering, on the other hand, constantly allows the correction of previous clustering decisions. Another advantage of sequential clustering over SaNGreeA is that it can evaluate, at each stage of its operation, the actual measure of information loss, since at each stage it has a full clustering of all nodes.

The latter advantage in terms of utility translates into a disadvantage in terms of runtime. While SaNGreeA requires O(N^2) evaluations of the cost function, the number of cost function evaluations in sequential clustering depends on N^3. (The algorithm scans all N nodes, and for each one it considers O(N/k) alternative cluster allocations; the computation of the cost function for each such candidate clustering requires updating the inter-cluster costs IS,2(·,·) for all O(N/k) pairs of clusters that involve either the cluster of origin or the cluster of destination in the contemplated move.) Hence, we proceed to describe a relaxed variant of sequential clustering which requires only O(N^2) evaluations of the cost function.

4. MODIFIED MEASURE FOR STRUCTURAL INFORMATION LOSS
Let B be the N × N adjacency matrix of the graph G = (V, E); i.e., B(n, n′) = 1 if {vn, vn′} ∈ E and B(n, n′) = 0 otherwise. Then a Hamming-like distance is defined on V as follows:

d(vn, vn′) = |{1 ≤ m ≤ N : B(n, m) ≠ B(n′, m)}| / N.

This definition of distance induces the following measure of structural information loss per cluster:

ĪS(Ct) = (2 / (|Ct|(|Ct| − 1))) Σ d(vn, vn′), where the sum runs over all pairs of distinct nodes vn, vn′ ∈ Ct.
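The Hamming-like distance and the per-cluster loss it induces can be sketched in Python as follows. This is a toy illustration under our own naming; the function names and the 4-node example graph are not part of the paper.

```python
from itertools import combinations

def hamming_distance(B, n, n2):
    """d(vn, vn'): fraction of the N adjacency-matrix columns in which
    rows n and n2 differ."""
    N = len(B)
    return sum(1 for m in range(N) if B[n][m] != B[n2][m]) / N

def cluster_structural_loss(B, cluster):
    """Per-cluster structural loss: the average distance over all pairs
    of distinct nodes in the cluster (requires |cluster| >= 2)."""
    pairs = list(combinations(cluster, 2))
    return sum(hamming_distance(B, n, n2) for n, n2 in pairs) / len(pairs)

# Toy path graph on 4 nodes: edges 0-1, 1-2, 2-3.
B = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

# Rows 0 and 1 differ in columns 0, 1 and 2, so d = 3/4.
loss = cluster_structural_loss(B, [0, 1])  # 0.75
```

Since the loss of a cluster depends only on its own members, moving a node between clusters requires re-evaluating only the two clusters involved, which is the source of the O(N^2) saving discussed above.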



The corresponding overall structural information loss is

ĪS(C) = Σt=1..T (|Ct| / N) · ĪS(Ct),

where ĪS(Ct) is the per-cluster loss defined above. In other words, ĪS of a given cluster is the average distance between all pairs of nodes in that cluster, and ĪS of the whole clustering is the corresponding weighted average of the structural information losses over all clusters.

There are two information loss measures: the generalization information loss measure and the structural information loss measure. The first measure describes how much descriptive detail is lost during generalization; the second describes how much structural detail is lost.

For calculating the descriptive information loss we use the LM metric. The Loss Measure (LM) metric associates the following loss of information, ID(Ct), with every node in the cluster Ct:

ID(Ct) = (1/I) Σi=1..I (|R̄t(i)| − 1) / (|Ai| − 1),

where R̄t(i) is the generalized value of attribute Ai in the cluster's generalized record. The overall LM information loss is the result of averaging those losses of information over all nodes in V, i.e.,

ID(C) = (1/N) Σt=1..T |Ct| · ID(Ct).

Finally, the total weighted measure of information loss is

Ī(C) = w · ID(C) + (1 − w) · ĪS(C),

where w ∈ [0, 1]. Whenever the sequential clustering algorithm implements one of its decisions (be it moving a node from one cluster to another, splitting a large cluster, or unifying two small clusters), all that is needed in order to update Ī is to update the intra-cluster information loss measures of the two clusters involved in that action; there is no need to also update the inter-cluster information loss measures involving all other clusters (as is the case when using I). This is why the number of cost function evaluations that sequential clustering needs to perform drops from O(N^3) to O(N^2) when switching from I to Ī, matching the SaNGreeA algorithm.

5. EXPERIMENTAL RESULTS
We tested the proposed sequential clustering algorithm on three types of graphs: first, a random graph generated by the Watts-Strogatz (WS) model [9]; second, a random graph generated by the Barabási-Albert (BA) model [10]; and finally, a subset of the DBLP coauthorship graph. Table 1 lists all of the graphs that we tested: their type, the number of nodes |V|, and the number of edges |E|. The descriptive data was extracted from the CENSUS data set, which consists of 7 attributes (age, gender, education level, marital status, race, work class, country).

We experimented with the two versions of sequential clustering: the original version (using IS) is denoted in the plots by Sq, and the modified one (using ĪS) is denoted by SqM. We compared them against the SaNGreeA algorithm of Campan and Truta [2] (denoted SNG) and the cluster-edge algorithm of Zheleva and Getoor [3] (denoted ZhG).

TABLE 1: Graph Types and Sizes

6. INFORMATION LOSS MEASURE
The average information losses I(·) achieved by the algorithms are displayed in Fig. 3: on the WS (upper left), BA (upper right), and DBLP (lower left) graphs with w = 0.5, and on the WS graph with w = 0 (lower right). The average is over the results obtained on the graphs with 1,000, 2,000, and 4,000 nodes. As can be seen, both variants of sequential clustering always achieve considerably better results than SaNGreeA and the Zheleva-Getoor algorithm when w = 0.5.

The sequential clustering algorithm still yields the best results when w = 0, while the modified version yields results similar to SaNGreeA. (The Zheleva-Getoor algorithm is irrelevant for the case w = 0, since it totally ignores the structural data.)

Fig. 3: Average information losses in the WS, BA, and DBLP graphs
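The weighted combination of descriptive and structural loss used in these comparisons can be sketched as follows. The function name and the numeric loss values are our own illustration, not taken from the paper.

```python
def total_information_loss(descriptive_loss, structural_loss, w):
    """Weighted total measure I = w * I_D + (1 - w) * I_S, with w in [0, 1]."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * descriptive_loss + (1.0 - w) * structural_loss

# w = 0.5 weighs both components equally, as in most of the experiments;
# w = 0 ignores the descriptive loss entirely, as in the lower-right plot.
balanced = total_information_loss(0.4, 0.2, w=0.5)         # 0.5*0.4 + 0.5*0.2 = 0.3
structural_only = total_information_loss(0.4, 0.2, w=0.0)  # 0.2
```

Setting w = 0 explains why the Zheleva-Getoor algorithm drops out of that comparison: with no weight on the descriptive component, only the structural loss matters.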

7. CONCLUSION
The key concern in releasing an anonymized database for data mining is the possibility of inferring the private data from the public data. This paper described a sequential clustering algorithm for k-anonymization of social networks. The objective of the proposed work is to arrive at an anonymized view of a social network without revealing any information about the nodes, or the links between nodes in the graph, that is controlled by the data holders.

The sequential clustering algorithm begins with an arbitrary partition of the records into some number T of clusters. It then goes over the N records in a cyclic manner; for each record it checks whether it may be moved from its present cluster to another one while increasing the utility of the induced anonymization. This loop may be iterated until it reaches a local optimum (a stage at which no single-record move improves the utility). As there is no guarantee that such a procedure finds the global optimum, it may be repeated many times with distinct arbitrary partitions as the starting point, in order to discover the best local optimum among those repeated searches. The sequential clustering algorithm generates anonymizations by means of clustering with better utility than those achieved by existing algorithms.

REFERENCES
[1] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis, "Resisting Structural Re-Identification in Anonymized Social Networks," Proc. VLDB Endowment, vol. 1, 2008.
[2] A. Campan and T.M. Truta, "Data and Structural k-Anonymity in Social Networks," Proc. 2nd ACM SIGKDD Int'l Workshop on Privacy, Security, and Trust in KDD, pp. 33-54, 2008.
[3] E. Zheleva and L. Getoor, "Preserving the Privacy of Sensitive Relationships in Graph Data," Proc. 1st ACM SIGKDD Int'l Workshop on Privacy, Security, and Trust in KDD (PinKDD), pp. 153-171, 2007.
[4] L. Sweeney, "Uniqueness of Simple Demographics in the U.S. Population," Laboratory for Int'l Data Privacy (LIDAP-WP4), 2000.
[5] J. Benaloh, "Secret Sharing Homomorphisms: Keeping Shares of a Secret Secret," Proc. Advances in Cryptology (Crypto), 1986.
[6] A. Campan and T.M. Truta, "Data and Structural k-Anonymity in Social Networks," Proc. 2nd ACM SIGKDD Int'l Workshop on Privacy, Security, and Trust in KDD (PinKDD), 2008.
[7] J. Goldberger and T. Tassa, "Efficient Anonymizations with Enhanced Utility," Trans. Data Privacy, vol. 3, pp. 149-175, 2010.
[8] F. Bonchi, A. Gionis, and T. Tassa, "Identity Obfuscation in Graphs Through the Information Theoretic Lens," Proc. IEEE 27th Int'l Conf. Data Eng. (ICDE), 2011.
[9] D.J. Watts and S.H. Strogatz, "Collective Dynamics of 'Small-World' Networks," Nature, vol. 393, pp. 440-442, 1998.
[10] A.-L. Barabási and R. Albert, "Emergence of Scaling in Random Networks," Science, vol. 286, pp. 509-512, 1999.

AUTHORS
K. Santhi received the B.Tech degree in Computer Science and Engineering from JNTU Anantapur in 2009 and is pursuing the M.Tech degree (2012-2014) at Sree Vidyanikethan Engineering College, Tirupati.

N. Sai Lohitha received the M.Tech degree in 2011 from SRM University, Chennai. She is currently working as an Assistant Professor at Sree Vidyanikethan Engineering College, Tirupati.