									                                                      (IJCSIS) International Journal of Computer Science and Information Security,
                                                      Vol. 9, No. 4, April 2011

POur-NIR: Modified Node Importance Representative for Clustering of Categorical Data
S. Viswanadha Raju, Professor in CSE, SIT, JNTUH, Hyderabad, India
N. Sudhakar Reddy, Professor in CSE, SVCE, Tirupati, India
H. Venkateswara Reddy, Assoc. Prof. in CSE, VCE, Hyderabad, India
G. Sreenivasulu, Assoc. Prof. in CSE, VCE, Hyderabad, India
C. Nageswara Raju, Lecturer and HOD of CS, SVDC, Kadapa, India

Abstract - Evaluating node importance in clustering is an active research problem, and many methods have been developed. Most clustering algorithms rely on general similarity measures. In real situations, however, data usually change over time. Clustering such data with these algorithms not only decreases the quality of the clusters but also disregards the expectations of users, who usually require recent clustering results. In this regard we previously proposed the Our-NIR method, which the node importance results showed to be better than the method proposed by Ming-Syan Chen. Node importance is very useful in clustering categorical data, but Our-NIR still has a deficiency in the areas of data labeling and outlier detection. In this paper we modify the Our-NIR method for evaluating node importance by introducing a probability distribution, and we compare the results to show that the modified method performs better.

Keywords- clustering, NIR, Our-NIR, categorical data, node.

I. INTRODUCTION

Extracting knowledge from large amounts of data, known as data mining, is difficult. Clustering groups similar objects from a given data set into collections such that objects in different collections are dissimilar. Most clustering algorithms were developed for numerical data, where the task is comparatively easy; categorical data are harder to handle [1, 2, 12, 13]. Clustering is challenging in the categorical domain because the distance between data points is not defined. It is also not easy to determine the class label of an unknown data point in the categorical domain. Sampling techniques improve the speed of clustering, and the data points that are not sampled must then be allocated to the proper clusters. Data that depend on time are called time-evolving data. For example, the buying preferences of customers may change with time, depending on the current day of the week, the availability of alternatives, the discounting rate, and so on. Since data evolve with time, the underlying clusters may also change over time; this is the concept of data drifting [11, 16]. Clustering time-evolving data in the numerical domain [1, 5, 6, 10] has been explored in previous works, whereas the categorical domain has received much less attention and remains a challenging problem.

Our contribution is a modification of the Our-NIR method, which we proposed in [17] and which can utilize any clustering algorithm to detect drifting concepts. Because Our-NIR is modified with the help of a probability distribution, we refer to the new method as POur-NIR. We adopt the sliding window technique, and the initial data (at time t=0) are used for the initial clustering. The resulting clusters are represented using POur-NIR (Our-NIR with probability), in which the importance of each attribute value is measured. With this method we can find whether the data points in the next (current) sliding window belong to appropriate clusters of the last clustering result or are outliers; handling outliers is a future direction. We call this the temporal clustering result and compare it with the last clustering result to decide whether the concept has drifted. If no concept drift is detected, the POur-NIR is updated; otherwise, attribute values are dumped based on their importance and the data are reclustered using clustering techniques. In this paper, however, we only compare the node importance values of various methods with those of POur-NIR.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 provides the basic notations and the node representation. Section 4 discusses the POur-NIR method for node importance and presents results compared with the Ming-Syan Chen method and the Our-NIR method. Section 5 concludes the paper.

II. RELATED WORK

In this section, we discuss various clustering algorithms for categorical data, together with cluster


representatives and data labeling. We studied many clustering algorithms for time-evolving data. A cluster representative is used to summarize and characterize the clustering result; unlike in the numerical domain, it has not been fully discussed in the categorical domain.

In K-modes, an extension of the K-means algorithm to the categorical domain, a cluster is represented by its 'mode', which is composed of the most frequent attribute value in each attribute domain of that cluster. Although this cluster representative is simple, using only one attribute value per attribute domain to represent a cluster is questionable. It is composed of the attribute values with high co-occurrence. In the statistical categorical clustering algorithms [3, 4] such as COOLCAT and LIMBO, data points are grouped based on statistics. In COOLCAT, data points are separated in such a way that the expected entropy of the whole arrangement is minimized. In LIMBO, the information bottleneck method is applied to minimize the information lost when data points are summarized into clusters. However, all of the above categorical clustering algorithms focus on clustering the entire data set and do not consider time-evolving trends. In addition, the cluster representatives in these algorithms are not clearly defined.

In this paper, the first objective of our method is to represent the clusters by the importance of the attribute values. This representation is more efficient than using representative points.

After scanning the literature, it is clear that clustering categorical data has often been left untouched because of the complexity involved. Time-evolving categorical data must be clustered in due course, so the clustering problem can be viewed as follows. A series of categorical data points D is given, where each data point is a vector of q attribute values, i.e., $p_j = (p_{j1}, p_{j2}, \ldots, p_{jq})$, and $A = \{A_1, A_2, \ldots, A_q\}$, where $A_a$ is the a-th categorical attribute, $1 \le a \le q$. A window size N is given so that the data set D is separated into several continuous subsets $S^t$, where the number of data points in each $S^t$ is N. The superscript t is the identification number of the sliding window and is also called the time stamp. The first N data points of D form the first data slide, or first sliding window, $S^0$. Our intention is to cluster every data slide and to relate its clusters to the clusters formed from the previous data slides. Several notations and representations are used in our work to ease the presentation:

III. NODE REPRESENTATION

Categorical or mixed data can have several representations. In our work we use two kinds of data representation. In the first kind, every data point in the sliding window (data slide) is divided into distinct points, each of which is considered a new node. Each node has two parts: the name, or categorical value, is placed in the pre-part of the node, whereas the post-part contains the numerical value of that data point or node. For example, a node whose categorical part is the attribute name "COMPOSE" and whose numerical part is the number of occurrences in the document, 24, is represented as follows:

    Node [COMPOSE: 24]

This representation reduces the ambiguity that may prevail among attributes, since many attributes may have the same value. Introducing the categorical part into the node eliminates the risk of confusion.

The second representation uses a data description file that describes the data attributes; a data attribute is recognized through a transitive relation. This is a simplification of the first representation, the only difference being that the categorical part is kept in another file. It may look like a numerical representation of the data at first glance, but the value used to represent an attribute may be numerical, binary, or categorical. This eases the effort required. This representation is also useful for computing the importance of nodes in the data set used in our work.

IV. IMPORTANCE OF NODE: POur-NIR

The distribution of the nodes described in the previous section represents a cluster. As mentioned, every node has an attribute value, and the same value is used to find the distribution of the data points. Hence the importance of a node plays a great role in finding clusters. The importance of a node is evaluated with the rules that follow. Here the r-th node in cluster $c_i$ is denoted N[i, r], the number of data points in cluster $c_i$ is $m_i$, and k is the number of clusters.

Rule 1 (Positive probability of node N[i, r]): The probability $p_i$ of the node in the cluster is calculated as follows:

$$p_i = \frac{|N[i, r]|}{\sum_{z=1}^{k} |N[z, r]|}$$
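
The window partition of D and the node counts |N[i, r]| that the importance rules rely on can be illustrated with a minimal Python sketch. The function names `sliding_windows` and `node_counts`, the encoding of a node as an (attribute index, value) pair, and the sample data are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def sliding_windows(points, window_size):
    """Split the data set D into consecutive windows S^0, S^1, ... of N points.

    If len(points) is not a multiple of window_size, the last window is
    shorter (an assumption; the paper does not discuss this case).
    """
    return [points[start:start + window_size]
            for start in range(0, len(points), window_size)]


def node_counts(points):
    """Count how often each node (attribute index, value) occurs in `points`."""
    return Counter((a, value) for point in points for a, value in enumerate(point))


# Hypothetical data set D: six points with q = 3 categorical attributes.
D = [("A", "M", "C"), ("A", "M", "C"), ("A", "M", "D"),
     ("B", "M", "E"), ("B", "M", "E"), ("B", "M", "E")]

windows = sliding_windows(D, window_size=3)   # S^0 and S^1
counts = node_counts(windows[0])              # node counts inside S^0
print(counts[(0, "A")], counts[(2, "C")])     # prints: 3 2
```

Keying each count by the attribute index as well as the value mirrors the first node representation: two attributes that share the same value are still counted as different nodes.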


Rule 2 (Negative frequency of node N[i, r]): The negative frequency of the node in cluster $c_j$ is calculated from the probability of the node in that cluster:

$$q_j = 1 - \frac{|N[j, r]|}{\sum_{z=1}^{k} |N[z, r]|}$$

Rule 3 (Node distribution function): The value of the node distribution function is the product of Rule 1 and Rule 2:

$$d(N[i, r]) = p_i \prod_{j \ne i} q_j$$

Rule 4 (Weighted function): The importance of node N[i, r] is the product of Rule 3 and the frequency of the node in the i-th cluster:

$$W(c_i, N[i, r]) = \frac{|N[i, r]|}{m_i} \cdot d(N[i, r])$$

The weighting function is designed to measure the distribution of the node between clusters based on information theory [15]. It measures the entropy of the node between clusters. Suppose that a node occurs in all clusters uniformly. Such a node contains the maximum uncertainty and provides little clustering characteristic, so it should have a small weight. Moreover, the maximum entropy value of a node between clusters equals. Our-NIR reduces the computation time because normalization is not required, whereas the method proposed by Chen uses normalization and assigns zero importance to the highest-frequency node, which leads to impure clusters. With the proposed POur-NIR method the highest-frequency node can receive a relative importance, which reduces the impurity.

The importance of the node N[i, r] in cluster $c_i$ is measured by multiplying the frequency of the node in $c_i$ by the node distribution function of Rule 3, i.e., by the weighting function $W(c_i, N[i, r])$. Note that both the frequency $|N[i, r]| / m_i$ and the node distribution function $d(N[i, r])$ lie in [0, 1], which implies that the importance value $W(c_i, N[i, r])$ also lies in [0, 1].

The new method is related to the idea of conceptual clustering [9], which creates a conceptual structure to represent a concept (cluster) during clustering. However, NIR only analyzes the conceptual structure and does not perform clustering, i.e., there is no objective function such as category utility (CU) [11] in conceptual clustering to lead the clustering procedure. In this respect our method can better support the clustering of data points over time.

Fig 1. Sample categorical data points and the initial clustering performed for sliding window 1.

Example 1: Consider the data set in Fig 1. Cluster $c_1^1$ contains three data points. The node {A1=A} occurs three times in $c_1^1$ and does not occur in $c_2^1$. The importance of node {A1=A} in $c_1^1$ and in $c_2^1$ is calculated as follows. The weight of the node is d({A1=A}) = (3/3) × (1 − 0/3) = 1, so the importance of the node {A1=A} in cluster $c_1^1$ is W($c_1^1$, {A1=A}) = (3/3) × 1 = 1, and in cluster $c_2^1$ it is zero. The remaining nodes are handled similarly:

d({A2=M}) = (3/6) × (1 − 3/6) = 0.25, so W($c_1^1$, {A2=M}) = (3/3) × 0.25 = 0.25
d({A3=C}) = (2/2) × (1 − 0/2) = 1, so W($c_1^1$, {A3=C}) = (2/3) × 1 ≈ 0.67
d({A3=D}) = (1/1) × (1 − 0/1) = 1, so W($c_1^1$, {A3=D}) = (1/3) × 1 ≈ 0.33

The importance of the nodes in cluster $c_2^1$ is calculated in the same way.

Fig 2. Node importance values for clusters C1 and C2.
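
The four rules and the arithmetic of Example 1 can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function names `node_counts` and `pour_nir` are hypothetical, and because Fig 1 is not reproduced here, the data points below were invented so that their node counts match the counts quoted in Example 1 (the values B and E in the second cluster are placeholders).

```python
from collections import Counter
from math import prod


def node_counts(cluster):
    """Count how often each node (attribute index, value) occurs in a cluster."""
    return Counter((a, value) for point in cluster for a, value in enumerate(point))


def pour_nir(clusters):
    """Return one {node: importance} dict per cluster, following Rules 1-4.

    Assumes every cluster is non-empty, so m_i > 0.
    """
    per_cluster = [node_counts(c) for c in clusters]
    all_nodes = set().union(*per_cluster)
    k = len(clusters)
    result = []
    for i, cluster in enumerate(clusters):
        m_i = len(cluster)
        weights = {}
        for node in all_nodes:
            total = sum(counts[node] for counts in per_cluster)
            p_i = per_cluster[i][node] / total                    # Rule 1
            q_others = prod(1 - per_cluster[j][node] / total      # Rule 2
                            for j in range(k) if j != i)
            d = p_i * q_others                                    # Rule 3
            weights[node] = per_cluster[i][node] / m_i * d        # Rule 4
        result.append(weights)
    return result


# Hypothetical clusters chosen to reproduce the counts used in Example 1.
c1 = [("A", "M", "C"), ("A", "M", "C"), ("A", "M", "D")]
c2 = [("B", "M", "E"), ("B", "M", "E"), ("B", "M", "E")]

importance = pour_nir([c1, c2])
print(importance[0][(0, "A")])            # 1.0
print(importance[0][(1, "M")])            # 0.25
print(round(importance[0][(2, "C")], 2))  # 0.67
print(round(importance[0][(2, "D")], 2))  # 0.33
```

A node that is absent from a cluster receives importance zero there, because both the Rule 1 probability and the Rule 4 frequency factor vanish; this matches the statement that {A1=A} has zero importance in the second cluster.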


Fig 3 (a). NIR values of nodes for cluster C1.
Fig 3 (b). NIR values of nodes for cluster C2.

A. Comparison of Our-NIR and POur-NIR

Fig 2 shows the importance values of the attributes for a window size of 5. This study covers two slides of data points as the data evolve from t1 to t2. Having defined the importance of a node as above, we compare the node importance values of the methods. The node importance values of our method are distributed in a way that increases the purity of the clusters, which in turn affects the accuracy of the clustering. Figures 3 (a) and (b) present the importance values maintained by each method over the 4 attributes of each cluster for the given sliding window size. The importance values of the NIR method proposed by Ming-Syan Chen range from 0 to 1, and that method has one major drawback: a node that occurs in both clusters with the highest frequency may still get zero importance. The values of Our-NIR range from 0.16 to 0.5, which means that all attributes get more or less the same importance. The values of the POur-NIR method range from 0.25 to 1, which lies between the two methods above, so POur-NIR can be better for maintaining the purity of the clustering.

V. CONCLUSION

In this paper, we took our previous work, the Our-NIR method, and modified it with a probability distribution in order to find the importance of nodes. Using the same example, we analyzed the differences in the node importance values of attributes within the same cluster, which play an important role in clustering. Future work is to decide the class label of unclustered data points. The results demonstrate that the POur-NIR method is more accurate than our previous method, as discussed in section 4, and that it also improves the precision and recall of DCD.

REFERENCES

[1] C. Aggarwal, J. Han, J. Wang, and P. Yu, “A Framework for Clustering Evolving Data Streams,” Proc. 29th Int'l Conf. Very Large Data Bases (VLDB).
[2] C.C. Aggarwal, J.L. Wolf, P.S. Yu, C. Procopiuc, and J.S. Park, “Fast Algorithms for Projected Clustering,” Proc. ACM SIGMOD '99, pp. 61-72.
[3] P. Andritsos, P. Tsaparas, R.J. Miller, and K.C. Sevcik, “Limbo: Scalable Clustering of Categorical Data,” Proc. Ninth Int'l Conf. Extending Database Technology (EDBT), 2004.
[4] D. Barbará, Y. Li, and J. Couto, “Coolcat: An Entropy-Based Algorithm for Categorical Clustering,” Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM), 2002.
[5] F. Cao, M. Ester, W. Qian, and A. Zhou, “Density-Based Clustering over an Evolving Data Stream with Noise,” Proc. Sixth SIAM Int'l Conf. Data Mining (SDM), 2006.
[6] D. Chakrabarti, R. Kumar, and A. Tomkins, “Evolutionary Clustering,” Proc. ACM SIGKDD '06, pp. 554-560, 2006.
[7] H.-L. Chen, K.-T. Chuang, and M.-S. Chen, “Labeling Unclustered Categorical Data into Clusters Based on the Important Attribute Values,” Proc. Fifth IEEE Int'l Conf. Data Mining (ICDM), 2005.
[8] H.-L. Chen, M.-S. Chen, and S-U Chen Lin, “Framework for Clustering Concept-Drifting Categorical Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 5, 2009.
[9] D.H. Fisher, “Knowledge Acquisition via Incremental Conceptual Clustering,” Machine Learning, 1987.
[10] M.M. Gaber and P.S. Yu, “Detection and Classification of Changes in Evolving Data Streams,” International Journal of Information Technology and Decision Making, vol. 5, no. 4, 2006.
[11] M.A. Gluck and J.E. Corter, “Information Uncertainty and the Utility of Categories,” Proc. Seventh Ann. Conf. Cognitive Science Soc., pp. 283-287, 1985.


[12] G. Hulton and Spencer, “Mining Time-Changing Data Streams,” Proc. ACM SIGKDD, 2001.
[13] A.K. Jain, M.N. Murthy, and P.J. Flyn, “Data Clustering: A Review,” ACM Computing Survey.
[14] O. Narsoui and C. Rojas, “Robust Clustering for Tracking Noisy Evolving Data Streams,” SIAM Int. Conference Data Mining, 2006.
[15] C.E. Shannon, “A Mathematical Theory of Communication,” Bell System Technical J., 1948.
[16] S. Viswanadha Raju, H. Venkateswara Reddy, and N. Sudhakar Reddy, “A Threshold for Clustering Concept-Drifting Categorical Data,” IEEE Computer Society, ICMLC 2011.
[17] S. Viswanadha Raju, H. Venkateswara Reddy, and N. Sudhakar Reddy, “Our-NIR: Node Importance Representation of Clustering Categorical Data,” International Journal of Computer Science and Technology (accepted), 2011.
