

ISSN 2319-2720
Volume 1, No.2, September – October 2012
International Journal of Computing, Communications and Networking
Available Online at http://warse.org/pdfs/ijccn04122012.pdf
Vishal Shrivastava et al, International Journal of Computing, Communications and Networking, 1(2), September – October 2012, 68-74

                A Study of Various Clustering Algorithms on Retail Sales Data

Vishal Shrivastava (1), Prem Narayan Arya (2)
(1) M.Tech. (Software Systems), SATI, Vidisha, India. shrvishal@gmail.com
(2) Asst. Prof., Dept. of Computer Applications, SATI, Vidisha, India. Premnarayan.arya@rediffmail.com

ABSTRACT

Data mining is the process of extracting hidden knowledge from databases. Clustering is one of the important functionalities of data mining. It is an adaptive methodology in which objects are grouped together on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. Various clustering algorithms have been developed, resulting in better clustering performance on datasets. This paper discusses four major clustering algorithms (K-Means, Density based, Filtered and Farthest First) and compares their performance in terms of their ability to build clusters that correctly match the classes. The algorithms are tested on a retail sales dataset using the WEKA interface, and the number of correctly clustered instances is computed in proportion to the incorrectly clustered ones. The four algorithms are compared on the basis of the percentage of incorrectly classified instances.

Keywords: Data mining, Clustering, k means, Retail

1. INTRODUCTION

The process of knowledge discovery runs as an iterative sequence of steps: data cleaning, integration, selection, transformation, data mining, pattern evaluation and knowledge presentation. Data mining functionalities include characterization and discrimination, mining of frequent patterns, associations and correlations, classification and prediction, cluster analysis, outlier analysis and evolution analysis. Clustering is the process of grouping data into classes or clusters, so that objects within a cluster have high similarity to one another and are very dissimilar to objects in other clusters. Dissimilarity is assessed from the attribute values that describe the objects.

The objects are grouped on the principle of maximizing the intra-class similarity and reducing the inter-class similarity to the minimum. The data set is first partitioned into groups on the basis of data similarity (e.g. by clustering), and labels are then assigned to the comparatively small number of groups.

Several clustering techniques exist: partitioning methods, hierarchical methods, density based methods, grid based methods, model based methods, methods for high dimensional data and constraint based clustering. Clustering is also called data segmentation because it partitions large data sets into groups according to their similarity.

Figure 1: Cluster Analysis

Clustering can be used for outlier detection, where outliers are more interesting than common cases, e.g.
monitoring of criminal activities in electronic commerce, credit card fraud detection etc. Clustering can serve as a pre-processing step for other algorithms such as characterization, attribute subset selection and classification, which then operate on the detected clusters and the selected attributes or features. Research areas include data mining, statistics, machine learning, biology, spatial database technology and marketing. Clustering is a form of unsupervised learning. Unlike classification, it does not rely on predefined classes and class-labelled training examples. Clustering is therefore learning by observation rather than learning by examples.

A "clustering" is a set of such clusters, usually containing all objects in the data set. It may additionally specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clusterings can broadly be distinguished into:

Hard clustering: each object belongs to a cluster or not.

Soft clustering (fuzzy clustering): each object belongs to each cluster to a certain degree (e.g. a likelihood of belonging to the cluster).

Finer distinctions are also possible:

Strict partitioning clustering: every object belongs to exactly one cluster.

Strict partitioning clustering with outliers: objects can also belong to no cluster, and are then considered outliers.

Overlapping clustering (also called alternative clustering or multi-view clustering): objects may belong to more than one cluster.

Hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster.

Subspace clustering: although an overlapping clustering, within a uniquely defined subspace clusters are not expected to overlap.

Typical requirements of clustering in data mining are:
(i) scalability,
(ii) ability to deal with various types of attributes,
(iii) discovery of clusters with different shapes,
(iv) minimal requirement for domain knowledge to determine input parameters,
(v) ability to deal with noisy data,
(vi) incremental clustering and insensitivity to the order of input records,
(vii) high dimensionality,
(viii) constraint based clustering, and
(ix) interpretability and usability.

k-Means Clustering

We determine the number of clusters N and assume initial centroids for these clusters. Any N random objects, or simply the first N objects, can serve as the initial centroids. The k-means algorithm then repeats the three steps below until convergence, i.e. until stability is reached (no object changes group):

1. Determine the coordinates of the centroids.
2. Determine the distance of each object from the centroids.
3. Group the objects based on minimum distance (assign each object to its closest centroid).

This is shown step by step in Figure 1.1.

Density based Clustering

To discover clusters of arbitrary shape, density based clustering methods have been developed. These typically regard clusters as dense regions of objects in the data space that are separated by regions of low density.

Filtered Clustering

A filter adds a new nominal attribute that represents the cluster assigned to each instance by the specified clustering algorithm. Either the clustering algorithm is built with the first batch of data, or one specifies a serialized clusterer model file to use instead.

Farthest First Clustering

Farthest first is a variant of k-means that places each cluster centre in turn at the point farthest from the existing cluster centres. This point must lie within the data area. In most cases this greatly speeds up the clustering, since fewer reassignments and adjustments are needed.

2. REVIEW OF LITERATURE

Xiaozhe Wang et al. in 2006 proposed a method for clustering time series based on their structural characteristics: rather than clustering the point values directly, it groups the series based on global features extracted from them. The global measures describing the time series are obtained by applying statistical methods that capture the following characteristics: trend, seasonality, periodicity, serial correlation, skewness, kurtosis, chaos, nonlinearity and self-similarity. Since the method clusters on the extracted global measures, it reduces the dimensionality of the time series and is much less sensitive to noisy data. A search mechanism is then
provided to find the best selection from the feature set that should be used as the clustering inputs [2].

Li Wei et al. in 2005 described a tool for visualizing and data mining medical time series. They observed that the increasing interest in time series data mining has had surprisingly little impact on real world medical applications: practitioners who work with time series regularly rarely take advantage of the tools that the data mining community has made available. Their approach extracts features from a time series of arbitrary length and uses information about the relative frequency of these features to color a bitmap in a principled way. By observing the similarities and differences within a collection of bitmaps, a user can quickly find clusters, exceptions and other regularities within the data collection.

An online algorithm for segmenting time series was presented by Eamonn Keogh et al. in 2001. This was the first extensive review and empirical comparison of time series segmentation algorithms from a data mining point of view. It showed that the most popular approach, Sliding Windows, generally produces poor results, and that the second most popular approach, Top-Down, produces reasonable results but is not scalable. In contrast, the least known approach, Bottom-Up, produces excellent results and scales linearly with the size of the dataset. In addition, the work introduced SWAB, a new online algorithm, which scales linearly with the size of the dataset, requires only constant space and produces high quality approximations of the data [4].

A model based clustering for time series with irregular interval was proposed by Xiao-Tao Zhang et al. in 2004. Clustering problems are central to many knowledge discovery and data mining tasks. However, most existing clustering methods can only work with fixed-interval representations of data patterns, ignoring the variance of the time axis. This work studied the clustering of data patterns that are sampled at irregular intervals and proposed a model-based approach that uses the cepstrum distance metric and the Autoregressive Conditional Duration (ACD) model. Experimental results on real datasets show that the method is effective in clustering irregularly spaced time series, and the conclusions drawn from the experiments agree with market microstructure theories.

Hui Ding et al. in 2008 used experiments to compare representations and distance measures for querying and mining time series data. They conducted an extensive set of time series experiments, re-implementing 8 different representation methods and 9 similarity measures and their variants, and testing their effectiveness on 38 time series data sets from a wide variety of application domains. They gave an overview of these techniques and presented comparative experimental results on their effectiveness. Their experiments provided a unified validation of existing achievements and suggested that certain claims in the literature may be overly optimistic.

Ehsan Hajizadeh et al. in 2010 examined and gave an overview of applications of data mining techniques such as decision trees, association rules, neural networks and factor analysis in stock markets. The paper also identifies progressive applications, known gaps and less explored areas, and outlines future work for researchers. It discusses the problems of data mining in finance (stock markets) and the specific requirements for data mining methods, including making interpretations, incorporating relations and probabilistic learning. The data mining techniques discussed improve the performance of pattern discovery methods that deal with complex numeric and alphanumeric data involving structured objects, text and data on a variety of discrete and continuous scales (nominal, order, absolute and so on). It also shows the benefits of using such techniques for stock market forecasting [7].

Jiangjiao Duan et al. in 2005 noted that model-based clustering is one of the most important approaches to time series data mining, but that the clustering process may encounter several problems. They proposed a novel time series clustering algorithm incorporating recursive Hidden Markov Model (HMM) training. It contributes in the following aspects:
1) It recursively trains models and uses this model information in the process of agglomerative hierarchical clustering.
2) It builds an HMM for each time series cluster to describe the clusters.

To evaluate the effectiveness of the algorithm, several experiments were conducted on both synthetic and real world data. The results show that this approach achieves a better correctness rate than the conventional HMM-based clustering algorithm [8].

Information Mining Over Heterogeneous and High-Dimensional Time-Series Data in Clinical Trials Databases
was carried out by Fatih Altiparmak et al. in 2006. They proposed a data mining approach involving two steps: (i) applying a data mining algorithm over homogeneous subsets of the data, and (ii) identifying common or distinct patterns over the information gathered in the first step. The approach is implemented for heterogeneous, high dimensional time series clinical trials data. Using this framework, they propose a new way of using frequent item set mining, as well as clustering and declustering techniques with better distance metrics for measuring similarity among time series data. By clustering the data, the method finds groups of analytes (substances present in blood) that are most strongly correlated. Most of these known relationships are verified by the clinical panels and, in addition, the method identifies novel groups that require further biomedical analysis. A slight modification of the method results in an effective declustering of high dimensional time series data, which is then used for feature selection. Using industry-sponsored clinical trials data sets, they are able to find a smaller set of analytes that effectively models the state of normal health [9].

3. COMPARATIVE PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS

For the performance evaluation of the four most popular clustering techniques (K-Means clustering, Density based clustering, Filtered clustering and Farthest First clustering), we took a dataset containing numerical (continuous) attributes and converted these into nominal attributes. The dataset contains 26 attributes. Taking the number of bays as the assumed class attribute, the clusters are generated by applying the algorithms listed below through the Weka interface. Weka is a landmark system in the history of the data mining and machine learning research communities, because it is the only toolkit that has gained such widespread adoption and survived for an extended period of time (the first version of Weka was released 11 years ago).

We performed various experiments on the retail sales data, applying the following 4 clustering methods:

i. Simple kMeans Clustering
ii. Density Based Clustering
iii. Filtered Clustering
iv. Farthest First Clustering

The clustering methods themselves were discussed in the previous section. All of them are applied to the data described below, and the results are obtained with the help of the Weka data mining tool.

Retail sales is an important issue in the market. Much research is done in this area to understand market patterns, customer behaviour, demand and supply, inventory etc. In this work we also use retail sales data. The original data contain 126 attributes and 3067 instances, but in our experiments we take the following 26 attributes with all 3067 instances. This is the retail sales per capita data of different states of America.

Table 1: All Variable Names

S.No.  Variable Name  Definition
1      STATE          State Identifier ICPSR
2      PCTBLK3        % black in 1930
3      PCTURB3        % urban in 1930
4      PCTFRM3        % of land on farms in 1929
5      PCTILL3        % illiterate over age 10, 1930
6      PFORB3         % foreign born
7      NDMTCODE       County code based on ICPSR county code with adjusted values for combined counties
8      AREA           Area in square miles
9      LATITUDE       Latitude of county seat
10     LONGITUD       Longitude of county seat
11     STAB           State Name
12     CNAMESR        County Name
13     PP301019       % population aged 10-19, 1930
14     PP302029       % population aged 20-29, 1930
15     PP303034       % population aged 30-34, 1930
16     PP303544       % population aged 35-44, 1930
17     PP304554       % population aged 45-54, 1930
18     PP305564       % population aged 55-64, 1930
19     PP3065UP       % population aged 65 and over, 1930
20     BAY            Number of Bays in County
21     BEACH          Number of Beaches
22     LAKE           Number of Lakes
23     RRTSAP39       Retail Sales per Capita in 1939 in 1967 $
24     RRTSAP35       Retail Sales per Capita in 1935 in 1967 $
25     RRTSAP33       Retail Sales per Capita in 1933 in 1967 $
26     RRTSAP29       Retail Sales per Capita in 1929 in 1967 $

4. EXPERIMENTAL SIMULATION AND RESULTS

The four algorithms above have implementations in Weka, on which the simulations were carried out in order to measure the performance of the algorithms on the dataset. The results are summarized in the following tables and graphs.

Table 2: Cluster Distribution

Name of the Cluster Method  No. of clusters  Instances in cluster 1  Instances in cluster 2
Simple kMeans               2                1675                    1392
Density based Cluster       2                1688                    1379
Filtered Cluster            2                1675                    1392
Farthest First              2                2904                    163

Figure 2: Cluster Distribution (bar chart of the instances in each cluster for Simple kMeans, Density based Cluster, Filtered Cluster and Farthest First)

Table 3: Clustering Performance

Method                 No. of clusters  Incorrectly clustered instances  % of incorrectly clustered instances
Simple kMeans          4                1500                             48.9077
Simple kMeans          3                1436                             46.821
Simple kMeans          2                964                              31.4314
Density Based Cluster  2                965                              31.464
Filtered Cluster       2                964                              31.4314
Farthest First         2                804                              26.2145
Farthest First         3                816                              26.6058
Farthest First         4                814                              26.5406

Figure 3: Clustering Performance (chart of the number of clusters and the % of incorrectly clustered instances for each method)
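The experiment in this section (cluster the instances, then count how many fall in a cluster that disagrees with their class) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not Weka's implementation: the function names, the toy points and the toy labels are hypothetical, k-means starts from the first N objects as described earlier, farthest first is assumed to start from the first object, and each cluster is simply mapped to its majority class, which may differ from the mapping Weka uses. The percentage itself is just the count of incorrectly clustered instances over all 3067 instances, e.g. 964 / 3067 = 31.4314 %.

```python
from collections import Counter
import math


def nearest(point, centres):
    """Index of the centre closest to point (Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: math.dist(point, centres[i]))


def k_means(points, n_clusters, max_iter=100):
    """Plain k-means; the first n_clusters objects are the initial centroids."""
    centres = [tuple(p) for p in points[:n_clusters]]
    assignment = None
    for _ in range(max_iter):
        # Distance of each object to the centroids, grouped by minimum distance.
        new_assignment = [nearest(p, centres) for p in points]
        if new_assignment == assignment:  # stability: no object moved group
            break
        assignment = new_assignment
        # Recompute the coordinates of each centroid as the mean of its members.
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment


def farthest_first(points, n_clusters):
    """Each new centre is the data point farthest from the centres chosen so far."""
    centres = [tuple(points[0])]  # assumption: start from the first object
    while len(centres) < n_clusters:
        centres.append(tuple(max(
            points, key=lambda p: min(math.dist(p, c) for c in centres))))
    return [nearest(p, centres) for p in points]


def pct_incorrect(assignment, labels):
    """Percentage of instances whose class differs from their cluster's majority class."""
    wrong = 0
    for c in set(assignment):
        classes = [l for a, l in zip(assignment, labels) if a == c]
        wrong += len(classes) - Counter(classes).most_common(1)[0][1]
    return 100.0 * wrong / len(labels)


# Hypothetical toy data: two well separated groups, one instance "mislabelled".
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["a", "a", "a", "b", "b", "a"]
for name, cluster in (("k-means", k_means), ("farthest first", farthest_first)):
    print(name, round(pct_incorrect(cluster(points, 2), labels), 4))
```

On this toy data both methods recover the same two groups, so one of the six instances counts as incorrectly clustered. On real data the two methods generally produce different partitions, which is what the tables above compare.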


Figure 4: Clustering Performance (chart of the number of clusters, the incorrectly clustered instances and the % of incorrectly clustered instances for each method)

The performance of a clustering method is measured by the percentage of incorrectly classified instances: the lower this percentage, the better the clustering. Farthest first clustering gives better performance than k-means clustering, density based clustering and filtered clustering. Its result is also nearly independent of the number of clusters (26.2145 % to 26.6058 % for 2 to 4 clusters), while the k-means result depends strongly on the number of clusters (31.4314 % to 48.9077 %). Farthest first clustering gives a fast analysis in terms of running time, though it may come with a comparatively high error rate. On this dataset, however, the farthest first algorithm gives the lowest percentage of incorrectly classified instances, 26.2145 %.

REFERENCES

1. Han J. and Kamber M. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco.

2. Xiaozhe Wang, Kate Smith and Rob Hyndman. Characteristic-Based Clustering for Time Series Data, Data Mining and Knowledge Discovery, Springer Science + Business Media, LLC, Manufactured in the United States, pp. 335–364, 2006.

3. Li Wei, Nitin Kumar, Venkata Lolla and Helga Van Herle. A practical tool for visualizing and data mining medical time series, Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems (CBMS'05), pp. 106-125, 2005.

4. Eamonn Keogh, Selina Chu, David Hart and Michael Pazzani. An online algorithm for segmenting time series, 0-7695-1119-8/01 IEEE, 2001.

5. Xiao-Tao Zhang, Wei Zhang and Xiong Xiong. A model based clustering for time-series with irregular interval, Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, pp. 26-29, August 2004.

6. Hui Ding, Goce Trajcevski and Eamonn Keogh. Querying and mining of time series data: Experimental comparison of representations and distance measures, PVLDB '08, August, pp. 23-28, Auckland, New Zealand, 2008.

7. Ehsan Hajizadeh, Hamed Davari Ardakani and Jamal Shahrabi. Application of data mining techniques in stock market, Journal of Economics and International Finance, Vol. 2(7), pp. 109-118, July 2010.

8. Jiangjiao Duan, Wei Wang, Bing Liu and Baile Shi. Incorporating with recursive model training in time series clustering, Proceedings of the 2005 The Fifth International Conference on Computer and Information Technology (CIT'05), IEEE, 2005.

9. Fatih Altiparmak, Hakan Ferhatosmanoglu, Selnur Erdal and Donald C. Trost. Information Mining Over Heterogeneous and High-Dimensional Time-Series Data in Clinical Trials Databases, IEEE Transactions On Information Technology In BioMedicine, Vol. 10, pp. 215-239, April 2006.

10. Jinfei Xie and Wei-Yong Yan. A Qualitative Feature Extraction Method for Time Series Analysis, Proceedings of the 25th Chinese Control Conference, pp. 7–11, August 2006, Harbin, Heilongjiang.

11. Xiaoming Lin, Yuchang Lu and Chunyi Shi. Cluster Time Series Based on Partial Information, IEEE SMC TPUl, pp. 254-262, 2002.

12. Yuan F, Meng Z. H, Zhang H. X and Dong C. R. A New Algorithm to Get the Initial Centroids, Proc. of the 3rd International Conference on Machine Learning and Cybernetics, pp. 26–29, August 2004.

13. Sun Jigui, Liu Jie and Zhao Lianyu. Clustering algorithms Research, Journal of Software, Vol. 19, No. 1, pp. 48-61, January 2008.

14. Sun Shibao and Qin Keyun. Research on Modified kmeans Data Cluster Algorithm, Computer Engineering, Vol. 33, No. 13, pp. 200–201, July.

15. Merz C and Murphy P. UCI Repository of Machine Learning databases, Available: ftp://ftp.ics.uci.edu/pub/machinelearning-databases

16. Fahim A M, Salem A M and Torkey F A. An efficient enhanced k-means clustering algorithm, Journal of Zhejiang University Science A, Vol. 10, pp. 1626-1633, July.

@ 2012, IJCCN All Rights Reserved
