ISSN 2319-2720
Volume 1, No.2, September – October 2012
Vishal Shrivastava et al, International Journal of Computing, Communications and Networking, 1(2), September – October 2012, 68-74
International Journal of Computing, Communications and Networking
Available Online at http://warse.org/pdfs/ijccn04122012.pdf
A Study of Various Clustering Algorithms on Retail Sales Data
Vishal Shrivastava (1), Prem narayan Arya (2)
(1) M.Tech. (Software Systems), SATI, Vidisha, India. shrvishal@gmail.com
(2) Asst. Prof., Dept. of Computer Applications, SATI, Vidisha, India. Premnarayan.arya@rediffmail.com

ABSTRACT

Data mining is the process of extracting hidden knowledge from databases. Clustering is one of the important functionalities of data mining. Clustering is an adaptive methodology in which objects are grouped together on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. Various clustering algorithms have been developed, resulting in better performance on datasets for clustering. This paper discusses four major clustering algorithms (K-Means, Density based, Filtered and Farthest First clustering) and compares their performance in terms of their ability to build clusters that correctly match the classes. The results are tested on a retail sales dataset using the WEKA interface, computing the proportion of correctly clustered instances to incorrectly clustered ones. A comparison of these four algorithms is given on the basis of the percentage of incorrectly classified instances.

Keywords: Data mining, Clustering, k means, Retail Sales.

1. INTRODUCTION

The process of knowledge discovery executes as an iterative sequence of steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge presentation. Data mining functionalities include characterization and discrimination, mining of frequent patterns, association, correlation, classification and prediction, cluster analysis, outlier analysis and evolution analysis. Clustering is the process of grouping the data into classes or clusters, so that objects within a cluster have high similarity in comparison to one another and are very dissimilar to objects in other clusters. Dissimilarity is assessed on the basis of the attribute values that describe the objects.

The objects are grouped on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. First the set of data is partitioned into groups on the basis of data similarity (e.g. by clustering), and then labels are assigned to the comparatively small number of groups.

Several clustering techniques exist: partitioning methods, hierarchical methods, density based methods, grid based methods, model based methods, methods for high dimensional data and constraint based clustering. Clustering is also called data segmentation because it partitions large data sets into groups according to their similarity.

Figure 1: Cluster Analysis
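
The grouping principle described in this section (high similarity within a cluster, low similarity between clusters) can be illustrated with a small sketch. It is not taken from the paper: the toy points, the function names and the use of Euclidean distance as the (dis)similarity measure are all assumptions made for illustration only.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def mean_intra_cluster_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [dist(a, b) for points in clusters for a, b in combinations(points, 2)]
    return sum(pairs) / len(pairs)


def mean_inter_cluster_distance(clusters):
    """Average distance between pairs of points from different clusters."""
    pairs = [
        dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return sum(pairs) / len(pairs)


# Hypothetical toy data: two well separated groups of 2-D points.
clusters = [
    [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
]

intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"mean intra-cluster distance: {intra:.3f}")
print(f"mean inter-cluster distance: {inter:.3f}")
```

For a good grouping the intra-cluster distance is small relative to the inter-cluster distance, which is the distance-based counterpart of high intra-class and low inter-class similarity.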

@ 2012, IJCCN All Rights Reserved

Clustering can be used for outlier detection, where outliers are more interesting than common cases, e.g. monitoring of criminal activities in electronic commerce, credit card fraud detection, etc. Clustering is also a pre-processing step for other algorithms such as characterization, attribute subset selection and classification, which then operate on the detected clusters and the selected attributes or features. Research areas include data mining, statistics, machine learning, biology, spatial database technology and marketing. Clustering is a form of unsupervised learning. Unlike classification, it does not rely on predefined classes and class-labelled training examples. Clustering is therefore learning by observation rather than learning by examples.

A "clustering" is a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clusterings can broadly be distinguished into:

Hard clustering: each object belongs to a cluster or not.
Soft clustering (fuzzy clustering): each object belongs to a cluster to a certain degree (such as its degree of similarity to the cluster).

Finer distinctions are also possible, such as:

Strict partitioning clustering: every object belongs to exactly one cluster.
Strict partitioning clustering with outliers: objects can also belong to no cluster, and are then considered outliers.
Overlapping clustering (also called alternative clustering or multi-view clustering): objects may belong to more than one cluster.
Hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster.
Subspace clustering: while an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap.

Typical requirements of clustering in data mining are: (i) scalability, (ii) ability to deal with various types of attributes, (iii) discovery of clusters with different shapes, (iv) minimal requirement for domain knowledge to determine input parameters, (v) ability to deal with noisy data, (vi) incremental clustering and insensitivity to the order of input records, (vii) high dimensionality, (viii) constraint based clustering and (ix) interpretability and usability.

k-Means Clustering

We first determine the number of clusters N and assume initial centroids for these clusters. Any random objects can be taken as the initial centroids, or the first N objects can serve as the initial centroids. The k-Means algorithm (here with k = N) then performs the three steps below until convergence, i.e. it iterates until the assignment is stable (no object moves to a different group):

1. Determine the coordinates of the centroids.
2. Determine the distance of each object from the centroids.
3. Group the objects based on minimum distance (assigning each object to its closest centroid).

This is shown step by step in figure 1.1.

Density based Clustering

To discover clusters of arbitrary shape, density based clustering methods have been developed. These typically regard clusters as dense regions of objects in the data space that are separated by regions of low density.

Filtered Clustering

A filter adds a new nominal attribute representing the cluster assigned to each instance by the specified clustering algorithm. Either the clustering algorithm is built with the first batch of data, or one specifies a serialized clusterer model file to use instead.

Farthest First Clustering

Farthest First is a variant of k-Means that places each cluster centre in turn at the point farthest from the existing cluster centres. This point must lie within the data area. In most cases this greatly speeds up the clustering, since fewer reassignments and adjustments are needed.

2. REVIEW OF LITERATURE

Xiaozhe Wang et al. in 2006 provided a method for clustering time series based on their structural characteristics, i.e. it groups series on the basis of global features extracted from the time series. Global measures describing the time series are obtained by applying statistical methods that capture the following characteristics: trend, nonlinearity, skewness, seasonality, chaos, periodicity, serial correlation, kurtosis and self-similarity. Since the method clusters using extracted global measures, it reduces the dimensionality of the time series and is much less sensitive to noisy data. A search mechanism is then
provided to find the best selection from the feature set that should be used as the clustering inputs [2].

Li Wei et al. in 2005 described a tool for visualizing and data mining medical time series, and observed that the increasing interest in time series data mining has had surprisingly little impact on real world medical applications. Practitioners who work with time series regularly rarely take advantage of the tools that the data mining community has made available. The approach extracts features from a time series of arbitrary length and uses information about the relative frequency of these features to colour an image in a principled way. By observing the similarities and differences within a collection of image bitmaps, a user can quickly find clusters, exceptions and other regularities within the data collection.

An online algorithm for segmenting time series was presented by Eamonn Keogh et al. in 2001. This was the first extensive review and empirical comparison of time series segmentation algorithms from a data mining point of view. It showed that the most popular approach, Sliding Windows, generally produces poor results, and that while the second most popular approach, Top-Down, can produce reasonable results, it is not scalable. In contrast, the least well known approach, Bottom-Up, produces excellent results and scales linearly with the size of the dataset. In addition, the work introduced SWAB, a new online algorithm, which scales linearly with the size of the dataset, requires only constant space and produces high quality approximations of the data [4].

A model based clustering for time series with irregular interval was proposed by Xiao-Tao Zhang et al. in 2004. Clustering problems are central to many knowledge discovery and data mining tasks. However, most existing clustering methods can only work with fixed-interval representations of data patterns, ignoring the variance of the time axis. This work studied the clustering of data patterns that are sampled at irregular intervals. A model-based approach that uses the cepstrum distance metric and the Autoregressive Conditional Duration (ACD) model was proposed. Experimental results on real datasets show that the method is effective in clustering irregularly spaced time series, and the conclusions inferred from the experimental results agree with market microstructure theories.

Hui Ding et al. in 2008 used experiments to compare the representations and distance measures for querying and mining of time series data. They conducted an extensive set of time series experiments, re-implementing 8 different representation methods and 9 similarity measures and their variants, and testing their effectiveness on 38 time series data sets from a wide variety of application perspectives. They provided an outline of these techniques and presented comparative experimental results regarding their effectiveness. Their experiments provided a unified validation of existing achievements, and suggested that certain claims in the literature may be overly optimistic.

Ehsan Hajizadeh et al. in 2010 examined and provided an outline of the application of data mining techniques such as decision trees, association rules, neural networks, factor analysis, etc. in stock markets. The work also identifies progressive applications, known gaps and less explored areas, and determines future work for researchers. It discusses the problems of data mining in finance (stock markets) and the specific requirements for data mining methods, including making interpretations, incorporating relations and probabilistic learning. The data mining techniques discussed improve the performance of pattern discovery methods that deal with numeric and alphanumeric data, structured objects, text and data on a variety of discrete and continuous scales (nominal, order, absolute and so on). It also shows the benefits of using such techniques for stock market forecasting [7].

Jiangjiao Duan et al. in 2005 noted that model-based clustering is one of the most important approaches to time series data mining. However, the clustering process may encounter several problems. A novel time series clustering algorithm incorporating recursive Hidden Markov Model (HMM) training was proposed. It contributes in the following aspects:
1) It recursively trains models and uses this model information in the process of agglomerative hierarchical clustering.
2) It builds HMMs of time series clusters to describe the clusters.
To evaluate the effectiveness of the algorithm, several experiments were conducted on both synthetic and real world data. The results show that this approach can achieve a better correctness rate than the conventional HMM-based clustering algorithm [8].

Information Mining Over Heterogeneous and High-Dimensional Time-Series Data in Clinical Trials Databases
was carried out by Fatih Altiparmak et al. in 2006. They proposed an approach to data mining involving two steps: (i) applying a data mining algorithm over homogeneous subsets of data, and (ii) identifying common or distinct patterns over the information gathered in the first step. This approach is implemented specifically for heterogeneous and high dimensional time series clinical trials data. Using this framework, they propose a new way of utilizing frequent itemset mining, as well as clustering and declustering techniques with better distance metrics for measuring similarity among time series data. By clustering the data, they find groups of analytes (substances present in blood) that are most strongly correlated. Most of these known relationships are verified by the clinical panels and, in addition, they identify novel groups that require further biomedical analysis. A slight change in the method results in an effective declustering of high dimensional time series data, which is then used for "feature selection." Using industry-sponsored clinical trials data sets, they are able to find a smaller set of analytes that effectively models the state of normal health [9].

3. COMPARATIVE PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS

For the performance evaluation of four popular clustering techniques (K-Means clustering, Density based clustering, Filtered clustering and Farthest First clustering), we took a dataset whose attributes are numerical, i.e. continuous, and converted them to the nominal attribute type. The dataset contains 26 attributes; taking the number of bays as the class attribute, the clusters are generated by applying the algorithms listed below using the Weka interface. Weka is a landmark system in the history of the data mining and machine learning research communities, because it is the only toolkit that has gained such widespread adoption and survived for an extended period of time (the first version of Weka was released 11 years ago).

We performed various experiments on the retail sales data, applying the following four clustering methods:

i. Simple kMeans Clustering
ii. Density Based Clustering
iii. Filtered Clustering
iv. Farthest First Clustering

The clustering methods themselves were discussed in the previous section. All of the above methods are applied to the data described below, using the Weka data mining tool, and the results are recorded.

Retail sales are an important topic in the market. Much research has been done in this area to understand market patterns, customer behaviour, demand and supply, inventory, etc. In this work we also use retail sales data. The original data contain 126 attributes and 3067 instances, but in our experiments we use the following 26 attributes with all 3067 instances. The data give the retail sales per capita of different states of America.

Table 1: All Variable Names

S.No.  Variable Name  Definition
1      STATE          State Identifier ICPSR
2      PCTBLK3        % black in 1930
3      PCTURB3        % urban in 1930
4      PCTFRM3        % of land on farms in 1929
5      PCTILL3        % illiterate over age 10, 1930
6      PFORB3         % foreign born
7      NDMTCODE       County code based on ICPSR county code with adjusted values for combined counties
8      AREA           Area in square miles
9      LATITUDE       Latitude of county seat
10     LONGITUD       Longitude of county seat
11     STAB           State Name
12     CNAMESR        County Name
13     PP301019       % population aged 10-19, 1930
14     PP302029       % population aged 20-29, 1930
15     PP303034       % population aged 30-34, 1930
16     PP303544       % population aged 35-44, 1930
17     PP304554       % population aged 45-54, 1930
18     PP305564       % population aged 55-64, 1930
19     PP3065UP       % population aged 65 and over, 1930
20     BAY            Number of Bays in County
21     BEACH          Number of Beaches
22     LAKE           Number of Lakes
23     RRTSAP39       Retail Sales per Capita in 1939 in 1967 $
24     RRTSAP35       Retail Sales per Capita in 1935 in 1967 $
25     RRTSAP33       Retail Sales per Capita in 1933 in 1967 $
26     RRTSAP29       Retail Sales per Capita in 1929 in 1967 $

4. EXPERIMENTAL SIMULATION AND RESULTS

The above four algorithms have source code implementations in Weka, on which the simulations were carried out in order to measure the performance parameters of the algorithms over the dataset. The results are summarized in the following tables and graphs.

Table 2: Cluster Distribution

Name of the Cluster Method  No. of clusters  Instances in cluster 1  Instances in cluster 2
Simple kMeans               2                1675                    1392
Density based Cluster       2                1688                    1379
Filtered Cluster            2                1675                    1392
Farthest First              2                2904                    163

Figure 2: Cluster Distribution (bar chart of the instances in each cluster for Simple kMeans, Density based Cluster, Filtered Cluster and Farthest First)

Table 3: Clustering Performance

Method                 No of clusters  Incorrectly Clustered instances  % of Incorrectly Clustered instances
Simple kmeans          4               1500                             48.9077
Simple kmeans          3               1436                             46.821
Simple kmeans          2               964                              31.4314
Density Based Cluster  2               965                              31.464
Filtered Cluster       2               964                              31.4314
Farthest First         2               804                              26.2145
Farthest First         3               816                              26.6058
Farthest First         4               814                              26.5406

Figure 3: Clustering Performance (series: No of cluster, % of Incorrectly Clustered instances)
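
The percentages in the clustering performance table follow directly from the counts of incorrectly clustered instances and the 3067 instances of the dataset. The short sketch below recomputes them; the counts are copied from the table, while the variable and function names are hypothetical and the script is only an arithmetic check, not part of the Weka experiments.

```python
# Total number of instances in the retail sales dataset, as stated in the paper.
TOTAL_INSTANCES = 3067

# (method, number of clusters, incorrectly clustered instances), from the table.
REPORTED = [
    ("Simple kMeans", 4, 1500),
    ("Simple kMeans", 3, 1436),
    ("Simple kMeans", 2, 964),
    ("Density Based Cluster", 2, 965),
    ("Filtered Cluster", 2, 964),
    ("Farthest First", 2, 804),
    ("Farthest First", 3, 816),
    ("Farthest First", 4, 814),
]


def percent_incorrect(incorrect, total=TOTAL_INSTANCES):
    """Percentage of instances that were clustered incorrectly."""
    return 100.0 * incorrect / total


# Recompute the percentage column, rounded to four decimals as in the table.
rows = [(m, k, n, round(percent_incorrect(n), 4)) for m, k, n in REPORTED]

# The configuration with the lowest error percentage.
best = min(rows, key=lambda r: r[3])

for method, k, n, pct in rows:
    print(f"{method:<22} clusters={k}  incorrect={n:>4}  {pct:.4f} %")
print(f"lowest error: {best[0]} with {best[1]} clusters ({best[3]:.4f} %)")
```

The recomputed values agree with the reported column, and the lowest percentage is the Farthest First run with 2 clusters.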

Figure 4: Clustering Performance (series: No of cluster, Incorrectly Clustered instances, % of Incorrectly Clustered instances)

5. CONCLUSION

The performance of a clustering method is measured by the percentage of incorrectly classified instances: the lower this percentage, the better the clustering performance. Farthest First clustering gives better performance than k-Means clustering, Density based clustering and Filtered clustering. Its result is also almost independent of the number of clusters (26.2 to 26.6 % for 2 to 4 clusters), whereas the result of the k-Means algorithm is highly dependent on the number of clusters (31.4 to 48.9 %). Farthest First clustering gives a fast analysis with respect to time, although its error rate remains comparatively high. We can see that the Farthest First algorithm gives the lowest percentage of incorrectly classified instances, 26.2145 %.

REFERENCES

1. Han J. and Kamber M. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, 2000.

2. Xiaozhe Wang, Kate Smith and Rob Hyndman. Characteristic-Based Clustering for Time Series Data, Data Mining and Knowledge Discovery, Springer Science + Business Media, LLC, Manufactured in the United States, pp. 335–364, 2006.

3. Li Wei, Nitin Kumar, Venkata Lolla and Helga Van Herle. A practical tool for visualizing and data mining medical time series, Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems (CBMS'05), pp. 106-125, 2005.

4. Eamonn Keogh, Selina Chu, David Hart and Michael Pazzani. An online algorithm for segmenting time series, 0-7695-1119-8/01 IEEE, 2001.

5. Xiao-Tao Zhang, Wei Zhang and Xiong Xiong. A model based clustering for time-series with irregular interval, Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, pp. 26-29, August 2004.

6. Hui Ding, Goce Trajcevski and Eamonn Keogh. Querying and mining of time series data: Experimental comparison of representations and distance measures, PVLDB '08, August, pp. 23-28, 2008, Auckland, New Zealand, 2008.

7. Ehsan Hajizadeh, Hamed Davari Ardakani and Jamal Shahrabi. Application of data mining techniques in stock market, Journal of Economics and International Finance, Vol. 2(7), pp. 109-118, July 2010.

8. Jiangjiao Duan, Wei Wang, Bing Liu and Baile Shi. Incorporating with recursive model training in time series clustering, Proceedings of the Fifth International Conference on Computer and Information Technology (CIT'05), IEEE, 2005.

9. Fatih Altiparmak, Hakan Ferhatosmanoglu, Selnur Erdal and Donald C. Trost. Information Mining Over Heterogeneous and High-Dimensional Time-Series Data in Clinical Trials Databases, IEEE Transactions on Information Technology in Biomedicine, Vol. 10, pp. 215-239, April 2006.

10. Jinfei Xie and Wei-Yong Yan. A Qualitative Feature Extraction Method for Time Series Analysis, Proceedings of the 25th Chinese Control Conference, pp. 7–11, August 2006, Harbin, Heilongjiang.

11. Xiaoming lin, Yuchang Lu and Chunyi Shi. Cluster Time Series Based on Partial Information, IEEE SMC TPUl, pp. 254-262, 2002.

12. Yuan F, Meng Z. H, Zhang H. X and Dong C. R. A New Algorithm to Get the Initial Centroids, Proc. of the 3rd International Conference on Machine Learning and Cybernetics, pp. 26–29, August 2004.

13. Sun Jigui, Liu Jie and Zhao Lianyu. Clustering algorithms Research, Journal of Software, Vol 19, No 1, pp. 48-61, January 2008.

14. Sun Shibao and Qin Keyun. Research on Modified kmeans Data Cluster Algorithm, Computer Engineering, Vol. 33, No. 13, pp. 200–201, July 2007.

15. Merz C and Murphy P. UCI Repository of Machine Learning databases, Available: ftp://ftp.ics.uci.edu/pub/machinelearning-databases

16. Fahim A M, Salem A M and Torkey F A. An efficient enhanced k-means clustering algorithm, Journal of Zhejiang University Science A, Vol. 10, pp. 1626-1633, July 2006.