A Novel Method of Estimating the Number of Clusters in a Dataset

Reza Zafarani and Ali A. Ghorbani
Faculty of Computer Science, University of New Brunswick, Fredericton, NB, Canada
{r.zafarani, ghorbani}@unb.ca


                       What is Clustering?

•   The unsupervised division of patterns (data points, feature
    vectors, instances, ...) into groups of similar objects.
•   Objects in the same group are more similar to each other, whereas
    objects in different groups are more dissimilar.

                       The Need for Clustering

•   Analysis of data / finding (dis)similar data.
•   The need for abstraction / reducing redundancy.
•   Alleviating the effect of low computational power.
•   Cost efficiency and business use cases.

                       Challenges in Clustering

•   Dynamism
•   Validity
•   High dimensions (the curse of dimensionality)
•   Subjectivity
       e.g., the set {ship, bird, fish} can be clustered in two different ways.
•   Large datasets
•   Complexity
•   Proximity measures
•   Initial conditions
•   Ensemble techniques

                       The Oracle of Clustering

Definition:
•   A function, called the Oracle, that can predict whether two random
    data points belong to the same cluster or not:

        O(x_1, x_2) = \begin{cases} 1 & \text{if } x_1 \text{ and } x_2 \text{ belong to the same cluster} \\ 0 & \text{otherwise} \end{cases}

                       Oracle Approximation

•   Thresholding:
    •   A simple yet effective approach to predicting the Oracle is to
        apply a threshold to the similarity function between data points:

        O(x_1, x_2) = \begin{cases} 1 & \text{if } \sigma(x_1, x_2) \ge \text{threshold} \\ 0 & \text{otherwise} \end{cases}

    •   A justifiable threshold is a linear combination of the mean and
        standard deviation of the pairwise similarities in the data.
    •   To make this more accurate, dimensionality reduction methods
        can be applied to the data first.
•   Ensemble Clustering:
    •   Two points are considered to be in the same cluster if a
        majority of different clustering algorithms place them in the
        same cluster.
    •   This method can be computationally inefficient.

                       Discussions

•   Given this Oracle function, clustering can be done in O(mn) time,
    where m is the number of clusters and n is the number of data points.
•   The Oracle function can be approximated for different clustering
    algorithms (preferably those with quadratic running times).
•   If this approximation can take place, their running times can be
    reduced to O(mn).
•   Transfer learning can be used to approximate this Oracle for
    algorithms with quadratic running times (the relation between the
    Oracles of different clustering algorithms can be learned or
    approximated).

                       Future Work

•   A reasonable way to predict the number of clusters (m) from these
    probabilities is to use methods from partition theory together with
    methods from convex optimization.
•   Gap statistics are another area worth investigating for detecting
    the number of clusters; gap-statistic methods can be used to refine
    the threshold values used in Oracle prediction.
•   The transfer learning function can be learned, and the optimal
    conditions under which it is learnable can be discovered.
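The thresholding approximation of the Oracle can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function names, the similarity function, and the weight `alpha` on the standard deviation are assumptions.

```python
import statistics

def build_threshold_oracle(points, similarity, alpha=1.0):
    """Approximate the Oracle by thresholding a similarity function.

    The threshold is a linear combination of the mean and standard
    deviation of the pairwise similarities, as the poster suggests.
    """
    sims = [similarity(p, q)
            for i, p in enumerate(points)
            for q in points[i + 1:]]
    threshold = statistics.mean(sims) + alpha * statistics.stdev(sims)

    def oracle(x1, x2):
        # 1 if the two points are predicted to share a cluster, 0 otherwise.
        return 1 if similarity(x1, x2) >= threshold else 0

    return oracle
```

For instance, with one-dimensional points and similarity defined as negative absolute distance, nearby points fall above the threshold while distant ones fall below it.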

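The ensemble variant of the Oracle (majority vote over several clusterings) admits an equally short sketch; representing each clustering as a list of labels aligned by point index is an assumption made for illustration.

```python
def ensemble_oracle(labelings):
    """Majority-vote Oracle over several clusterings.

    labelings: one list of cluster labels per algorithm, so that
    labelings[k][i] is point i's label under algorithm k.
    """
    def oracle(i, j):
        votes = sum(1 for labels in labelings if labels[i] == labels[j])
        # Same cluster iff a strict majority of the clusterings agree.
        return 1 if 2 * votes > len(labelings) else 0

    return oracle
```

Note that building the labelings requires running every base algorithm first, which is the source of the computational inefficiency mentioned above.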

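The O(mn) claim can be made concrete: each point is compared against one stored representative per discovered cluster, so each of the n points costs at most m Oracle queries. A minimal sketch, assuming a reliable Oracle (the function names are illustrative):

```python
def cluster_with_oracle(points, oracle):
    """Cluster n points with at most m Oracle queries per point: O(m*n)."""
    representatives = []  # one stored representative per discovered cluster
    labels = []
    for p in points:
        for c, rep in enumerate(representatives):
            if oracle(rep, p):
                labels.append(c)
                break
        else:
            # No existing cluster accepts p: open a new one.
            representatives.append(p)
            labels.append(len(representatives) - 1)
    return labels
```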

                       Related Work

•   Early literature in the area of dynamic clustering attempted to
    solve this by running algorithms for several values of K (the
    number of clusters).
•   The best K among them is determined based on some coefficient or
    statistic.
•   The distance between two cluster centroids, normalized by the
    clusters' standard deviations, can be used as a coefficient.
•   The silhouette coefficient compares the average distance between
    points in the same cluster with the average distance between
    points in different clusters.
•   These coefficients are plotted as a function of K (the number of
    clusters), and the best K is selected.
•   Probabilistic measures that determine the best model in mixture
    models can also be used.
•   In this setting, an optimal K corresponds to the best-fitting
    model. Well-known criteria include BIC, MDL, and MML.

             Predicting the Number of Clusters Using the Oracle

Monte Carlo Sampling:
•   Given this Oracle function and using Monte Carlo sampling of it,
    the probability of two random points being in the same cluster can
    be estimated:

        P(r = 1) = \frac{\sum_{i=1}^{m} \binom{a_i}{2}}{\binom{n}{2}}

•   a_i, m, and n are the size of cluster i, the number of clusters,
    and the dataset size, respectively.
•   The sampling is controlled using Chernoff bounds:

        P(|p_N - p| \ge \epsilon) \le 2 \exp(-2 \epsilon^2 N)

•   where p is the actual probability, p_N is the estimate after N
    samples, and ε and δ are two prefixed constants.
•   Requiring 2 exp(-2 ε² N) ≤ δ gives the needed sample size
    N ≥ ln(2/δ) / (2 ε²).
•   For example, N ≈ 26,500 if ε = 0.01 and δ = 0.01.
•   This problem links this area to research avenues in partition
    theory, and more specifically, to variations of Kloosterman sums
    and summand distributions in integer partitions.

                       References

•   Milligan, G., Cooper, M.: An examination of procedures for
    determining the number of clusters in a data set. Psychometrika
    50(2) (1985) 159-179
•   Pelleg, D., Moore, A.: X-means: Extending K-means with efficient
    estimation of the number of clusters. Proceedings of the 17th
    International Conf. on Machine Learning (2000) 727-734
•   Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of
    clusters in a data set via the gap statistic. Journal of the Royal
    Statistical Society, Series B (Statistical Methodology) 63(2)
    (2001) 411-423
•   Guan, Y., Ghorbani, A., Belacel, N.: Y-means: A clustering method
    for intrusion detection. Proceedings of the Canadian Conference on
    Electrical and Computer Engineering (2003)
•   Erdős, P., Lehner, J.: The distribution of the number of summands
    in the partitions of a positive integer. Duke Math. J. 8(2) (1941)
    335-345
•   Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge
    University Press (2004)
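The Monte Carlo estimate with its Chernoff-style sample size can be sketched as follows. Solving 2·exp(−2ε²N) ≤ δ for N gives N ≥ ln(2/δ)/(2ε²), which for ε = δ = 0.01 is 26,492, consistent with the ≈26,500 figure on the poster. The Oracle and the point representation here are illustrative assumptions.

```python
import math
import random

def required_samples(eps, delta):
    # Solve 2 * exp(-2 * eps**2 * N) <= delta for N.
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def estimate_same_cluster_prob(points, oracle, eps=0.01, delta=0.01, rng=random):
    """Monte Carlo estimate of P(r = 1), querying the Oracle on random pairs."""
    n_samples = required_samples(eps, delta)
    hits = sum(oracle(*rng.sample(points, 2)) for _ in range(n_samples))
    return hits / n_samples
```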

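For a fixed partition into clusters of sizes a_1, ..., a_m, the same-cluster probability can also be computed exactly, which is useful for checking a Monte Carlo estimate. A sketch, assuming two distinct points drawn uniformly without replacement (the function name is illustrative):

```python
def same_cluster_prob(cluster_sizes):
    # P(r = 1) = sum_i C(a_i, 2) / C(n, 2) for two distinct uniform points;
    # the factors of 1/2 in the binomials cancel.
    n = sum(cluster_sizes)
    return sum(a * (a - 1) for a in cluster_sizes) / (n * (n - 1))
```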



                                                         Intelligent and Adaptive Systems Research Group
