A Novel Method of Estimating the Number of Clusters in a Dataset

Reza Zafarani and Ali A. Ghorbani
Faculty of Computer Science, University of New Brunswick, Fredericton, NB, Canada
{r.zafarani, ghorbani}@unb.ca

What is Clustering?

Definition:
• The unsupervised division of patterns (data points, feature vectors, instances, …) into groups of similar objects.
• Objects in the same group are more similar to one another, whereas objects in different groups are more dissimilar.

The Need of Clustering

• Analysis of data / finding (dis)similar data.
• Need of abstraction / reducing redundancy.
• Alleviating the effect of low computation power.
• Cost efficiency and business use cases.

Challenges in Clustering

• Dynamism
• Validity
• High dimensions (curse of dimensionality)
• Subjectivity, e.g. the set {ship, bird, fish} can be clustered in two different ways.
• Large data sets
• Complexity
• Proximity measures
• Initial conditions
• Ensemble techniques

Related Work

• Early literature in the area of dynamic clustering has attempted to determine the number of clusters by running algorithms for several values of K (the number of clusters).
• The best K among them is determined based on some coefficients or statistics.
• The distance between two cluster centroids, normalized by the clusters' standard deviation, could be used as a coefficient.
• The silhouette coefficient compares the average distance between points in the same cluster with the average distance between points in different clusters.
• These coefficients are plotted as a function of K and the best K is selected.
• Probabilistic measures that determine the best model in mixture models can also be used.
• In this area, an optimal K corresponds to the best-fitting model. Some well-known criteria in this area are BIC, MDL, and MML.

The Oracle of Clustering

• A function, called the Oracle, that can predict whether two random data points belong to the same cluster or not:

$$O(x_1, x_2) = \begin{cases} 1 & \text{if } x_1 \text{ and } x_2 \text{ belong to the same cluster} \\ 0 & \text{otherwise} \end{cases}$$

Oracle Approximation

• Thresholding:
  • A simple yet effective approach to predicting the Oracle is to threshold the similarity function between the data points:

$$O(x_1, x_2) = \begin{cases} 1 & \text{if } \mathrm{sim}(x_1, x_2) \ge \text{threshold} \\ 0 & \text{otherwise} \end{cases}$$

  • A justifiable threshold could be a linear combination of the mean and standard deviation of the similarities between the data points.
  • To make this more accurate, dimensionality reduction methods can be applied to the data first.
• Ensemble Clustering:
  • Two points are considered to be in the same cluster if a majority of different clustering algorithms consider them to be in the same cluster.
  • This method can be computationally inefficient.

Predicting the Number of Clusters Using the Oracle

Monte Carlo Sampling:
• Given this Oracle function, Monte Carlo sampling of the Oracle can be used to estimate the probability of random points being in the same cluster:

P(r) = [expression garbled in source: a sum over i = 1, …, m of terms in a_i, normalized using n]

• a_i, m, and n are the size of cluster i, the number of clusters, and the dataset size, respectively.
• The sampling is controlled using Chernoff bounds:

$$P\left(|p_N - p| \ge \epsilon\right) \le 2\exp\left(-2\epsilon^2 N\right) \le \delta$$

• Here p is the actual probability, p_N is its estimate from N samples, N is the sample size, and ε and δ are two prefixed constants.
• For example: N ≈ 26500 if ε = 0.01 and δ = 0.01.
• This problem links this area to the research avenues in Partition Theory and, more specifically, to variations of Kloosterman sums and summand distributions in integer partitions.

Discussions

• Given this Oracle function, the clustering can be done in O(mn) time, where m is the number of clusters and n is the number of data points.
• The Oracle function can be approximated for different clustering algorithms (preferably for those with quadratic running times).
• Their running times can be reduced to O(mn) if this approximation can take place.
• Transfer learning can be used to approximate this Oracle for algorithms with quadratic running times (the relation between the oracles of different clustering algorithms can be learnt or approximated).

Future Work

• A reasonable way to predict the number of clusters (m) from these probabilities is to use methods in Partition Theory along with methods in Convex Optimization.
• Gap statistics is another area worth investigating in order to detect the number of clusters here. The methods in gap statistics can be used to refine the threshold values used in oracle prediction.
• The transfer learning function can be learnt, and the optimum conditions under which the function is learnable can be discovered.

References

• Milligan, G., Cooper, M.: An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2) (1985) 159-179.
• Pelleg, D., Moore, A.: X-means: Extending K-means with efficient estimation of the number of clusters. Proceedings of the 17th International Conf. on Machine Learning (2000) 727-734.
• Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 63(2) (2001) 411-423.
• Guan, Y., Ghorbani, A., Belacel, N.: Y-means: A clustering method for intrusion detection. Proceedings of Canadian Conference on Electrical and Computer Engineering (2003).
• Erdos, P., Lehner, J.: The distribution of the number of summands in the partitions of a positive integer. Duke Math. J 8(2) (1941) 335-345.
• Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2004).

Intelligent and Adaptive Systems Research Group
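
The thresholded Oracle, the Monte Carlo estimate, and the Chernoff-controlled sample size described on the poster can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the negative-distance similarity, the coefficient `c` in `mean + c * std`, and the toy 1-D data are all hypothetical choices made for the example.

```python
import math
import random
import statistics


def chernoff_sample_size(eps, delta):
    """Smallest N satisfying 2 * exp(-2 * eps**2 * N) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def make_threshold_oracle(points, similarity, c=0.0, n_pairs=2000, rng=random):
    """Approximate O(x1, x2) by thresholding similarity at mean + c * std.

    The mean and standard deviation are estimated from n_pairs random
    pairs; c is a hypothetical tuning coefficient, not a poster value.
    """
    sims = [similarity(rng.choice(points), rng.choice(points))
            for _ in range(n_pairs)]
    threshold = statistics.mean(sims) + c * statistics.pstdev(sims)
    return lambda x1, x2: 1 if similarity(x1, x2) >= threshold else 0


def estimate_same_cluster_probability(points, oracle, n_samples, rng=random):
    """Monte Carlo estimate of P(two random points share a cluster)."""
    hits = sum(oracle(rng.choice(points), rng.choice(points))
               for _ in range(n_samples))
    return hits / n_samples


def demo(seed=0):
    rng = random.Random(seed)
    # Toy 1-D data (assumed): 60 points in [0, 1] and 40 points in [10, 11].
    points = ([rng.uniform(0.0, 1.0) for _ in range(60)]
              + [rng.uniform(10.0, 11.0) for _ in range(40)])

    def similarity(a, b):
        # Assumed similarity: negative absolute distance.
        return -abs(a - b)

    oracle = make_threshold_oracle(points, similarity, rng=rng)
    n = chernoff_sample_size(eps=0.01, delta=0.01)
    p_hat = estimate_same_cluster_probability(points, oracle, n, rng=rng)
    return n, p_hat
```

With ε = δ = 0.01, `chernoff_sample_size` returns 26492, in line with the N ≈ 26500 quoted on the poster. In this sketch pairs are drawn independently with replacement, so for the toy data the estimate should land near 0.6² + 0.4² = 0.52; a different sampling scheme would change that target value.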
