Clustering Faster and Better with Projected Data

ABSTRACT
The K-means clustering algorithm can be slow to converge, especially for large, high-dimensional datasets with many clusters. By applying several enhancements it is possible to improve performance without significantly changing the quality of the clustering. In this paper we first find a good clustering in a reduced-dimension version of the dataset, then fine-tune the clustering in the original dimension. This saves time because accelerated K-means algorithms are fastest in low dimension, and the initial low-dimensional clustering brings us close to a good solution for the original data. We use random projection to reduce the dimension, as it is fast and preserves the cluster structure we want to keep. In our experiments, this approach significantly reduces the time needed to cluster a dataset and in most cases produces better results.
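To make the pipeline concrete, here is a minimal sketch in Python with scikit-learn. It is an illustration of the project-cluster-refine idea under stated assumptions, not the paper's exact implementation: the function name `projected_kmeans`, the Gaussian projection, and the choice `proj_dim=50` are ours, and we use scikit-learn's stock `KMeans` rather than a specific accelerated variant.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection

def projected_kmeans(X, n_clusters, proj_dim=50, seed=0):
    """Sketch: cluster X in a random low-dimensional projection,
    then refine the result with K-means in the original space."""
    # Step 1: Johnson-Lindenstrauss-style random projection to proj_dim.
    projector = GaussianRandomProjection(n_components=proj_dim, random_state=seed)
    X_low = projector.fit_transform(X)

    # Step 2: run K-means cheaply in the low-dimensional space.
    km_low = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_low)

    # Step 3: lift the low-dimensional assignments back to full dimension
    # by averaging each cluster's points in the original space.
    # (Assumes no cluster came back empty; a robust version would reseed those.)
    centers = np.vstack([X[km_low.labels_ == c].mean(axis=0)
                         for c in range(n_clusters)])

    # Step 4: fine-tune in the original dimension from those centers.
    return KMeans(n_clusters=n_clusters, init=centers, n_init=1).fit(X)
```

The projection dimension is the main tuning knob here: Johnson-Lindenstrauss-type bounds suggest it only needs to grow logarithmically with the number of points to roughly preserve pairwise distances, which is why the low-dimensional pass can be so much cheaper than clustering the raw data.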