Clustering Faster and Better with Projected Data

ABSTRACT
The K-means clustering algorithm can be slow to converge, especially for large, high-dimensional datasets with many clusters. By applying several enhancements it is possible to improve performance without significantly changing the quality of the clustering. In this paper we first find a good clustering in a reduced-dimension version of the dataset, then fine-tune the clustering in the original dimension. This saves time because accelerated K-means algorithms are fastest in low dimension, and the initial low-dimensional clustering brings us close to a good solution for the original data. We use random projection to reduce the dimension, as it is fast and preserves the cluster structure we want to keep. In our experiments, this approach significantly reduces the time needed to cluster a dataset and in most cases produces better results.
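To make the pipeline concrete, here is a minimal sketch in Python with scikit-learn. It is an illustration of the project-cluster-refine idea under stated assumptions, not the paper's exact implementation: the function name `projected_kmeans`, the Gaussian projection, and the choice `proj_dim=50` are ours, and we use scikit-learn's stock `KMeans` rather than a specific accelerated variant.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection

def projected_kmeans(X, n_clusters, proj_dim=50, seed=0):
    """Sketch: cluster X in a random low-dimensional projection,
    then refine the result with K-means in the original space."""
    # Step 1: Johnson-Lindenstrauss-style random projection to proj_dim.
    projector = GaussianRandomProjection(n_components=proj_dim, random_state=seed)
    X_low = projector.fit_transform(X)

    # Step 2: run K-means cheaply in the low-dimensional space.
    km_low = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_low)

    # Step 3: lift the low-dimensional assignments back to full dimension
    # by averaging each cluster's points in the original space.
    # (Assumes no cluster came back empty; a robust version would reseed those.)
    centers = np.vstack([X[km_low.labels_ == c].mean(axis=0)
                         for c in range(n_clusters)])

    # Step 4: fine-tune in the original dimension from those centers.
    return KMeans(n_clusters=n_clusters, init=centers, n_init=1).fit(X)
```

The projection dimension is the main tuning knob here: Johnson-Lindenstrauss-type bounds suggest it only needs to grow logarithmically with the number of points to roughly preserve pairwise distances, which is why the low-dimensional pass can be so much cheaper than clustering the raw data.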