DOI: 10.1145/3546157.3546158
Research article

Clustering Faster and Better with Projected Data

Published: 22 August 2022

ABSTRACT

The K-means clustering algorithm can take a long time to converge, especially on large, high-dimensional datasets with many clusters. Several enhancements can improve its performance without significantly changing the quality of the clustering. In this paper we first find a good clustering in a reduced-dimension version of the dataset, then fine-tune that clustering in the original dimension. This saves time because accelerated K-means algorithms are fastest in low dimension, and the initial low-dimensional clustering brings us close to a good solution for the original data. We use random projection to reduce the dimension, as it is fast and preserves the cluster structure we want to maintain. In our experiments, this approach significantly reduces the time needed to cluster a dataset and in most cases produces better results.
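To make the pipeline concrete, here is a minimal sketch of the project-cluster-refine idea using scikit-learn. It is not the authors' implementation: the function name project_then_refine, the target dimension of 50, and the 20-iteration fine-tuning budget are illustrative assumptions, and scikit-learn's standard K-means stands in for whatever accelerated variant the paper uses.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection


def project_then_refine(X, n_clusters, target_dim=50, seed=0):
    """Cluster X in a random low-dimensional projection, then refine
    the clustering with K-means in the original dimension.

    target_dim and the fine-tuning budget below are illustrative
    assumptions, not values taken from the paper.
    """
    X = np.asarray(X)

    # Step 1: random projection to target_dim dimensions. Pairwise
    # distances, and hence cluster structure, are approximately
    # preserved (Johnson-Lindenstrauss), which is why the projected
    # clustering is a useful starting point.
    X_low = GaussianRandomProjection(
        n_components=target_dim, random_state=seed
    ).fit_transform(X)

    # Step 2: run full K-means in the cheap low-dimensional space.
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(X_low)

    # Step 3: lift the clustering back to the original space. Each
    # cluster's mean over the original points seeds a short
    # fine-tuning run in the full dimension.
    init_centers = np.vstack(
        [X[labels == c].mean(axis=0) for c in range(n_clusters)]
    )
    return KMeans(
        n_clusters=n_clusters, init=init_centers, n_init=1, max_iter=20
    ).fit(X)
```

In this sketch the low-dimensional run does the heavy lifting; the full-dimensional run needs only a few iterations because the lifted centers already start near a good solution, which is the source of the speedup the abstract describes.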


Published in

ICISDM '22: Proceedings of the 6th International Conference on Information System and Data Mining
May 2022, 144 pages
ISBN: 9781450396257
DOI: 10.1145/3546157
Copyright © 2022 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
