Abstract
Standard k-means clustering algorithms are widely used to partition a given data set into k disjoint subsets. When a data set is large-scale and high-dimensional sparse, such as text data in a bag-of-words representation, it is not obvious which representations should be adopted for the data and mean sets. Moreover, algorithms that differ only in their representations require different elapsed times to converge, even when they start from an identical initial state and execute an identical number of similarity calculations, which is a conventional indicator of speed performance. We design sparse k-means clustering algorithms that utilize distinct representations, each of which is a pair of a data structure and an expression. Our purpose is to clarify the cause of their performance differences and to identify the best algorithm when they are executed on a modern computer system. We analyze the algorithms with a simple yet practical clock-cycles-per-instruction (CPI) model expressed as a linear combination of four performance degradation factors in a modern computer system: the number of completed instructions, the level-1 and last-level cache misses, and the branch mispredictions. We also optimize the model parameters with a newly introduced procedure and demonstrate that CPIs calculated with our model agree well with experimental results when the algorithms are applied to large-scale, high-dimensional real document data sets. Furthermore, our model clarifies that the best algorithm among them suppresses all of the performance degradation factors: the cache misses, the branch mispredictions, and the completed instructions.
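The linear CPI model described above can be sketched as follows; the weight values and counter names here are illustrative assumptions for the sketch, not the paper's fitted parameters.

```python
# Sketch of the linear CPI model from the abstract: predicted clock cycles
# per instruction as a weighted sum of per-instruction event rates.
# The weights w0..w3 are illustrative placeholders, not fitted values;
# w0 is the baseline term (algorithm-dependent in the paper's model).

def predict_cpi(counters, w0=0.4, w1=0.005, w2=0.03, w3=0.2):
    """counters: dict of raw event counts from a profiler such as Linux perf."""
    instr = counters["instructions"]
    return (w0
            + w1 * counters["l1_misses"] / instr              # level-1 cache misses
            + w2 * counters["llc_misses"] / instr             # last-level cache misses
            + w3 * counters["branch_mispredictions"] / instr)  # branch mispredictions

# Hypothetical counter readings for one k-means run.
sample = {
    "instructions": 1_000_000_000,
    "l1_misses": 40_000_000,
    "llc_misses": 5_000_000,
    "branch_mispredictions": 10_000_000,
}
print(predict_cpi(sample))
```

Fitting the weights to measured counter data (the paper introduces its own optimization procedure for this) then lets the model attribute elapsed time to each degradation factor.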
Notes
If mean feature vectors are not normalized by their \(L_2\) norms, i.e., if they are not points on the unit hypersphere, a solution obtained by the spherical k-means algorithm does not always coincide with one obtained by the standard k-means algorithm.
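The note above can be illustrated with a toy example: for an unnormalized mean vector, the cosine-based (spherical) assignment and the Euclidean (standard) assignment can disagree. The vectors below are arbitrary values chosen for the illustration.

```python
import numpy as np

# An object on the unit circle and two candidate centroids:
# c1 is unit-norm, c2 is nearly parallel to x but far from the sphere.
x = np.array([1.0, 0.0])
c1 = np.array([0.6, 0.8])
c2 = np.array([3.0, 0.1])

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Cosine similarity (spherical k-means) prefers c2 ...
print(cos_sim(x, c1), cos_sim(x, c2))
# ... while Euclidean distance (standard k-means) prefers c1.
print(np.linalg.norm(x - c1), np.linalg.norm(x - c2))
```

Once the means are renormalized onto the unit hypersphere, the two criteria rank centroids identically, which is why the equivalence holds only under \(L_2\) normalization.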
Even if the algorithms start from an identical initial state, they may reach different solutions when an object has identical similarities to multiple centroids. To avoid this problem, our algorithms adopt a tie-breaking rule: an object belongs to the cluster whose centroid has the smallest ID.
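The tie-breaking rule in the note above can be sketched in a few lines; the similarity values are made up for the illustration. Conveniently, `np.argmax` already returns the first (smallest-index) maximum, which implements the smallest-ID rule directly.

```python
import numpy as np

# Similarities between one object and four centroids (IDs 0..3).
# Centroids 1 and 2 are tied at the maximum similarity.
similarities = np.array([0.3, 0.9, 0.9, 0.5])

# np.argmax returns the index of the FIRST maximum, so ties are
# broken in favor of the centroid with the smallest ID.
assigned = int(np.argmax(similarities))
print(assigned)  # → 1, the smaller of the two tied IDs
```

A deterministic tie-break like this guarantees that algorithms differing only in representation traverse identical sequences of cluster assignments, which is what makes their elapsed times directly comparable.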
In our preliminary experiments, a mean-update step using object feature vectors with an inverted-file data structure required much more CPU time than one using the standard data structure.
IVFD differs from IVF in where in the source code the final assignment of each object to a cluster is executed: IVFD performs the assignment outside the triple loop, whereas IVF performs it inside.
We assumed that \(w_0\) depends on the algorithm, i.e., \(w_{0[algo]}\), so that \(w_0\) accounts for the clock cycles caused by delay factors other than the foregoing DFs.
Regarding memory consumption, all algorithms except IVF required memory proportional to k due to the full expression of the means. The required memory for NYT reached 79.2 GB at \(k=20000\), while IVF used only 3.5 GB.
In both algorithms, the instructions executed in the triple loop were identical in the corresponding assembly code.
Actually, since the order of terms sorted by the number of centroids does not always match that sorted by the number of objects, the numbers of centroids and objects do not both decrease monotonically (Fig. 10b).
Analysis of the IVFD and IVF assembly codes showed that both algorithms used an identical number of instructions for each multiplication and addition operation.
Acknowledgements
This work was partly supported by JSPS KAKENHI Grant Number JP17K00159.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Aoyama, K., Saito, K. & Ikeda, T. CPI-model-based analysis of sparse k-means clustering algorithms. Int J Data Sci Anal 12, 229–248 (2021). https://doi.org/10.1007/s41060-021-00270-4