Abstract
Standard k-means clustering algorithms are widely used to partition a given data set into k disjoint subsets. When a data set is large-scale and high-dimensional sparse, such as text data in a bag-of-words representation, it is not obvious which representations should be adopted for the data and mean sets. Moreover, algorithms that differ only in their representations require different elapsed times to converge, even when they start from an identical initial state and execute an identical number of similarity calculations, which is a conventional indicator of speed performance. We design sparse k-means clustering algorithms that utilize distinct representations, each of which is a pair of a data structure and an expression. Our purpose is to clarify the cause of their performance differences and to identify the best algorithm when they are executed on a modern computer system. We analyze the algorithms with a simple yet practical clock-cycles-per-instruction (CPI) model expressed as a linear combination of four performance degradation factors in a modern computer system: the number of completed instructions, the level-1 and last-level cache misses, and the branch mispredictions. We also optimize the model parameters with a newly introduced procedure and demonstrate that CPIs calculated with our model agree well with experimental results when the algorithms are applied to large-scale, high-dimensional real document data sets. Furthermore, our model clarifies that the best algorithm among them suppresses all of the performance degradation factors: the cache misses, the branch mispredictions, and the completed instructions.
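The linear CPI model described above can be sketched as follows; the weight values and counter names here are illustrative assumptions for the sketch, not the paper's fitted parameters.

```python
# Sketch of the linear CPI model from the abstract: predicted clock cycles
# per instruction as a weighted sum of per-instruction event rates.
# The weights w0..w3 are illustrative placeholders, not fitted values;
# w0 is the baseline term (algorithm-dependent in the paper's model).

def predict_cpi(counters, w0=0.4, w1=0.005, w2=0.03, w3=0.2):
    """counters: dict of raw event counts from a profiler such as Linux perf."""
    instr = counters["instructions"]
    return (w0
            + w1 * counters["l1_misses"] / instr              # level-1 cache misses
            + w2 * counters["llc_misses"] / instr             # last-level cache misses
            + w3 * counters["branch_mispredictions"] / instr)  # branch mispredictions

# Hypothetical counter readings for one k-means run.
sample = {
    "instructions": 1_000_000_000,
    "l1_misses": 40_000_000,
    "llc_misses": 5_000_000,
    "branch_mispredictions": 10_000_000,
}
print(predict_cpi(sample))
```

Fitting the weights to measured counter data (the paper introduces its own optimization procedure for this) then lets the model attribute elapsed time to each degradation factor.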
Notes
If mean feature vectors are not normalized by their \(L_2\) norms, i.e., if they are not points on the unit hypersphere, a solution obtained by the spherical k-means algorithm does not always coincide with one obtained by the standard k-means algorithm.
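The note above can be illustrated with a toy example: for an unnormalized mean vector, the cosine-based (spherical) assignment and the Euclidean (standard) assignment can disagree. The vectors below are arbitrary values chosen for the illustration.

```python
import numpy as np

# An object on the unit circle and two candidate centroids:
# c1 is unit-norm, c2 is nearly parallel to x but far from the sphere.
x = np.array([1.0, 0.0])
c1 = np.array([0.6, 0.8])
c2 = np.array([3.0, 0.1])

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Cosine similarity (spherical k-means) prefers c2 ...
print(cos_sim(x, c1), cos_sim(x, c2))
# ... while Euclidean distance (standard k-means) prefers c1.
print(np.linalg.norm(x - c1), np.linalg.norm(x - c2))
```

Once the means are renormalized onto the unit hypersphere, the two criteria rank centroids identically, which is why the equivalence holds only under \(L_2\) normalization.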
Even if the algorithms start from an identical initial state, they may reach different solutions when an object has identical similarities to multiple centroids. To avoid this problem, our algorithms adopt a tie-breaking rule: an object belongs to the cluster whose centroid has the smallest ID.
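The tie-breaking rule in the note above can be sketched in a few lines; the similarity values are made up for the illustration. Conveniently, `np.argmax` already returns the first (smallest-index) maximum, which implements the smallest-ID rule directly.

```python
import numpy as np

# Similarities between one object and four centroids (IDs 0..3).
# Centroids 1 and 2 are tied at the maximum similarity.
similarities = np.array([0.3, 0.9, 0.9, 0.5])

# np.argmax returns the index of the FIRST maximum, so ties are
# broken in favor of the centroid with the smallest ID.
assigned = int(np.argmax(similarities))
print(assigned)  # → 1, the smaller of the two tied IDs
```

A deterministic tie-break like this guarantees that algorithms differing only in representation traverse identical sequences of cluster assignments, which is what makes their elapsed times directly comparable.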
In our preliminary experiments, a mean-update step using object feature vectors with an inverted-file data structure required much more CPU time than one using the standard data structure.
IVFD differs from IVF in where in the source code the final assignment of each object to a cluster is executed: IVFD performs the assignment outside the triple loop, whereas IVF performs it inside.
We assumed that \(w_0\) depends on the algorithm, i.e., \(w_{0[algo]}\), so that \(w_0\) accounts for the clock cycles caused by delay factors other than the foregoing DFs.
Regarding memory consumption, all algorithms except IVF required memory proportional to k due to the full expression of the means. The required memory for NYT reached 79.2 GB at \(k=20000\), while IVF used only 3.5 GB.
In both algorithms, the instructions executed in the triple loop were identical in the corresponding assembly code.
Actually, since the order of terms sorted by the number of centroids does not always match that sorted by the number of objects, the numbers of centroids and objects do not both decrease monotonically (Fig. 10b).
Analysis of the IVFD and IVF assembly codes showed that both algorithms used an identical number of instructions for each multiplication and addition operation.
Acknowledgements
This work was partly supported by JSPS KAKENHI Grant Number JP17K00159.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Aoyama, K., Saito, K. & Ikeda, T. CPI-model-based analysis of sparse k-means clustering algorithms. Int J Data Sci Anal 12, 229–248 (2021). https://doi.org/10.1007/s41060-021-00270-4