Abstract
The particle swarm optimization K-Means algorithm was proposed to improve the clustering accuracy of the K-Means algorithm. However, it adds computational burden, and its efficiency is low on large data sets. To solve this problem, a parallel particle swarm K-Means algorithm based on MapReduce with multi-threading is proposed. The algorithm parallelizes the computation by dividing the particle swarm into several equal-sized sub-populations, according to the number of available nodes in the cluster, and distributing them across the nodes. It uses multi-threaded execution in the evaluation stage, which has the highest computational complexity in the evolutionary process. Experiments show that, although splitting the population affects the optimization to some extent, the proposed algorithm can still effectively optimize the clustering results of the K-Means algorithm, and its computational efficiency is significantly improved compared with the serial particle swarm optimization K-Means algorithm and the MapReduce-based non-multithreaded particle swarm optimization K-Means algorithm. In the experiment with the largest dataset and a configuration of 16 nodes, the proposed algorithm is 58 times faster than the serial algorithm. Furthermore, the computing efficiency can be improved further in clusters with more CPU cores.
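The two ideas named in the abstract, splitting the swarm into equal-sized sub-populations and evaluating particles on several threads, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the sum-of-squared-errors fitness, and the thread count are assumptions made for illustration, and the loop over sub-populations stands in for the map tasks that would run on separate cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_swarm(swarm, num_nodes):
    """Split the swarm into num_nodes equal-sized sub-populations.

    Assumption of this sketch: the swarm size is divisible by num_nodes.
    """
    if len(swarm) % num_nodes != 0:
        raise ValueError("swarm size must be divisible by the number of nodes")
    size = len(swarm) // num_nodes
    return [swarm[i * size:(i + 1) * size] for i in range(num_nodes)]


def sse_fitness(centroids, points):
    """Sum of squared distances from each point to its nearest centroid.

    Hypothetical fitness function; the abstract does not say which one is used.
    """
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
        )
    return total


def evaluate_subpopulation(particles, points, num_threads=4):
    """Evaluate every particle of one sub-population on a thread pool."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda particle: sse_fitness(particle, points), particles))


# Toy data: four 2-D points and a swarm of four particles,
# where each particle encodes a candidate set of two centroids.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
swarm = [
    [(0.0, 0.5), (10.0, 10.5)],
    [(0.0, 0.0), (10.0, 10.0)],
    [(5.0, 5.0), (6.0, 6.0)],
    [(0.0, 0.0), (1.0, 1.0)],
]

# One sub-population per (simulated) node; here they are processed in turn.
sub_populations = split_swarm(swarm, num_nodes=2)
fitness = [evaluate_subpopulation(sub, points) for sub in sub_populations]
```

In CPython, threads do not speed up pure-Python arithmetic because of the global interpreter lock, so this sketch shows only the structure of the evaluation stage; a real deployment would rely on a runtime with truly parallel threads or on vectorized native code.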









Data availability
The datasets employed in the experiments are publicly accessible through the UCI Machine Learning Repository. These datasets are available for non-commercial use and can be found at http://archive.ics.uci.edu/ml/index.php.
Change history
31 May 2024
A Correction to this paper has been published: https://doi.org/10.1007/s10586-024-04572-7
Funding
This work is partially supported by the National Natural Science Foundation of China (62276032).
Author information
Contributions
The authors confirm their contributions to the paper as follows: study conception and design: Xikang Wang, Tongxi Wang; data collection: Hua Xiang; analysis and interpretation of results: Xikang Wang; draft manuscript preparation: Xikang Wang, Tongxi Wang. All authors reviewed the results and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors have no relevant financial or non-financial interests to disclose.
Additional information
The original online version of this article was revised: section headings in the article were formatted incorrectly; they are now formatted correctly.
About this article
Cite this article
Wang, X., Wang, T. & Xiang, H. A multi-threaded particle swarm optimization-kmeans algorithm based on MapReduce. Cluster Comput 27, 8031–8044 (2024). https://doi.org/10.1007/s10586-024-04456-w