Abstract
The revolution in digital and communication technologies is producing enormous amounts of data. Classical data is therefore turning into big data, and mining techniques face challenges of high computational cost, performance, and scalability. The K-means (KM) algorithm is the most widely used partitional clustering approach; its behavior depends on the number of clusters K, the initial centroids, the distance measure, and the central-tendency statistic used to update centers. Because of the gradient-descent nature of KM, the choice of initial centroids determines its computational effectiveness, efficiency, and susceptibility to local optima in big data clustering. Existing centroid initialization algorithms achieve low cluster quality at high computational cost because of repeated iterations, distance computations, and data and result comparisons. To overcome these deficiencies, this paper presents the Maxmin Distance Sort Heuristic (MDSH) algorithm for big data clustering based on a stratified sampling process. The performance of the resulting MDSHKM algorithm is compared with the KM and KM++ algorithms using the R-squared, Root-Mean-Square Standard Deviation, Davies–Bouldin, Calinski–Harabasz, and Silhouette Coefficient validity indices, together with the number of iterations and CPU time, on eight real datasets. The experimental evaluation shows that MDSHKM achieves better cluster quality, lower computing cost, higher efficiency, and more stable convergence than KM and KM++.
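To make the idea concrete, the sketch below illustrates the general family of methods the abstract describes: maxmin (farthest-point) seeding computed on a stratified sample, followed by K-means with those seeds and evaluation using the standard validity indices named above. This is a minimal illustration under stated assumptions (synthetic data, strata formed along one attribute, the classic maxmin rule), not the authors' exact MDSH procedure; the helper names `stratified_sample` and `maxmin_seeds` are hypothetical.

```python
# Minimal sketch (assumptions): generic maxmin seeding on a stratified sample,
# then K-means with those seeds and standard cluster-validity indices.
# This is NOT the authors' exact MDSH algorithm.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (davies_bouldin_score,
                             calinski_harabasz_score,
                             silhouette_score)

def stratified_sample(X, n_strata=10, per_stratum=100, seed=0):
    """Sample rows from equal-width strata along the first feature (illustrative choice)."""
    rng = np.random.default_rng(seed)
    order = np.argsort(X[:, 0])                      # order points by one attribute to form strata
    strata = np.array_split(order, n_strata)
    idx = np.concatenate([rng.choice(s, size=min(per_stratum, len(s)), replace=False)
                          for s in strata])
    return X[idx]

def maxmin_seeds(S, k):
    """Classic maxmin heuristic: each new seed is the sample point farthest from the chosen seeds."""
    first = S[np.argmax(np.linalg.norm(S - S.mean(axis=0), axis=1))]  # deterministic start: far from the mean
    seeds = [first]
    for _ in range(1, k):
        C = np.asarray(seeds)
        d = np.min(np.linalg.norm(S[:, None, :] - C[None, :, :], axis=2), axis=1)
        seeds.append(S[np.argmax(d)])                # pick the point with the largest min-distance
    return np.asarray(seeds)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 8))                   # stand-in for a real dataset
    k = 5
    seeds = maxmin_seeds(stratified_sample(X), k)
    km = KMeans(n_clusters=k, init=seeds, n_init=1, random_state=0).fit(X)
    labels = km.labels_
    print("iterations        :", km.n_iter_)
    print("Davies-Bouldin    :", davies_bouldin_score(X, labels))
    print("Calinski-Harabasz :", calinski_harabasz_score(X, labels))
    print("Silhouette        :", silhouette_score(X, labels))
```

Seeding on a sample rather than the full dataset keeps the maxmin distance scans off the complete data, which is the motivation the abstract gives for combining the heuristic with stratified sampling.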
Cite this article
Pandey, K.K., Shukla, D. Maxmin distance sort heuristic-based initial centroid method of partitional clustering for big data mining. Pattern Anal Applic 25, 139–156 (2022). https://doi.org/10.1007/s10044-021-01045-0