Min max kurtosis distance based improved initial centroid selection approach of K-means clustering for big data mining on gene expression data

  • Original Paper
  • Journal: Evolving Systems

Abstract

Genome clustering is a big data application that helps identify the prognosis of severe diseases and the biological processes at work across very large sets of genes. K-Means (KM) is the most commonly used clustering algorithm for gene expression data; it extracts hidden knowledge, patterns and trends from gene expression profiles to support decision-making. However, KM is extremely sensitive to initial centroid selection: the initial centroids influence its effectiveness, efficiency, computational cost and tendency to converge to local optima. Existing centroid initialization algorithms incur high computational complexity on high-dimensional data because of extensive iterations, distance computations, and comparisons of data and results. To overcome these weaknesses, this study proposes the Min–Max Kurtosis Distance (MMKD) algorithm for big data clustering in a single-machine environment. The MMKD algorithm addresses the KM weaknesses by using the distance between the origin and the data points in the minimum- and maximum-kurtosis dimensions. The proposed algorithm is compared with the KM, KM++, ADV, MKM, Mean-KM, NFD, K-MAM, NRKM2, FMNN and MuKM algorithms on sixteen gene expression datasets, using internal and external validation metrics for effectiveness together with efficiency measurements. The experimental evaluation shows that, compared with the other algorithms, K-means initialized with MMKD (MMKDKM) requires fewer iterations, is less prone to local optima, lowers computation cost, and improves clustering effectiveness and efficiency with stable convergence. Statistical analysis confirms that the improvement achieved by the MMKDKM algorithm is significant.
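The abstract does not spell out the MMKD procedure, so the following is only an illustrative sketch of one plausible reading of it, not the authors' algorithm. It assumes that the two dimensions with the smallest and largest kurtosis are selected, that each point is scored by its Euclidean distance from the origin within those two dimensions, and that the points, sorted by that score, are split into K equal groups whose means serve as initial centroids. The function names `kurtosis` and `initial_centroids` and the equal-split rule are assumptions made for this example.

```python
import math


def kurtosis(values):
    """Pearson kurtosis m4 / m2^2 of a list of numbers (0.0 if variance is zero)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) if m2 > 0 else 0.0


def initial_centroids(data, k):
    """Deterministic initial centroids from a kurtosis-guided distance ordering.

    Assumed reading of the abstract, not the published MMKD algorithm.
    data: list of equal-length numeric rows; k: number of clusters.
    """
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # Kurtosis of every dimension (column).
    n_dims = len(data[0])
    kurt = [kurtosis([row[j] for row in data]) for j in range(n_dims)]
    d_min = kurt.index(min(kurt))  # dimension with minimum kurtosis
    d_max = kurt.index(max(kurt))  # dimension with maximum kurtosis

    # Order points by distance from the origin within those two dimensions.
    order = sorted(
        range(n),
        key=lambda i: math.hypot(data[i][d_min], data[i][d_max]),
    )

    # Split the ordering into k contiguous groups; each group mean is a centroid.
    centroids = []
    for g in range(k):
        group = [data[i] for i in order[g * n // k:(g + 1) * n // k]]
        centroids.append([sum(col) / len(group) for col in zip(*group)])
    return centroids
```

Because the sketch involves no random sampling, repeated calls on the same data return the same centroids, which is the property that makes deterministic initialization attractive compared with random seeding. For example, `initial_centroids([[0, 0], [1, 1], [10, 10], [11, 11]], 2)` returns `[[0.5, 0.5], [10.5, 10.5]]`, and those centroids could then be passed to any standard K-means implementation as its starting point.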

Data availability

This study used publicly available datasets for the experiments and result analysis. The sources of the gene expression data are listed in Table 1: datasets DB1 to DB9 are available at https://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi, and DB10 to DB16 are available at https://sbcb.inf.ufrgs.br/cumida.

Author information

Corresponding author

Correspondence to Kamlesh Kumar Pandey.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pandey, K.K., Shukla, D. Min max kurtosis distance based improved initial centroid selection approach of K-means clustering for big data mining on gene expression data. Evolving Systems 14, 207–244 (2023). https://doi.org/10.1007/s12530-022-09447-z

  • DOI: https://doi.org/10.1007/s12530-022-09447-z
