Abstract
Biological interaction databases store information about interacting proteins or genes. Clustering the networks formed from this interaction information to find highly connected regions can reveal functional affinities or structural similarities between protein or gene entities. As the amount of information in these databases keeps growing, the runtime of a clustering task becomes increasingly unaffordable. In this paper, we propose a heterogeneous parallel algorithm that accelerates clustering tasks on distributed CPU–GPU clusters. Our parallel implementation is based on the original serial Markov clustering (MCL) algorithm and uses both CPUs and GPUs to exploit the power of heterogeneous platforms. Using the BioGRID biological interaction database, we tested the proposed algorithm on a computer cluster equipped with NVIDIA Tesla P100 GPU accelerators. The results show that the algorithm is efficient in GPU memory usage and inter-node data transmission, and that it can complete the clustering task in 3.2 minutes, with a best speedup of 70.02 times over the serial counterpart. We believe our work can provide key insights for realizing fast MCL analyses of large-scale biological data on distributed CPU–GPU computer clusters.
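For readers unfamiliar with the serial algorithm that the parallel implementation builds on, the following is a minimal sketch of the standard MCL loop (expansion by matrix squaring, then inflation by element-wise powering and column re-normalization). It is a dense, pure-Python illustration only, not the authors' CPU–GPU implementation; the function names, the inflation value of 2.0, the tolerance, and the toy graph are all assumptions made for this example.

```python
def normalize_columns(m):
    """Scale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        if s > 0:
            for i in range(n):
                m[i][j] /= s
    return m


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation step: raise entries to the power r, then re-normalize."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(edges, n, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster an undirected graph with n nodes (hypothetical helper)."""
    m = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        m[a][b] = m[b][a] = 1.0
    for i in range(n):
        m[i][i] = 1.0  # self-loops, as commonly added before running MCL
    m = normalize_columns(m)

    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        diff = max(abs(nxt[i][j] - m[i][j])
                   for i in range(n) for j in range(n))
        m = nxt
        if diff < tol:  # stop once the matrix no longer changes
            break

    # Each row with non-negligible entries lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Toy graph: two disconnected triangles.
    toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
    print(mcl(toy_edges, 6))  # prints [[0, 1, 2], [3, 4, 5]]
```

The dense triple loop in `expand` costs O(n^3) per iteration, which is why practical implementations on large interaction networks rely on sparse matrix formats and parallel hardware instead.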
Acknowledgements
The authors would like to thank all the reviewers for their valuable comments. This work is supported by the National Key Research and Development Program of China (No. 2017YFB0202002).
About this article
Cite this article
Fu, Y., Zhou, W. A heterogeneous parallel implementation of the Markov clustering algorithm for large-scale biological networks on distributed CPU–GPU clusters. J Supercomput 78, 9017–9037 (2022). https://doi.org/10.1007/s11227-021-04204-6
DOI: https://doi.org/10.1007/s11227-021-04204-6