Abstract
In proteomics, in-depth analysis of the 3D structure of a protein is of paramount importance for many biological studies and applications. At the secondary level, protein structure can be described in terms of motifs: recurrent patterns of smaller biological structures called secondary structure elements. This paper focuses on identifying geometrical motifs in different proteins using the Cross Motif Search (CMS) algorithm. Because CMS has a high computational cost compared with traditional alignment algorithms, this task is very demanding and parallel processing is mandatory. Previous papers have already studied the parallelization of CMS from the HPC standpoint. Since cloud computing is emerging as an alternative to on-premise HPC systems, it is worthwhile to examine the feasibility and the possible advantages, in terms of both performance and cost, of migrating to a cloud implementation. This paper extends preliminary work on the cloud parallelization of CMS and makes two main contributions. First, it describes an analytic model of the communication pattern of CMS, in order to gain insight into the performance of the application when executed on a cloud infrastructure. Second, it introduces an optimized “location-aware” scheduling policy for assigning workload to the application workers, designed to minimize internode communication in a cloud setting. Experiments are presented to validate the newly introduced scheduling policy and to assess the performance of the cloud implementation of CMS. The results are general, in the sense that they can be applied to any other algorithm whose communication pattern is similar to that of the target application.
Change history
17 June 2019
Mirto Musci was not listed among the authors. The original article has been corrected.
Notes
Note that 1k32 and 1bgl share identical chains; using a priori biological information would greatly reduce the computational time. As stated before, however, CMS only focuses on geometrical information.
Cite this article
Ferretti, M., Santangelo, L. & Musci, M. Optimized cloud-based scheduling for protein secondary structure analysis. J Supercomput 75, 3499–3520 (2019). https://doi.org/10.1007/s11227-019-02859-w
DOI: https://doi.org/10.1007/s11227-019-02859-w