Abstract
Machine learning is widely used in structural biology: large conformational ensembles originating from a single protein structure (e.g. derived from NMR experiments or molecular dynamics simulations) can be analysed by partitioning the original dataset into sensible subsets, revealing important structural and dynamic behaviours. Clustering is a well-suited unsupervised approach for these ensembles, as it identifies stable conformations and the driving characteristics shared by the different structures. A common problem of applications that implement protein clustering is scalability, in particular with respect to loading the data into memory. In this work we show how the parallel performance of the GROMOS clustering algorithm can be improved by using Hadoop. The preliminary results support the validity of this approach and suggest a direction for future development in this field.
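For readers unfamiliar with GROMOS clustering, the following is a minimal serial sketch of the greedy procedure as it is commonly described: count, for each structure, the neighbours within an RMSD cutoff, take the structure with the most neighbours as a cluster centre, remove that cluster from the pool, and repeat. It is not the authors' Hadoop implementation, which the abstract does not detail. The function name `gromos_cluster`, the tie-breaking rule, and the toy distance matrix are assumptions made for illustration.

```python
def gromos_cluster(dist, cutoff):
    """Greedy GROMOS-style clustering on a symmetric distance matrix.

    dist   -- square list of lists; dist[i][j] is the RMSD between structures i and j
    cutoff -- two structures are neighbours when their RMSD is <= cutoff
    Returns a list of (centre_index, sorted_member_indices) in extraction order.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        # Neighbour lists are recomputed over the structures still in the pool.
        neighbours = {
            i: [j for j in remaining if dist[i][j] <= cutoff]
            for i in remaining
        }
        # Centre = structure with the most neighbours.
        # Ties go to the lowest index (an assumption of this sketch).
        centre = min(remaining, key=lambda i: (-len(neighbours[i]), i))
        members = sorted(neighbours[centre])
        clusters.append((centre, members))
        # Remove the whole cluster (centre included, since dist[i][i] == 0).
        remaining -= set(members)
    return clusters


# Toy example: 1-D positions stand in for structures, |a - b| stands in for RMSD.
positions = [0, 1, 2, 10, 11, 50]
dist = [[abs(a - b) for b in positions] for a in positions]
clusters = gromos_cluster(dist, cutoff=1)
```

In this sketch the full pairwise distance matrix is held in memory, which is the kind of data-load cost the abstract identifies as the scalability bottleneck for large ensembles.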
Acknowledgments
This paper has been supported by the Italian Ministry of Education and Research (MIUR) through the Flagship (PB05) InterOmics, HIRMA (RBAP11YS7K), and the European MIMOMICS projects.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Paschina, G., Roverelli, L., D’Agostino, D., Chiappori, F., Merelli, I. (2016). Clustering Protein Structures with Hadoop. In: Angelini, C., Rancoita, P., Rovetta, S. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2015. Lecture Notes in Computer Science, vol 9874. Springer, Cham. https://doi.org/10.1007/978-3-319-44332-4_11
DOI: https://doi.org/10.1007/978-3-319-44332-4_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44331-7
Online ISBN: 978-3-319-44332-4
eBook Packages: Computer Science, Computer Science (R0)