Clustering Protein Structures with Hadoop

Paschina, Giacomo; Roverelli, Luca; D’Agostino, Daniele; Chiappori, Federica; Merelli, Ivan

doi:10.1007/978-3-319-44332-4_11

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9874)

Included in the following conference series:

International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics

Abstract

Machine learning is widely used in structural biology: large conformational ensembles originating from single protein structures (e.g. derived from NMR experiments or molecular dynamics simulations) can be analysed by partitioning the original dataset into sensible subsets, which reveals important structural and dynamic behaviours. Clustering is a suitable unsupervised approach for such ensembles, as it identifies stable conformations and the driving characteristics shared by the different structures. A common problem of applications that implement protein clustering is performance scalability, in particular when loading the data into memory. In this work we show how the parallel performance of the GROMOS clustering algorithm can be improved by using Hadoop. The preliminary results support the validity of this approach and provide a hint for future development in this field.
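
For orientation, the GROMOS clustering scheme named in the abstract is usually described as a greedy procedure: count, for every structure, the neighbours within an RMSD cutoff; take the structure with the most neighbours as a cluster centre; remove it together with its neighbours; repeat on what is left. The following is a minimal single-node sketch of that scheme, not the authors' Hadoop implementation (which is not shown on this page). The function name `gromos_cluster`, the toy one-dimensional coordinates, and the cutoff value are assumptions made for illustration only. The sketch keeps the full pairwise matrix in memory, which is the step the abstract identifies as the scalability bottleneck.

```python
import numpy as np

def gromos_cluster(rmsd, cutoff):
    """Greedy GROMOS-style clustering on a symmetric pairwise RMSD matrix.

    Hypothetical helper for illustration; returns (centre, members) pairs.
    """
    rmsd = np.asarray(rmsd, dtype=float)
    # neighbours[i, j] is True when structure j lies within the cutoff of i
    # (the diagonal is True because every structure is its own neighbour).
    neighbours = rmsd <= cutoff
    active = np.ones(len(rmsd), dtype=bool)  # structures not yet assigned
    clusters = []
    while active.any():
        # Count only neighbours that are still unassigned.
        counts = (neighbours & active).sum(axis=1)
        counts[~active] = -1  # assigned structures cannot become centres
        centre = int(np.argmax(counts))
        members = np.flatnonzero(neighbours[centre] & active)
        clusters.append((centre, members.tolist()))
        active[members] = False  # remove the whole cluster from the pool
    return clusters

# Toy data (assumption): six "structures" on a line, with the absolute
# coordinate difference standing in for the pairwise RMSD.
coords = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 9.0])
rmsd = np.abs(coords[:, None] - coords[None, :])
clusters = gromos_cluster(rmsd, cutoff=0.15)
for centre, members in clusters:
    print(f"centre {centre}: members {members}")
```

In a MapReduce setting, the neighbour-counting step is the natural candidate for distribution, since each row of the matrix can be counted independently; how the chapter actually partitions the work is not stated in the abstract.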

Acknowledgments

This paper has been supported by the Italian Ministry of Education and Research (MIUR) through the Flagship (PB05) InterOmics, HIRMA (RBAP11YS7K), and the European MIMOMICS projects.

Author information

Authors and Affiliations

Institute of Applied Mathematics and Information Technologies “E. Magenes”, National Research Council of Italy, Genoa, Italy
Giacomo Paschina, Luca Roverelli & Daniele D’Agostino
Institute of Biomedical Technologies, National Research Council of Italy, Segrate, MI, Italy
Federica Chiappori & Ivan Merelli

Corresponding author

Correspondence to Ivan Merelli.

Editor information

Editors and Affiliations

CNR, Istituto per le Applicazioni del Calcolo, Naples, Italy
Claudia Angelini
Center for Statistics in the Biomedical Sciences, Vita-Salute San Raffaele University, Milano, Italy
Paola MV Rancoita
DIBRIS, University of Genoa, Genova, Italy
Stefano Rovetta

About this paper

Cite this paper

Paschina, G., Roverelli, L., D’Agostino, D., Chiappori, F., Merelli, I. (2016). Clustering Protein Structures with Hadoop. In: Angelini, C., Rancoita, P., Rovetta, S. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2015. Lecture Notes in Computer Science, vol 9874. Springer, Cham. https://doi.org/10.1007/978-3-319-44332-4_11

DOI: https://doi.org/10.1007/978-3-319-44332-4_11
Published: 31 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44331-7
Online ISBN: 978-3-319-44332-4
eBook Packages: Computer Science, Computer Science (R0)
