Clustering Protein Structures with Hadoop

  • Conference paper
Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2015)

Abstract

Machine learning is widely used in structural biology: large conformational ensembles originating from a single protein structure (e.g. derived from NMR experiments or molecular dynamics simulations) can be analysed by partitioning the original dataset into sensible subsets, revealing important structural and dynamic behaviours. Clustering is a well-suited unsupervised approach for dealing with these ensembles of structures, since it identifies stable conformations and the driving characteristics shared by the different structures. A common problem of applications that implement protein clustering is the scalability of their performance, in particular with respect to loading the data into memory. In this work we show how Hadoop can be used to improve the parallel performance of the GROMOS clustering algorithm. The preliminary results support the validity of this approach and provide a hint for future developments in this field.
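For readers unfamiliar with the GROMOS clustering scheme named above, the following is a minimal serial sketch in Python. It is not the Hadoop implementation described in the paper. It assumes a precomputed, symmetric pairwise distance matrix (for example RMSD values) and a user-chosen cutoff. The function name `gromos_cluster`, the use of `<=` for the cutoff test, and the lowest-index tie-break are illustrative assumptions.

```python
def gromos_cluster(dist, cutoff):
    """Greedy neighbour-count clustering in the style of the GROMOS method.

    dist   : square symmetric matrix (list of lists) of pairwise distances,
             e.g. RMSD between structures; dist[i][i] must be 0.
    cutoff : two structures are neighbours if their distance is <= cutoff
             (the inclusive comparison is an assumption of this sketch).
    Returns a list of (center_index, member_indices) tuples.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        pool = sorted(remaining)
        best_center, best_members = None, []
        # Count, for every structure still in the pool, its neighbours
        # that are also still in the pool.
        for i in pool:
            members = [j for j in pool if dist[i][j] <= cutoff]
            # Strict '>' keeps the lowest index when counts are tied.
            if len(members) > len(best_members):
                best_center, best_members = i, members
        # The structure with the most neighbours becomes a cluster center;
        # it and its neighbours are removed, then the search repeats.
        clusters.append((best_center, best_members))
        remaining -= set(best_members)
    return clusters


# Toy usage: five "structures" placed on a line, distance = |x_i - x_j|.
positions = [0.0, 0.1, 0.2, 1.0, 1.1]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
toy_clusters = gromos_cluster(toy_dist, cutoff=0.15)
```

The neighbour count of each structure depends only on its own row of the distance matrix, so that step can in principle be computed independently per structure. The sketch makes no claim about how the paper distributes this work.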

Acknowledgments

This paper has been supported by the Italian Ministry of Education and Research (MIUR) through the Flagship (PB05) InterOmics, HIRMA (RBAP11YS7K), and the European MIMOMICS projects.

Corresponding author

Correspondence to Ivan Merelli.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Paschina, G., Roverelli, L., D’Agostino, D., Chiappori, F., Merelli, I. (2016). Clustering Protein Structures with Hadoop. In: Angelini, C., Rancoita, P., Rovetta, S. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2015. Lecture Notes in Computer Science, vol 9874. Springer, Cham. https://doi.org/10.1007/978-3-319-44332-4_11

  • DOI: https://doi.org/10.1007/978-3-319-44332-4_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-44331-7

  • Online ISBN: 978-3-319-44332-4

  • eBook Packages: Computer Science, Computer Science (R0)
