Efficient 3D Protein Structure Alignment on Large Hadoop Clusters in Microsoft Azure Cloud

Małysiak-Mrozek, Bożena; Daniłowicz, Paweł; Mrozek, Dariusz

doi:10.1007/978-3-319-99987-6_3

Part of the book series: Communications in Computer and Information Science (CCIS, volume 928)

Included in the following conference series:

International Conference: Beyond Databases, Architectures and Structures

Abstract

Exploration of 3D protein structures offers broad potential for applications in medical diagnostics, drug design, and the treatment of patients. 3D protein structure similarity searching is one of the important exploration processes performed in structural bioinformatics. However, the process is time-consuming and requires substantial computational resources when performed against large repositories. In this paper, we show that 3D protein structure similarity searching can be significantly accelerated by using modern processing techniques and computer architectures. Our experiments show that by distributing computations on large Hadoop/HBase (HDInsight) clusters, and scaling them out and up in the Microsoft Azure public cloud, we can reduce the execution times of similarity search processes from hundreds of hours to minutes. We also show that public clouds are well suited to scientific computations and can be successfully applied to scale time-consuming computations over large volumes of biological data.

Acknowledgments

This work was supported by Microsoft Research within the Microsoft Azure for Research Award grant, and by the Statutory Research funds of the Institute of Informatics, Silesian University of Technology, Gliwice, Poland (grant No. BK/213/RAU2/2018).

Author information

Authors and Affiliations

Institute of Informatics, Silesian University of Technology, ul. Akademicka 16, 44-100, Gliwice, Poland
Bożena Małysiak-Mrozek, Paweł Daniłowicz & Dariusz Mrozek

Corresponding author

Correspondence to Dariusz Mrozek.

Editor information

Editors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Stanisław Kozielski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Dariusz Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Paweł Kasprowski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Bożena Małysiak-Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Daniel Kostrzewa

Cite this paper

Małysiak-Mrozek, B., Daniłowicz, P., Mrozek, D. (2018). Efficient 3D Protein Structure Alignment on Large Hadoop Clusters in Microsoft Azure Cloud. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Facing the Challenges of Data Proliferation and Growing Variety. BDAS 2018. Communications in Computer and Information Science, vol 928. Springer, Cham. https://doi.org/10.1007/978-3-319-99987-6_3

DOI: https://doi.org/10.1007/978-3-319-99987-6_3
Published: 31 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99986-9
Online ISBN: 978-3-319-99987-6
eBook Packages: Computer Science (R0)
