The Use of Distributed Data Storage and Processing Systems in Bioinformatic Data Analysis

Bochenek, Michał; Folkert, Kamil; Jaksik, Roman; Krzesiak, Michał; Michalak, Marcin; Sikora, Marek; Stȩclik, Tomasz; Wróbel, Łukasz

doi:10.1007/978-3-319-99987-6_2

Part of the book series: Communications in Computer and Information Science (CCIS, volume 928)

Included in the following conference series:

International Conference: Beyond Databases, Architectures and Structures

Abstract

Cancer and cancer mortality may seem to be a sign of the present times. This leads hundreds of scientists to search for significant premises of cancer occurrence. In this paper, a set of data mining tasks is defined that links observed gene mutations with the observation of a specific cancer type. Due to the high computational complexity of processing this kind of data, a Hadoop ecosystem cluster was developed to perform the required calculations. The results may be satisfactory in two domains: distributed data storage and processing, and the interpretation of gene mutation occurrence.

Acknowledgements

This work was partially supported by the Polish National Centre for Research and Development (NCBiR) within the programme Prevention and Treatment of Civilization Diseases (STRATEGMED III), Grant No. STRATEGMED3/304586/5/NCBR/2017 (PersonALL). The work was carried out in part (especially the participation of the fifth author) within the statutory research project of the Institute of Informatics, BK-213/RAU2/2018.

Author information

Authors and Affiliations

Institute of Innovative Technologies EMAG, ul. Leopolda 31, 40-189, Katowice, Poland
Marcin Michalak & Tomasz Stȩclik
Institute of Informatics, Silesian University of Technology, ul. Akademicka 16, 44-100, Gliwice, Poland
Marek Sikora & Łukasz Wróbel
Institute of Automatic Control, Silesian University of Technology, ul. Akademicka 16, 44-100, Gliwice, Poland
Roman Jaksik
3 Soft S.A., ul. Porcelanowa 23, 40-246, Katowice, Poland
Michał Bochenek, Kamil Folkert & Michał Krzesiak

Corresponding author

Correspondence to Marcin Michalak.

Editor information

Editors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Stanisław Kozielski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Dariusz Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Paweł Kasprowski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Bożena Małysiak-Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Daniel Kostrzewa

About this paper

Cite this paper

Bochenek, M. et al. (2018). The Use of Distributed Data Storage and Processing Systems in Bioinformatic Data Analysis. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Facing the Challenges of Data Proliferation and Growing Variety. BDAS 2018. Communications in Computer and Information Science, vol 928. Springer, Cham. https://doi.org/10.1007/978-3-319-99987-6_2

DOI: https://doi.org/10.1007/978-3-319-99987-6_2
Published: 31 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99986-9
Online ISBN: 978-3-319-99987-6
eBook Packages: Computer Science, Computer Science (R0)
