Skip to main content

The Use of Distributed Data Storage and Processing Systems in Bioinformatic Data Analysis

  • Conference paper
  • First Online:
Beyond Databases, Architectures and Structures. Facing the Challenges of Data Proliferation and Growing Variety (BDAS 2018)

Abstract

The cancer and the cancer mortality may seem the sign of the present times. This leads hundreds of scientists to handle the issue of finding significant premises of cancer occurrence. In this paper a set of data mining tasks is defined that joins the observed genes mutation with the specific cancer type observation. Due to the high computational complexity of this kind of data a Hadoop ecosystem cluster was developed to perform the required calculations. The results may be satisfactory in the domains of distributed data storage (processing) and the genes mutation occurrence interpretation.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Falco repository. https://github.com/VCCRI/Falco/. Accessed 11 Dec 2017

  2. The Cancer Genome Atlas. https://cancergenome.nih.gov/

  3. Ashburner, M., et al.: Gene ontology: tool for the unification of biology. Nat. Genet. 25(1), 25–29 (2000)

    Article  Google Scholar 

  4. Buchfink, B., Xie, C., Huson, D.: Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015)

    Article  Google Scholar 

  5. Gao, S., Li, L., Li, W., Janowicz, K., Zhang, Y.: Constructing gazetteers from volunteered big geo-data based on Hadoop. Comput. Environ. Urban Syst. 61(Part B), 172–186 (2017)

    Article  Google Scholar 

  6. Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. SIGOPS Oper. Syst. Rev. 37(5), 29–43 (2003)

    Article  Google Scholar 

  7. Hanahan, D., Weinberg, R.: Hallmarks of cancer: the next generation. Cell 144(5), 646–674 (2011)

    Article  Google Scholar 

  8. Knijnenburg, T.A., Bismeijer, T., et al.: A multilevel pan-cancer map links gene mutations to cancer hallmarks. Chin. J. Cancer 34(3), 439–449 (2015)

    Article  Google Scholar 

  9. Li, K.B.: ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19(12), 1585–1586 (2003)

    Article  Google Scholar 

  10. Mrozek, D., Gosk, P., Małysiak-Mrozek, B.: Scaling ab initio predictions of 3D protein structures in Microsoft Azure Cloud. J. Grid Comput. 13(4), 561–585 (2015)

    Article  Google Scholar 

  11. Mrozek, D., Kłapciński, A., Małysiak-Mrozek, B.: Orchestrating task execution in Cloud4PSi for scalable processing of macromolecular data of 3D protein structures. In: Nguyen, N.T., Tojo, S., Nguyen, L.M., Trawiński, B. (eds.) ACIIDS 2017. LNCS (LNAI), vol. 10192, pp. 723–732. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54430-4_69

    Chapter  Google Scholar 

  12. Natesan, P., Rajalaxmi, R.R., Gowrison, G., Balasubramanie, P.: Hadoop based parallel binary bat algorithm for network intrusion detection. Int. J. Parallel Program. 45(5), 1194–1213 (2017)

    Article  Google Scholar 

  13. Sandholm, T., Lai, K.: MapReduce optimization using regulated dynamic prioritization. SIGMETRICS Perform. Eval. Rev. 37(1), 299–310 (2009)

    Google Scholar 

  14. Sarnovsky, M., Butka, P., Huzvarova, A.: Twitter data analysis and visualizations using the R language on top of the Hadoop platform. In: IEEE 15th International Symposium on Applied Machine Intelligence and Informatics, pp. 327–331 (2017)

    Google Scholar 

  15. Schaefer, C.F., Anthony, K., et al.: PID: the pathway interaction database. Nucleic Acids Res. 37(Suppl. 1), D674–D679 (2009)

    Article  Google Scholar 

  16. Schnase, J.L., Duffy, D.Q., et al.: MERRA analytic services: meeting the big data challenges of climate science through cloud-enabled climate analytics-as-a-service. Comput. Environ. Urban Syst. 61(B), 198–211 (2017)

    Article  Google Scholar 

  17. Shah, S.P., Huang, Y., Xu, T., et al.: Atlas-a data warehouse for integrative bioinformatics. BMC Bioinform. 6(1), 34 (2005)

    Article  Google Scholar 

  18. Thompson, J.D., Higgins, D.G., Gibson, T.J.: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22(22), 4673–4680 (1994)

    Article  Google Scholar 

  19. Thoralf, T.T., Kormeier, B., Klassen, A., Hofestädt, R.: BioDWH: a data warehouse kit for life science data integration. J. Integr. Bioinform. 5(2), 49–57 (2008)

    Google Scholar 

  20. Wan, S., Zou, Q.: HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing. Algorithms Mol. Biol. 12(1), 25 (2017)

    Article  Google Scholar 

  21. White, T.: The Definitive Guide. O’Reilly Media, Newton (2009)

    Google Scholar 

  22. Yang, A., Troup, M., Lin, P., Ho, J.: Falco: a quick and flexible single-cell RNA-seq processing framework on the cloud. Bioinformatics 33(5), 767–769 (2017)

    Google Scholar 

  23. Yang, M., Mei, H., Huang, D.: An effective detection of satellite images via k-means clustering on Hadoop system. Int. J. Innov. Comput. Inf. Control 13(3), 1037–1046 (2017)

    Google Scholar 

  24. Yu, J., Blom, J., Sczyrba, A., Goesmann, A.: Rapid protein alignment in the cloud: HAMOND combines fast DIAMOND alignments with Hadoop parallelism. J. Biotechnol. 257(Suppl. C), 58–60 (2017)

    Article  Google Scholar 

  25. Zou, Q., Hu, Q., et al.: HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics 31(15), 2475–2481 (2015)

    Article  Google Scholar 

Download references

Acknowledgements

This work was partially supported by Polish National Centre for Research and Development (NCBiR) within the programme Prevention and Treatment of Civilization Diseases—STRATEGMED III.

Grant No. STRATEGMED3/304586/5/NCBR/2017 (PersonALL). The work was carried out in part (especially the participation of the fifth author) within the statutory research project of the Institute of Informatics, BK-213/RAU2/2018.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Marcin Michalak .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Bochenek, M. et al. (2018). The Use of Distributed Data Storage and Processing Systems in Bioinformatic Data Analysis. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Facing the Challenges of Data Proliferation and Growing Variety. BDAS 2018. Communications in Computer and Information Science, vol 928. Springer, Cham. https://doi.org/10.1007/978-3-319-99987-6_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-99987-6_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99986-9

  • Online ISBN: 978-3-319-99987-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics