ParSMURF-NG: A Machine Learning High Performance Computing System for the Analysis of Imbalanced Big Omics Data

  • Conference paper
Artificial Intelligence Applications and Innovations. AIAI 2022 IFIP WG 12.5 International Workshops (AIAI 2022)

Abstract

In Genomic and Precision Medicine, prediction problems are often characterized by both Big Data and a high imbalance between classes. This calls for specialized tools, since traditional Machine Learning approaches may struggle with big datasets and often fail to predict the minority class in imbalanced classification problems.

In this work we present ParSMURF-NG, a High Performance Computing-oriented Machine Learning approach designed to scale well on big omics data. We measured its performance capabilities on three current-generation HPC systems and we showed its usefulness in the context of Genomic Medicine, providing a powerful model for the detection of pathogenic single nucleotide variants in the non-coding regions of the human genome.

Code and Data Availability

ParSMURF-NG is distributed as source code and is available at https://github.com/AnacletoLAB/parSMURF-NG

Acknowledgements

This study was developed and performed in the context of the project “ParBigMen: ParSMURF application to Big genomic and epigenomic data for the detection of pathogenic variants in Mendelian diseases”. The project was awarded by the Partnership for Advanced Computing in Europe (PRACE) in its 21st Call for Proposals. We acknowledge PRACE for awarding us access to SuperMUC-NG at LRZ, Germany.

Corresponding author

Correspondence to Alessandro Petrini.

Copyright information

© 2022 IFIP International Federation for Information Processing

About this paper

Cite this paper

Petrini, A. et al. (2022). ParSMURF-NG: A Machine Learning High Performance Computing System for the Analysis of Imbalanced Big Omics Data. In: Maglogiannis, I., Iliadis, L., Macintyre, J., Cortez, P. (eds) Artificial Intelligence Applications and Innovations. AIAI 2022 IFIP WG 12.5 International Workshops. AIAI 2022. IFIP Advances in Information and Communication Technology, vol 652. Springer, Cham. https://doi.org/10.1007/978-3-031-08341-9_34

  • DOI: https://doi.org/10.1007/978-3-031-08341-9_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08340-2

  • Online ISBN: 978-3-031-08341-9

  • eBook Packages: Computer Science, Computer Science (R0)