Abstract
In Genomic and Precision Medicine, prediction problems are often characterized by Big Data and a strong imbalance between classes. These problems require specialized tools, because traditional Machine Learning approaches may struggle with large datasets and often fail to predict the minority class in imbalanced classification problems.
In this work we present ParSMURF-NG, a High Performance Computing-oriented Machine Learning approach designed to scale well on big omics data. We measured its performance on three current-generation HPC systems and showed its usefulness in Genomic Medicine, where it provides a powerful model for the detection of pathogenic single nucleotide variants in the non-coding regions of the human genome.
Code and Data Availability
ParSMURF-NG is distributed as source code, and it is available at https://github.com/AnacletoLAB/parSMURF-NG
Acknowledgements
This study was developed and performed in the context of the project “ParBigMen: ParSMURF application to Big genomic and epigenomic data for the detection of pathogenic variants in Mendelian diseases”. The project was awarded by the Partnership for Advanced Computing in Europe (PRACE) in its 21st Call for Proposals. We acknowledge PRACE for awarding us access to SuperMUC-NG at LRZ, Germany.
Copyright information
© 2022 IFIP International Federation for Information Processing
About this paper
Cite this paper
Petrini, A. et al. (2022). ParSMURF-NG: A Machine Learning High Performance Computing System for the Analysis of Imbalanced Big Omics Data. In: Maglogiannis, I., Iliadis, L., Macintyre, J., Cortez, P. (eds) Artificial Intelligence Applications and Innovations. AIAI 2022 IFIP WG 12.5 International Workshops. AIAI 2022. IFIP Advances in Information and Communication Technology, vol 652. Springer, Cham. https://doi.org/10.1007/978-3-031-08341-9_34
DOI: https://doi.org/10.1007/978-3-031-08341-9_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08340-2
Online ISBN: 978-3-031-08341-9
eBook Packages: Computer Science (R0)