Abstract
Genomics studies increasingly have to deal with datasets that contain high variation between the sequenced nucleotide chains. This is most common in metagenomics and polyploid studies, where the biological nature of the samples requires analysis of multiple variants of nearly identical sequences. High variation makes it harder to determine the correct nucleotide sequences and to distinguish signal from noise, so the digital results have higher error rates than those achievable for samples with low variation. This paper presents an original, purely machine learning-based approach for detecting and potentially correcting those errors. It uses a generic machine learning model that can be applied to different types of sequencing data with minor modifications. As presented in a separate part of this work, these models can be applied to a set of error candidates chosen by data-specific selection for a more refined error discovery, but, as shown here, they can also be used independently.
Notes
1. The study used a probabilistic error simulation based on estimated error rates.
2. Since SHREC has no parameters to tune, our models were tuned to match or exceed SHREC’s detection rates for the comparison. Each experiment therefore used a different tuning, and the figures are not directly comparable across experiments.
References
Allen-Vercoe, E., Petrof, E.O.: The microbiome: what it means for medicine. Br. J. Gen. Pract. 64(620), 118–119 (2014)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Brenchley, R., et al.: Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature 491(7426), 705–710 (2012)
Gilles, A., Meglécz, E., Pech, N., Ferreira, S., Malausa, T., Martin, J.F.: Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics 12, 245 (2011)
Huse, S., Huber, J., Morrison, H., Sogin, M., Welch, D.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 8(7), R143 (2007)
Karlsson, O.E., Hansen, T., Knutsson, R., Löfström, C., Granberg, F., Berg, M.: Metagenomic detection methods in biopreparedness outbreak scenarios. Biosecurity Bioterrorism Biodefense Strategy Pract. Sci. 11(S1), S146–S157 (2013)
Katoh, K., Kuma, K., Toh, H., Miyata, T.: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33(2), 511–518 (2005)
Kau, A.L., et al.: Human nutrition, the gut microbiome, and immune system: envisioning the future. Nature 474(7351), 327–336 (2011)
Kirov, K., Krachunov, M., Kulev, O., Nisheva, M., Vassilev, D.: Reducing false negatives for errors in SNP detection using a machine learning approach. Comptes rendus de l’Académie bulgare des Sciences 69(2), 155–160 (2016)
Krachunov, M., Nisheva, M., Vassilev, D.: Machine learning models in error and variant detection high-variation high-throughput sequencing datasets. Procedia Comput. Sci. 108C, 1145–1154 (2017)
Krachunov, M., Vassilev, D.: An approach to a metagenomic data processing workflow. J. Comput. Sci. 5, 357–362 (2014)
Kristensen, D., Mushegian, A., Dolja, V., Koonin, E.: New dimensions of the virus world discovered through metagenomics. Trends Microbiol. 18(1), 11–19 (2010)
Kunin, V., Engelbrektson, A., Ochman, H., Hugenholtz, P.: Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ. Microbiol. 12(1), 118–123 (2010)
Laver, T., et al.: Assessing the performance of the Oxford Nanopore Technologies MinION. Biomol. Detect. Quantification 3, 1–8 (2015)
Li, R.W. (ed.): Metagenomics and its Applications in Agriculture, Biomedicine and Environmental Studies. Nova Science Pub Inc. (2010)
Li, W., Godzik, A.: Cd-Hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)
Marcussen, T., et al.: Ancient hybridizations among the ancestral genomes of bread wheat. Science 345(6194), 286–291 (2014)
Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for Next-Generation Sequencing data. Genomics 95(6), 315–327 (2010)
Nelson, K., White, B.: Metagenomics and its applications to the study of the human microbiome. In: Metagenomics: Theory, Methods and Applications, pp. 171–182 (2010)
Qi, Y.: Random forest for bioinformatics. In: Zhang, C., Ma, Y. (eds.) Ensemble Machine Learning, pp. 307–323. Springer, Boston (2012). https://doi.org/10.1007/978-1-4419-9326-7_11
Rojas, R.: Neural Networks: A Systematic Introduction. Springer, Heidelberg (1996). https://doi.org/10.1007/978-3-642-61068-4
Saei, A.A., Barzegari, A.: The microbiome: the forgotten organ of the astronaut’s body–probiotics beyond terrestrial limits. Future Microbiol. 7(9), 1037–1046 (2012)
Schröder, J., Schröder, H., Puglisi, S.J., Sinha, R., Schmidt, B.: SHREC: a short-read error correction method. Bioinformatics 25(17), 2157–2163 (2009)
United Nations, Food and Agriculture Organization, S.D.F. Crops/World total/Wheat/Area harvested (2014). https://web.archive.org/web/20150906230329/http://faostat.fao.org/site/567/DesktopDefault.aspx?PageID=567. Accessed 25 June 2018
Valverde, J., Mellado, R.: Analysis of metagenomic data containing high biodiversity levels. PLoS ONE 8(3) (2013). Article no. e58118
Witten, I.H., Frank, E., Hall, M.A.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann Publishers, San Francisco (2011)
Acknowledgements
The presented work has been funded by the Bulgarian NSF within the “GloBIG: A Model of Integration of Cloud Framework for Hybrid Massive Parallelism and its Application for Analysis and Automated Semantic Enhancement of Big Heterogeneous Data Collections” project, Contract DN02/9 of 17.12.2016, and by the Sofia University SRF within the “Models for semantic integration of biomedical data” project, Contract 80-10-207 of 26.04.2018.
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Krachunov, M., Nisheva, M., Vassilev, D. (2018). Machine Learning-Driven Noise Separation in High Variation Genomics Sequencing Datasets. In: Agre, G., van Genabith, J., Declerck, T. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2018. Lecture Notes in Computer Science, vol 11089. Springer, Cham. https://doi.org/10.1007/978-3-319-99344-7_16
DOI: https://doi.org/10.1007/978-3-319-99344-7_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99343-0
Online ISBN: 978-3-319-99344-7
eBook Packages: Computer Science (R0)