Probabilistic Models for Error Correction of Nonuniform Sequencing Data

Schulz, Marcel H.; Bar-Joseph, Ziv

doi:10.1007/978-3-319-59826-0_6


Abstract

Sequencing error correction has become an important step in the analysis of next-generation sequencing (NGS) datasets, because it improves data quality for downstream applications. In this chapter, we discuss formulations of sequencing read error correction that are based on probabilistic models and can handle datasets with nonuniform read coverage. Nonuniform coverage is common in several applications of NGS, including small RNA and messenger RNA sequencing, metagenomics, metatranscriptomics, and single-cell sequencing. We review popular formulations based on the Hamming graph of k-mers found in sequencing reads. A particular focus is on a more general formulation using hidden Markov models that can also correct insertion and deletion (indel) errors. Correcting reads from experiments with nonuniform coverage is a topic of rising importance in the community.
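To make the Hamming-graph idea concrete, the following is a minimal sketch and not the chapter's implementation. It counts the k-mers in a handful of toy reads and links every pair of k-mers that differ in at most one position. The function names, the toy reads, the choice of k = 4, and the distance threshold of 1 are all illustrative assumptions; the all-pairs comparison is quadratic and only suitable for tiny inputs.

```python
from collections import Counter
from itertools import combinations


def count_kmers(reads, k):
    """Count every k-mer (length-k substring) occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def hamming_distance(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def hamming_graph(kmers, max_dist=1):
    """Adjacency sets linking k-mers that are within max_dist mismatches.

    Hypothetical helper: compares all pairs, so it is a toy, not a scalable method.
    """
    graph = {kmer: set() for kmer in kmers}
    for a, b in combinations(sorted(graph), 2):
        if hamming_distance(a, b) <= max_dist:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Toy input (assumed): three identical reads and one read with a single mismatch.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACCTAC"]
counts = count_kmers(reads, k=4)
graph = hamming_graph(counts, max_dist=1)

# Print each k-mer with its count and its neighbours in the Hamming graph.
for kmer in sorted(graph):
    print(kmer, counts[kmer], sorted(graph[kmer]))
```

In this toy input each rare k-mer is adjacent to a more frequent one, which is the kind of neighbourhood a probabilistic model can reason over when deciding whether a k-mer is an error. Because coverage is nonuniform in the applications named above, a low count alone is not evidence of an error, and a fixed count threshold would not be appropriate.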



Acknowledgements

We would like to thank Dilip Ariyur Durai for his help with the Oases benchmark.

Author information

Authors and Affiliations

Excellence Cluster for Multimodal Computing and Interaction, Saarland University, Saarbrücken, Germany
Marcel H. Schulz
Computational Biology and Applied Algorithms, Max Planck Institute for Informatics, Saarbrücken, Germany
Marcel H. Schulz
Machine Learning Department and Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
Ziv Bar-Joseph


Corresponding author

Correspondence to Marcel H. Schulz.

Editor information

Editors and Affiliations

LaTICE, Tunis, Tunisia
Mourad Elloumi


About this chapter

Cite this chapter

Schulz, M.H., Bar-Joseph, Z. (2017). Probabilistic Models for Error Correction of Nonuniform Sequencing Data. In: Elloumi, M. (eds) Algorithms for Next-Generation Sequencing Data. Springer, Cham. https://doi.org/10.1007/978-3-319-59826-0_6


DOI: https://doi.org/10.1007/978-3-319-59826-0_6
Published: 19 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59824-6
Online ISBN: 978-3-319-59826-0
eBook Packages: Computer Science, Computer Science (R0)
