Accuracy Assessment of Consensus Sequence from Shotgun Sequencing

Li, Lei M.

doi:10.1007/978-3-642-16345-6_1

Part of the book series: Springer Handbooks of Computational Statistics (SHCS)

Abstract

The significance of any genetic or biological implication based on DNA sequencing depends on its accuracy. The statistical evaluation of accuracy requires a probabilistic model of measurement error. In this chapter, we describe two statistical models of sequence assembly from shotgun sequencing, one for a haploid target genome and one for a diploid target genome. The first model allows us to convert quality scores into probabilities. It combines the quality scores of base-calling with the power of alignment to improve sequencing accuracy. Specifically, we start with assembled contigs and represent probabilistic errors by logistic models that take quality scores and other genomic features as covariates. Since the true sequence is unknown, an EM algorithm is used to deal with missing data. The second model describes the case in which DNA reads come from a diploid genome, and our aim is to reconstruct the two haplotypes, including phase information. The statistical model accounts for sequencing errors, compositional information, and the haplotype membership of each DNA fragment. Consequently, optimal haplotype sequences can be inferred by maximizing the probability among all configurations conditional on the given assembly. In addition, this probability, together with the coverage information, provides an assessment of confidence in the reconstruction.
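As a rough illustration of the first idea in the abstract (turning quality scores into probabilities and pooling aligned reads), the sketch below uses the nominal Phred relation Q = -10 log10(p) instead of the fitted logistic models and EM algorithm the chapter describes. The function names, the uniform prior, the independence of reads, and the even split of errors over the three wrong bases are all assumptions made for this example, not the chapter's method.

```python
import math

BASES = "ACGT"


def phred_to_error_prob(q):
    """Nominal Phred relation: Q = -10 * log10(p_error)."""
    return 10.0 ** (-q / 10.0)


def consensus_posterior(column):
    """Posterior over the true base at one aligned position.

    column: list of (called_base, quality_score) pairs, one per read
    covering the position.

    Assumptions (illustrative only): reads are independent, the prior
    over bases is uniform, and a miscall is equally likely to be any
    of the three wrong bases.
    """
    log_like = {b: 0.0 for b in BASES}
    for called, q in column:
        # Cap at 0.75 so a zero-quality call is uninformative
        # rather than producing log(0).
        p_err = min(phred_to_error_prob(q), 0.75)
        for b in BASES:
            p = (1.0 - p_err) if b == called else p_err / 3.0
            log_like[b] += math.log(p)

    # Normalize in log space for numerical stability.
    m = max(log_like.values())
    weights = {b: math.exp(v - m) for b, v in log_like.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


# Three reads cover the position: two call A with high quality,
# one calls C with low quality.
posterior = consensus_posterior([("A", 30), ("A", 20), ("C", 10)])
consensus = max(posterior, key=posterior.get)
print(consensus, 1.0 - posterior[consensus])  # consensus base, error probability
```

Under these assumptions, one minus the posterior of the chosen base plays the role of a per-position error probability for the consensus. A fitted model of the kind the abstract describes would replace `phred_to_error_prob` with probabilities estimated from the assembly itself.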

Acknowledgements

We thank Prof. Michael Waterman for initiating the work reported in this chapter. This work was supported by the NIH CEGS grant to the University of Southern California and by NIH grant R01 GM75308.

Author information

Authors and Affiliations

Institute of Systems Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, P.R. China
Lei M. Li
Computational Biology and Mathematics, University of Southern California, Los Angeles, CA, 90089, USA
Lei M. Li

Authors

Lei M. Li

Corresponding author

Correspondence to Lei M. Li.

Editor information

Editors and Affiliations

Institute of Statistics, National Chiao Tung University, Ta Hsueh Road 1001, Hsinchu, 30050, Taiwan, R.O.C.
Henry Horng-Shing Lu
Department of Empirical Inference, MPI for Intelligent Systems, Spemannstraße 38, Tübingen, 72076, Germany
Bernhard Schölkopf
School of Medicine, Dept. Epidemiology & Public Health, Yale University, College Street 60, New Haven, 06520, Connecticut, USA
Hongyu Zhao

About this chapter

Cite this chapter

Li, L.M. (2011). Accuracy Assessment of Consensus Sequence from Shotgun Sequencing. In: Lu, HS., Schölkopf, B., Zhao, H. (eds) Handbook of Statistical Bioinformatics. Springer Handbooks of Computational Statistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16345-6_1

DOI: https://doi.org/10.1007/978-3-642-16345-6_1
Published: 09 April 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16344-9
Online ISBN: 978-3-642-16345-6
eBook Packages: Mathematics and Statistics (R0)
