Abstract
The significance of any genetic or biological conclusion drawn from DNA sequencing depends on the accuracy of the sequence. Evaluating accuracy statistically requires a probabilistic model of measurement error. In this chapter, we describe two statistical models of sequence assembly from shotgun sequencing, one for a haploid and one for a diploid target genome. The first model allows us to convert quality scores into probabilities. It combines the quality scores from base-calling with the power of alignment to improve sequencing accuracy. Specifically, we start with assembled contigs and represent probabilistic errors by logistic models that take quality scores and other genomic features as covariates. Since the true sequence is unknown, an EM algorithm is used to deal with the missing data. The second model describes the case in which the DNA reads come from a diploid genome, and our aim is to reconstruct the two haplotypes, including phase information. The statistical model accounts for sequencing errors, compositional information, and the haplotype membership of each DNA fragment. Optimal haplotype sequences can then be inferred by maximizing the probability over all configurations, conditional on the given assembly. This probability, together with the coverage information, also provides an assessment of the confidence in the reconstruction.
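As a rough illustration of what converting quality scores into probabilities and combining them across aligned reads can look like, the sketch below uses the standard Phred definition Q = -10 log10(p) and a naive Bayes calculation for a single alignment column. It is not the chapter's logistic model or EM procedure. The function names, the uniform prior over bases, the independence of reads, and the equal split of errors among the three other bases are all assumptions made for this example.

```python
import math

BASES = "ACGT"

def phred_to_error_prob(q):
    """Standard Phred definition: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def column_posterior(calls):
    """Posterior over the true base for one alignment column (haploid case).

    calls: list of (called_base, phred_quality) pairs from the reads
    covering the column.

    Illustrative assumptions (not taken from the chapter): uniform prior
    over A/C/G/T, independent reads, and an erroneous call is equally
    likely to be any of the three other bases.
    """
    log_like = {b: 0.0 for b in BASES}
    for called, q in calls:
        # Clamp at 0.75 so a quality of 0 is treated as uninformative
        # and log(1 - p_err) stays finite.
        p_err = min(phred_to_error_prob(q), 0.75)
        for true_base in BASES:
            if called == true_base:
                log_like[true_base] += math.log(1.0 - p_err)
            else:
                log_like[true_base] += math.log(p_err / 3.0)
    # Normalize in log space for numerical stability.
    peak = max(log_like.values())
    weights = {b: math.exp(v - peak) for b, v in log_like.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}

if __name__ == "__main__":
    # Three reads cover the column: two call A, one low-quality read calls C.
    posterior = column_posterior([("A", 30), ("A", 20), ("C", 10)])
    consensus = max(posterior, key=posterior.get)
    print(consensus, posterior)
```

The posterior probability of the consensus base gives a per-position accuracy estimate. In the chapter's setting, the fixed Phred conversion would be replaced by error probabilities fitted from the assembly itself.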
Acknowledgements
We thank Prof. Michael Waterman for initiating the work reported in this chapter. This work is supported by the NIH CEGS grant to the University of Southern California and the NIH grant R01 GM75308.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Li, L.M. (2011). Accuracy Assessment of Consensus Sequence from Shotgun Sequencing. In: Lu, HS., Schölkopf, B., Zhao, H. (eds) Handbook of Statistical Bioinformatics. Springer Handbooks of Computational Statistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16345-6_1
DOI: https://doi.org/10.1007/978-3-642-16345-6_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16344-9
Online ISBN: 978-3-642-16345-6
eBook Packages: Mathematics and Statistics (R0)