Abstract
Accurate identification of a DNA sequence depends on the ability to precisely track the time-varying signal baseline in all parts of the electrophoretic trace. We propose a statistical learning formulation of the signal background estimation problem that can be solved using an Expectation-Maximization (EM) type algorithm. We also present an alternative method, based on a recursive histogram computation, for estimating the background level of a signal in small windows. Both background estimation algorithms can be combined with regression methods to track slow and fast baseline changes occurring in different regions of a DNA chromatogram. Accurate baseline tracking improves cluster separation and thus helps reduce classification errors when the Bayesian EM (BEM) base-calling system developed in our group (Pereira et al., Discrete Applied Mathematics, 2000) is employed to decide how many bases are “hidden” in every base-call event pattern extracted from the chromatogram.
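To make the windowed approach more concrete, the following is a minimal sketch of histogram-based background estimation combined with a regression step. It is not the algorithm of the paper, whose details are not given in this abstract. In particular, reading "recursive" as "repeatedly re-histogram the most populated bin", the straight-line least-squares fit, the function names (`window_background`, `fit_line`, `estimate_baseline`, `synthetic_trace`), and all parameter values are assumptions made for illustration. The EM formulation is not covered here.

```python
import math
import random


def window_background(samples, bins=8, depth=3):
    """Estimate one window's background level.

    Assumption: 'recursive histogram' is interpreted as zooming into the
    most populated bin `depth` times; the source does not spell this out.
    """
    lo, hi = min(samples), max(samples)
    for _ in range(depth):
        if hi <= lo:                      # flat window: nothing to refine
            break
        width = (hi - lo) / bins
        counts = [0] * bins
        for x in samples:
            if lo <= x <= hi:
                k = min(int((x - lo) / width), bins - 1)
                counts[k] += 1
        best = counts.index(max(counts))  # fullest bin holds background samples
        lo, hi = lo + best * width, lo + (best + 1) * width
    return 0.5 * (lo + hi)


def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); needs >= 2 distinct xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def estimate_baseline(trace, window=200):
    """Per-window background levels, smoothed by a regression line."""
    centers, levels = [], []
    for start in range(0, len(trace) - window + 1, window):
        chunk = trace[start:start + window]
        centers.append(start + (window - 1) / 2)
        levels.append(window_background(chunk))
    slope, intercept = fit_line(centers, levels)
    return [intercept + slope * t for t in range(len(trace))]


def synthetic_trace(n=2000, seed=0):
    """Hypothetical test signal: slow linear drift, periodic peaks, noise."""
    rng = random.Random(seed)
    true_base = [10.0 + 0.002 * t for t in range(n)]
    trace = []
    for t in range(n):
        d = (t % 50) - 25                 # offset from the nearest peak centre
        peak = 100.0 * math.exp(-d * d / 18.0)
        trace.append(true_base[t] + peak + rng.gauss(0.0, 0.2))
    return trace, true_base


trace, true_base = synthetic_trace()
baseline = estimate_baseline(trace)
corrected = [y - b for y, b in zip(trace, baseline)]
```

A single straight line only follows slow drift. Tracking faster baseline changes would call for a more flexible regression model or a piecewise fit over shorter stretches of the trace, and that choice is left open here.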
References
L. Alphey, DNA Sequencing: From Experimental Methods to Bioinformatics, Springer-Verlag, 1997.
T.A. Brown, DNA Sequencing: The Basics, Oxford University Press, 1994.
D. Micklos and G. Freyer, Primer on Molecular Genetics, U.S. Dept. of Energy, 1992.
J. Forrester, "Interpreting DNA Sequencing Results," http://biotech.missouri.edu/dnacore/.
Perkin-Elmer, ABI PRISM, DNA Sequencing Analysis Software, User's Manual, Applied Biosystems, Foster City, CA, 1996.
M. Pereira, L. Andrade, S. El-Difrawy, B. Karger, and E. Manolakos, "Statistical Learning Formulation of the DNA Base-Calling Problem and its Solution Using a Bayesian EM Framework," Discrete Applied Mathematics, vol. 104, no. 1-3, 2000, pp. 229-258.
B. Ewing, L. Hillier, M. Wendl, and P. Green, "Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment," Genome Research, vol. 8, 1998, pp. 175-185.
L. Andrade and E. Manolakos, "Skyline Normalization of DNA Chromatograms by Regression," in Workshop on Genomic Signal Processing and Statistics (GENSIPS), 2002, pp. CP2-07:1-4.
H. Fujii and K. Kashiwagi, "Compensation for Mobility Inequalities between Lanes Computed from Band Signals in On-line Fluorescence DNA Sequencing," Electrophoresis, vol. 13, 1992, pp. 500-505.
S. El-Difrawy and E. Manolakos, "An Analytical Solution to the Mobility Shifts Correction Problem for DNA Chromatograms," in Workshop on Genomic Signal Processing and Statistics (GENSIPS), 2002, pp. CP2-05:1-4.
C.G. Molina and J. Mullikin, "A Probabilistic Approach for Long Read-Length DNA Sequence Analysis," in IEEE Workshop on Neural Networks for Signal Processing (NNSP), Sept. 2002, pp. 45-56.
L. Andrade and E. Manolakos, "Accurate Estimation of the Signal Baseline in DNA Chromatograms," in IEEE Workshop on Neural Networks for Signal Processing (NNSP), Sept. 2002, pp. 35-44.
J. Golden, D. Torgersen, and C. Tibbetts, "Pattern Recognition for Automated DNA Sequencing: I. On-line Signal Conditioning and Feature Extraction for Base-Calling," in First International Conference on Intelligent Systems for Molecular Biology, AAAI Press, 1993, pp. 136-144.
Z. Yin, J. Severin, M.C. Giddings, W. Huang, M.S. Westphall, and L.M. Smith, "Automatic Matrix Determination in Four Dye Fluorescence-Based DNA Sequencing," Electrophoresis, vol. 17, 1996, pp. 1143-1150.
M.C. Giddings, J. Severin, M. Westphall, J. Wu, and L.M. Smith, "A Software System for Data Analysis in Automated DNA Sequencing," Genome Research, vol. 8, 1998, pp. 644-665.
D. Brady, M. Kocic, A. Miller, and B. Karger, "Maximum Likelihood Base-Calling for DNA Sequencing," IEEE Trans. on Biomedical Engineering, vol. 47, no. 9, 2000, pp. 1271-1280.
A. Berno, "A Graph Theoretic Approach to the Analysis of DNA Sequencing Data," Genome Research, vol. 6, no. 2, 1996, pp. 80-91.
T.K. Moon, "The Expectation-Maximization Algorithm," IEEE Signal Processing Magazine, vol. 13, no. 6, 1996, pp. 47-60.
D. Walther, G. Bartha, and M. Morris, "Base-Calling with LifeTrace," Genome Research, vol. 11, 2001, pp. 875-888.
T.D. Yager, L. Baron, R. Batra, A. Bouevitch, D. Chan, K. Chan, S. Darasch, R. Gilchrist, A. Izmailov, J.M. Lacroix, K. Marchelleta, J. Renfrew, D. Rushlow, E. Steinbach, C. Ton, P. Waterhouse, H. Zaleski, J.M. Dunn, and J. Stevens, "High Performance DNA Sequencing, and the Detection of Mutations and Polymorphisms, on the Clipper Sequencer," Electrophoresis, vol. 20, 1999, pp. 1280-1300.
S.B. Needleman and C.D. Wunsch, "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins," Journal of Molecular Biology, vol. 48, 1970, pp. 443-453.
T.F. Smith and M.S. Waterman, "Identification of Common Molecular Subsequences," Journal of Molecular Biology, vol. 147, 1981, pp. 195-197.
Cite this article
Andrade, L., Manolakos, E.S. Signal Background Estimation and Baseline Correction Algorithms for Accurate DNA Sequencing. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 35, 229–243 (2003). https://doi.org/10.1023/B:VLSI.0000003022.86639.1f