Abstract
Given a long string of characters from a constant-size (w.l.o.g. binary) alphabet, we present an algorithm to determine whether its characters have been generated by a single i.i.d. random source. More specifically, consider all possible k-coin models for generating a binary string S, where each bit of S is generated via an independent toss of one of the k coins in the model. The coin to toss is chosen by a random walk on the set of coins, in which the probability of changing coins is much lower than the probability of reusing the current coin. We present a statistical test procedure which, for any given S, determines whether the a posteriori probability for k = 1 is higher than for any k > 1. Our algorithm runs in time O(l^4 log l), where l is the length of S, through a dynamic programming approach that exploits the convexity of the a posteriori probability for k.
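To make the k-coin model concrete, the following Python sketch generates a binary string from such a model and then asks how much better two coins with a single change point explain the string than one coin does. It is only an illustration under stated assumptions: the function names, the uniform choice of the next coin, and the parameters `biases` and `switch_prob` are hypothetical, and the comparison is a plain maximum-likelihood single change-point scan, not the a posteriori test or the O(l^4 log l) dynamic program described above.

```python
import math
import random


def generate_k_coin_string(biases, length, switch_prob, seed=0):
    """Generate bits by tossing one of several coins. Before each toss the
    active coin changes with probability switch_prob (assumed parameters)."""
    rng = random.Random(seed)
    coin = rng.randrange(len(biases))
    bits = []
    for _ in range(length):
        if len(biases) > 1 and rng.random() < switch_prob:
            # Assumption: move to a different coin chosen uniformly at random.
            coin = rng.choice([c for c in range(len(biases)) if c != coin])
        bits.append(1 if rng.random() < biases[coin] else 0)
    return bits


def bernoulli_log_likelihood(ones, n):
    """Maximized log-likelihood of n tosses containing `ones` heads
    under a single coin whose bias is set to the observed frequency."""
    if n == 0 or ones == 0 or ones == n:
        return 0.0
    p = ones / n
    return ones * math.log(p) + (n - ones) * math.log(1 - p)


def best_single_change_gain(bits):
    """Return (gain, position): the largest log-likelihood improvement from
    splitting bits into two segments with separate coins, relative to one
    coin for the whole string, and the split position achieving it."""
    n = len(bits)
    total = sum(bits)
    whole = bernoulli_log_likelihood(total, n)
    best_gain, best_pos, left_ones = 0.0, 0, 0
    for i in range(1, n):
        left_ones += bits[i - 1]
        split = (bernoulli_log_likelihood(left_ones, i)
                 + bernoulli_log_likelihood(total - left_ones, n - i))
        if split - whole > best_gain:
            best_gain, best_pos = split - whole, i
    return best_gain, best_pos
```

Because a maximized likelihood never decreases when a change point is added, the gain is always non-negative. Deciding between k = 1 and k > 1 therefore needs a prior or penalty on the number of coins and coin changes, which is the role the a posteriori probability plays in the test described above.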
The problem we consider arises from two critical applications in analyzing long alignments between pairs of genomic sequences. A high alignment score between two DNA sequences usually indicates an evolutionary relationship, i.e. that the sequences have been generated as a result of one or more copy events followed by random point mutations. Such sequences may include functional regions (e.g. exons) as well as nonfunctional ones (e.g. introns). Functional regions with critical importance exhibit much lower mutation rates than non-functional DNA (or DNA
Supported in part by an NSF Career Award and by the Charles B. Wang Foundation.
Partially supported by the IST Programme of the EU under contract number IST-1999-14186 (ALCOM-FT).
© 2002 Springer-Verlag Berlin Heidelberg
Şahinalp, S.C., Eichler, E., Goldberg, P., Berenbrink, P., Friedetzky, T., Ergun, F. (2002). Statistical Identification of Uniformly Mutated Segments within Repeats. In: Apostolico, A., Takeda, M. (eds) Combinatorial Pattern Matching. CPM 2002. Lecture Notes in Computer Science, vol 2373. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45452-7_21
DOI: https://doi.org/10.1007/3-540-45452-7_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43862-5
Online ISBN: 978-3-540-45452-6
eBook Packages: Springer Book Archive