Abstract
Given a text of length n and a pattern of length m over some (possibly unbounded) alphabet, we consider the problem of finding all positions in the text at which the pattern “almost occurs”. Here by “almost occurs” we mean that at least some fixed fraction ρ of the characters of the pattern (for example, ≥ 60% of them) are equal to their corresponding characters in the text. We design a randomized algorithm that has O(n log m) worst-case time complexity and computes with high probability all of the almost-occurrences of the pattern in the text. This algorithm assumes that the fraction ρ is given as part of its input, and it works well even for relatively small values of ρ. It makes no assumptions about the probabilistic characteristics of the input. Our second contribution deals with the issue of which values of ρ correspond to the intuitive notion of similarity between pattern and text, and this leads us to the development of a probabilistic analysis for the case where both input strings are random (in the usual, i.e., Bernoulli, model).
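To make the definition of an almost-occurrence concrete, the following is a minimal brute-force sketch. It is not the randomized O(n log m) algorithm described in the abstract; it only checks the definition directly in O(nm) time. The function name `almost_occurrences` and its parameters are hypothetical, chosen here for illustration.

```python
def almost_occurrences(text, pattern, rho):
    """Return every 0-based position i such that at least a fraction rho
    of the pattern's characters equal the aligned characters of text.

    Brute-force baseline, O(n*m) time. This illustrates the definition
    only; it is NOT the randomized algorithm from the paper.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    hits = []
    for i in range(n - m + 1):
        # Count aligned positions where pattern and text agree.
        matches = sum(1 for j in range(m) if text[i + j] == pattern[j])
        # Floating-point comparison; use fractions.Fraction for rho if
        # exact behaviour at the threshold matters.
        if matches >= rho * m:
            hits.append(i)
    return hits


if __name__ == "__main__":
    # With rho = 0.6 and m = 3, at least 2 of 3 characters must match.
    print(almost_occurrences("abcabdabe", "abc", 0.6))
```

Because the comparison is plain character equality, the sketch places no restriction on the alphabet, in line with the possibly unbounded alphabet mentioned above.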
The first author's research was supported by the Office of Naval Research under Grants N0014-84-K-0502 and N0014-36-K-0689, in part by AFOSR Grant 90-0107, by the NSF under Grant DCR-8451393, and in part by Grant R01 LM05118 from the National Library of Medicine. The second author was supported by NATO Collaborative Grant 0057/89. The third author's research was supported by AFOSR Grant 90-0107 and NATO Collaborative Grant 0057/89 and, in part, by NSF Grant CCR-8900305 and by Grant R01 LM05118 from the National Library of Medicine.
Copyright information
© 1992 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Atallah, M.J., Jacquet, P., Szpankowski, W. (1992). Pattern matching with mismatches: A probabilistic analysis and a randomized algorithm. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds) Combinatorial Pattern Matching. CPM 1992. Lecture Notes in Computer Science, vol 644. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56024-6_3
DOI: https://doi.org/10.1007/3-540-56024-6_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-56024-1
Online ISBN: 978-3-540-47357-2
eBook Packages: Springer Book Archive