Pattern matching with mismatches: A probabilistic analysis and a randomized algorithm

Atallah, Mikhail J.; Jacquet, Philippe; Szpankowski, Wojciech

doi:10.1007/3-540-56024-6_3

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 644)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

Given a text of length n and a pattern of length m over some (possibly unbounded) alphabet, we consider the problem of finding all positions in the text at which the pattern “almost occurs”. Here by “almost occurs” we mean that at least some fixed fraction ρ of the characters of the pattern (for example, ≥ 60% of them) are equal to their corresponding characters in the text. We design a randomized algorithm that has O(n log m) worst-case time complexity and computes with high probability all of the almost-occurrences of the pattern in the text. This algorithm assumes that the fraction ρ is given as part of its input, and it works well even for relatively small values of ρ. It makes no assumptions about the probabilistic characteristics of the input. Our second contribution deals with the issue of which values of ρ correspond to the intuitive notion of similarity between pattern and text, and this leads us to the development of a probabilistic analysis for the case where both input strings are random (in the usual, i.e., Bernoulli, model).
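The problem statement above can be made concrete with a brute-force baseline. The sketch below is not the randomized O(n log m) algorithm of the paper, which the abstract does not describe in detail; it is a straightforward O(nm) check of the definition, useful only as a reference against which a faster method could be validated. The function name `almost_occurrences` and its parameters are illustrative assumptions, not names from the paper.

```python
def almost_occurrences(text, pattern, rho):
    """Return every start position i such that at least a fraction rho
    of the pattern characters equal the aligned text characters.

    Brute-force O(n*m) reference, NOT the paper's randomized algorithm.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    hits = []
    for i in range(n - m + 1):
        # Count positions j where pattern[j] matches text[i + j].
        matches = sum(1 for j in range(m) if text[i + j] == pattern[j])
        # "Almost occurs" means matches / m >= rho.
        if matches >= rho * m:
            hits.append(i)
    return hits


if __name__ == "__main__":
    # With rho = 0.75 and m = 4, at least 3 of 4 characters must match.
    print(almost_occurrences("abcdabxdxxxd", "abcd", 0.75))
```

Because the comparison uses only equality of symbols, the sketch works for any alphabet, including an unbounded one, which matches the setting of the abstract.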

The first author's research was supported by the Office of Naval Research under Grants N0014-84-K-0502 and N0014-36-K-0689, and in part by AFOSR Grant 90-0107, by the NSF under Grant DCR-8451393, and by Grant R01 LM05118 from the National Library of Medicine. The second author was supported by NATO Collaborative Grant 0057/89. The third author's research was supported by AFOSR Grant 90-0107 and NATO Collaborative Grant 0057/89, and in part by NSF Grant CCR-8900305 and by Grant R01 LM05118 from the National Library of Medicine.

Author information

Authors and Affiliations

Dept. of Computer Science, Purdue University, 47907, W. Lafayette, IN, USA
Mikhail J. Atallah
INRIA, Rocquencourt, 78153, Le Chesnay Cedex, France
Philippe Jacquet
Dept. of Computer Science, Purdue University, 47907, W. Lafayette, IN, USA
Wojciech Szpankowski

Editor information

Alberto Apostolico, Maxime Crochemore, Zvi Galil, Udi Manber

Cite this paper

Atallah, M.J., Jacquet, P., Szpankowski, W. (1992). Pattern matching with mismatches: A probabilistic analysis and a randomized algorithm. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds) Combinatorial Pattern Matching. CPM 1992. Lecture Notes in Computer Science, vol 644. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56024-6_3

DOI: https://doi.org/10.1007/3-540-56024-6_3
Published: 04 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-56024-1
Online ISBN: 978-3-540-47357-2
eBook Packages: Springer Book Archive