A Fast Algorithm for Approximate String Matching on Gene Sequences

Liu, Zheng; Chen, Xin; Borneman, James; Jiang, Tao

doi:10.1007/11496656_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3537)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

Approximate string matching is a fundamental and challenging problem in computer science, and fast algorithms for it are in high demand in many applications, including text processing and DNA sequence analysis. In this paper, we present a fast algorithm for approximate string matching, called FAAST. It addresses a popular variant of the problem, the k-mismatch problem, whose objective is to find all occurrences of a short pattern in a long text string with at most k mismatches. FAAST generalizes the well-known Tarhio-Ukkonen algorithm by requiring two or more matches when calculating shift distances, which makes the matching process significantly faster. On the theoretical side, we prove that FAAST on average skips more characters in a single shift than the Tarhio-Ukkonen algorithm and makes fewer character comparisons over an entire matching process. Experiments on both simulated data sets and real gene sequences also show that FAAST runs several times faster than the Tarhio-Ukkonen algorithm in all the cases that we tested.
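To make the k-mismatch problem concrete, the following Python sketch shows a brute-force baseline next to a simplified single-character shift search in the spirit of the Tarhio-Ukkonen approach. It is not FAAST and is not taken from the paper: the function names are hypothetical, and the shift rule (the minimum, over the last k+1 window characters, of the distance to the nearest earlier pattern occurrence of that character, capped at m - k) is an assumption about how such a shift can be computed safely.

```python
from typing import Dict, List


def k_mismatch_naive(text: str, pattern: str, k: int) -> List[int]:
    """Brute force: every 0-based start where pattern fits with <= k mismatches."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        mismatches = sum(1 for a, b in zip(pattern, window) if a != b)
        if mismatches <= k:
            hits.append(start)
    return hits


def build_shift_table(pattern: str, k: int) -> List[Dict[str, int]]:
    """One dict per pattern position i among the last k+1 positions.

    Each dict maps a character c to the smallest s >= 1 such that
    pattern[i - s] == c. Characters that never occur to the left of i
    are absent; the search treats them as distance m.
    """
    m = len(pattern)
    table = []
    for i in range(m - k - 1, m):
        nearest: Dict[str, int] = {}
        for s in range(1, i + 1):
            # s increases, so setdefault keeps the smallest distance.
            nearest.setdefault(pattern[i - s], s)
        table.append(nearest)
    return table


def k_mismatch_shift(text: str, pattern: str, k: int) -> List[int]:
    """Shift-based search (assumed Tarhio-Ukkonen-style rule, see lead-in)."""
    m, n = len(pattern), len(text)
    if not 0 <= k < m:
        raise ValueError("k must satisfy 0 <= k < len(pattern)")
    table = build_shift_table(pattern, k)
    hits = []
    start = 0
    while start + m <= n:
        window = text[start:start + m]
        mismatches = sum(1 for a, b in zip(pattern, window) if a != b)
        if mismatches <= k:
            hits.append(start)
        # Any shift smaller than this value would misalign all of the
        # last k+1 window characters, giving more than k mismatches.
        shift = m - k
        for offset, nearest in enumerate(table):
            i = m - k - 1 + offset
            shift = min(shift, nearest.get(window[i], m))
        start += shift
    return hits


print(k_mismatch_naive("ACGTTACGATACGT", "ACGT", 1))  # → [0, 5, 10]
print(k_mismatch_shift("ACGTTACGATACGT", "ACGT", 1))  # → [0, 5, 10]
```

In this example the shift-based search inspects only five alignments (starts 0, 3, 5, 8, 10) instead of eleven. According to the abstract, FAAST obtains longer shifts than the single-character rule by requiring two or more matches when computing shift distances; that refinement is not implemented here.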

References

Baeza-Yates, R., Gonnet, G.H.: A New Approach to Text Searching. Communications of the ACM 35(10) (1992)
Baeza-Yates, R.A., Gonnet, G.H.: Fast String Matching with Mismatches. Information and Computation 108, 187–199 (1994)
Boyer, R.S., Moore, J.S.: A fast string searching algorithm. Communications of the ACM 20(10), 762–772 (1977)
Cornish-Bowden, A.: Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucl. Acids Res. 13, 3021–3030 (1985)
El-Mabrouk, N., Crochemore, M.: Boyer-Moore strategy to efficient approximate string matching. In: Hirschberg, D.S., Myers, G. (eds.) CPM 1996. LNCS, vol. 1075, pp. 24–38. Springer, Heidelberg (1996)
Horspool, R.N.: Practical fast searching in strings. Software - Practice and Experience 10, 501–506 (1980)
Navarro, G., Raffinot, M.: Fast Regular Expression Search. In: Vitter, J.S., Zaroliagis, C.D. (eds.) WAE 1999. LNCS, vol. 1668, pp. 198–212. Springer, Heidelberg (1999)
Navarro, G.: Approximate Regular Expression Searching with Arbitrary Integer Weights. In: Ibaraki, T., Katoh, N., Ono, H. (eds.) ISAAC 2003. LNCS, vol. 2906, pp. 230–239. Springer, Heidelberg (2003)
Tarhio, J., Ukkonen, E.: Approximate Boyer-Moore String Matching. SIAM J. Comput. 22, 243–260 (1993)
Valinsky, L., Scupham, A., Vedova, G.D., Liu, Z., Figueroa, A., Jampachaisri, K., Yin, B., Bent, E., Mancini-Jones, R., Press, J., Jiang, T., Borneman, J.: Oligonucleotide Fingerprinting of Ribosomal RNA Genes (OFRG). In: Kowalchuk, G.A., de Bruijn, F.J., Head, I.M., Akkermans, A.D.L., van Elsas, J.D. (eds.) Molecular Microbial Ecology Manual, 2nd edn., pp. 569–585. Kluwer Academic Publishers, Dordrecht (2004)

Author information

Authors and Affiliations

Department of Computer Science, University of California, Riverside
Zheng Liu, Xin Chen & Tao Jiang
Department of Plant Pathology, University of California, Riverside
James Borneman

Editor information

Editors and Affiliations

Georgia Institute of Technology and Università di Padova
Alberto Apostolico
Université Paris-Est, France
Maxime Crochemore
School of Computer Science and Engineering, Seoul National University, 151-742, Seoul, Korea
Kunsoo Park

About this paper

Cite this paper

Liu, Z., Chen, X., Borneman, J., Jiang, T. (2005). A Fast Algorithm for Approximate String Matching on Gene Sequences. In: Apostolico, A., Crochemore, M., Park, K. (eds) Combinatorial Pattern Matching. CPM 2005. Lecture Notes in Computer Science, vol 3537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11496656_8

DOI: https://doi.org/10.1007/11496656_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26201-5
Online ISBN: 978-3-540-31562-9
eBook Packages: Computer Science, Computer Science (R0)
