Fast Matching Method for DNA Sequences

Kim, Jin Wook; Kim, Eunsang; Park, Kunsoo

doi:10.1007/978-3-540-74450-4_25

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4614)

Included in the following conference series:

International Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies

Abstract

DNA sequences are the fundamental information for each species and a comparison between DNA sequences of different species is an important task. Since DNA sequences are very long and there exist many species, not only fast matching but also efficient storage is an important factor for DNA sequences. Thus, a fast string matching method suitable for encoded DNA sequences is needed. In this paper, we present a fast string matching method for encoded DNA sequences which does not decode DNA sequences while matching. We use four-characters-to-one-byte encoding and combine a suffix approach and a multi-pattern matching approach. Experimental results show that our method is about 5 times faster than AGREP and the fastest among known algorithms.
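
As a rough illustration of the four-characters-to-one-byte idea, the sketch below packs each base into 2 bits and searches the packed bytes directly. It is not the authors' algorithm: the bit assignment in `CODE`, the helper names `encode`, `base_at` and `find_all`, and the use of Python's built-in `bytes.find` in place of a suffix-based multi-pattern matcher are all assumptions made for illustration. What it does show is why several patterns arise: a match may begin at any of four positions inside a byte, so one DNA pattern yields up to four byte-aligned search keys.

```python
# Hypothetical 2-bit code; the bit assignment used in the paper is not given here.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq: str) -> bytes:
    """Pack four bases per byte, first base in the two highest bits.

    A trailing partial group is padded with zero bits.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODE[ch]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final group
        out.append(byte)
    return bytes(out)


def base_at(packed: bytes, i: int) -> int:
    """Return the 2-bit code of base i without unpacking the text."""
    return (packed[i >> 2] >> (6 - 2 * (i & 3))) & 3


def find_all(packed: bytes, n: int, pattern: str) -> list[int]:
    """Return all start positions of `pattern` in a packed text of n bases."""
    m = len(pattern)
    pcodes = [CODE[ch] for ch in pattern]
    hits = set()
    for offset in range(4):
        # If the match starts at in-byte position `offset`, its first
        # `skip` bases share a byte with the preceding text.
        skip = (4 - offset) % 4
        full = max(0, (m - skip) // 4)           # whole bytes inside the pattern
        key = encode(pattern[skip:skip + 4 * full])
        # Stand-in for a real multi-pattern matcher: scan for each key in turn.
        j = packed.find(key)
        while j != -1:
            start = 4 * j - skip
            if start >= 0 and start + m <= n and all(
                base_at(packed, start + k) == pcodes[k] for k in range(m)
            ):
                hits.add(start)
            j = packed.find(key, j + 1)
    return sorted(hits)


text = "ACGTACGTTTGACCA"
packed = encode(text)
print(find_all(packed, len(text), "TTGACCA"))
```

The text length `n` is passed separately because the final byte may contain padding, and every candidate is verified base by base through `base_at` so that the partial bytes at both ends of a match are checked. For patterns shorter than seven bases some keys are empty, and the search degenerates to testing every aligned position.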

This work was supported by FPR05A2-341 of 21C Frontier Functional Proteomics Project from Korean Ministry of Science & Technology.

Author information

Authors and Affiliations

HM Research, San 56-1, Sillim-dong, Gwanak-gu, Seoul, 151-742, Korea
Jin Wook Kim
School of Computer Science and Engineering, Seoul National University, Seoul, 151-742, Korea
Eunsang Kim & Kunsoo Park

Editor information

Bo Chen, Mike Paterson, Guochuan Zhang

About this paper

Cite this paper

Kim, J.W., Kim, E., Park, K. (2007). Fast Matching Method for DNA Sequences. In: Chen, B., Paterson, M., Zhang, G. (eds) Combinatorics, Algorithms, Probabilistic and Experimental Methodologies. ESCAPE 2007. Lecture Notes in Computer Science, vol 4614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74450-4_25

DOI: https://doi.org/10.1007/978-3-540-74450-4_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74449-8
Online ISBN: 978-3-540-74450-4
eBook Packages: Computer Science, Computer Science (R0)
