Abstract
DNA sequences are the fundamental information for each species, and comparing the DNA sequences of different species is an important task. Because DNA sequences are very long and there are many species, efficient storage matters as much as fast matching. A fast string matching method suited to encoded DNA sequences is therefore needed. In this paper, we present a fast string matching method for encoded DNA sequences that does not decode the sequences while matching. We use a four-characters-to-one-byte encoding and combine a suffix approach with a multi-pattern matching approach. Experimental results show that our method is about 5 times faster than AGREP and is the fastest among known algorithms.
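As a rough illustration of the four-characters-to-one-byte encoding, the sketch below packs each base into 2 bits. The bit assignment (A=0, C=1, G=2, T=3), the high-bits-first order, the zero padding of the last byte, and the names `encode`, `decode`, and `aligned_cores` are all assumptions made for this example; the abstract does not specify them. The helper `aligned_cores` is likewise only a guess at why a multi-pattern approach is useful here: a pattern may begin at any of four positions inside a byte, so a single pattern gives rise to up to four different byte sequences to search for. It is not the authors' algorithm.

```python
# Assumed 2-bit code per base; the source does not state the mapping.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(seq):
    """Pack four bases per byte, first base in the two highest bits.

    Returns (packed bytes, number of bases). A short final chunk is
    padded with zero bits, so the length is needed to decode exactly.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | CODE[ch]
        byte <<= 2 * (4 - len(chunk))  # zero-pad a short final chunk
        out.append(byte)
    return bytes(out), len(seq)


def decode(packed, length):
    """Inverse of encode: unpack bytes and drop the padding bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


def aligned_cores(pattern):
    """Hypothetical helper: for each start offset k = 0..3 inside a byte,
    return the packed bytes that the pattern covers completely.

    The partial bytes at either end are left out and would have to be
    verified separately (for example with bit masks).
    """
    cores = []
    for k in range(4):
        skip = (4 - k) % 4               # bases before the first full byte
        body = pattern[skip:]
        full = len(body) - len(body) % 4  # bases that fill whole bytes
        cores.append(encode(body[:full])[0])
    return cores


if __name__ == "__main__":
    packed, n = encode("ACGTA")
    print(packed.hex(), n)
    print(decode(packed, n))
    print([c.hex() for c in aligned_cores("ACGTACGTAC")])
```

Under this scheme the encoded text is one quarter the size of the one-byte-per-base text, and a matcher can compare whole bytes, that is four bases at a time, without decoding.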
This work was supported by FPR05A2-341 of the 21C Frontier Functional Proteomics Project from the Korean Ministry of Science & Technology.
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kim, J.W., Kim, E., Park, K. (2007). Fast Matching Method for DNA Sequences. In: Chen, B., Paterson, M., Zhang, G. (eds) Combinatorics, Algorithms, Probabilistic and Experimental Methodologies. ESCAPE 2007. Lecture Notes in Computer Science, vol 4614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74450-4_25
DOI: https://doi.org/10.1007/978-3-540-74450-4_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74449-8
Online ISBN: 978-3-540-74450-4
eBook Packages: Computer Science (R0)