Abstract
We study the problem of structured motif search in DNA sequences. This is a fundamental task in bioinformatics that contributes to a better understanding of genome characteristics and properties. We propose an efficient algorithm for Exact Match, Overlapping Structured motif search (EMOS), which uses a suffix tree index we proposed earlier and runs on a typical desktop computer. We have conducted numerous experiments to evaluate EMOS and compared its performance with the best known solution, SMOTIF1 [1]. While in some cases the search time of EMOS is comparable to that of SMOTIF1, EMOS is on average 5 to 6 times faster.
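To make the problem statement concrete, the following is a minimal sketch of exact-match structured motif search by naive scanning. It is not the EMOS algorithm and does not use a suffix tree; it only illustrates the kind of query being answered. It assumes a structured motif is an ordered list of simple motifs separated by gaps with minimum and maximum lengths, and that "overlapping" means occurrences may share positions in the sequence. The names `find_all`, `structured_motif_search`, `motifs` and `gaps` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def find_all(text: str, pattern: str) -> List[int]:
    """Return start positions of all exact occurrences, including overlapping ones."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one character only, so overlapping occurrences are kept.
        start = text.find(pattern, start + 1)
    return positions


def structured_motif_search(
    text: str,
    motifs: List[str],
    gaps: List[Tuple[int, int]],
) -> List[Tuple[int, ...]]:
    """Naive structured motif search (illustrative baseline, not EMOS).

    motifs: simple motifs in the order they must appear.
    gaps:   (min_gap, max_gap) allowed between consecutive motifs,
            measured from the end of one motif to the start of the next.
    Returns one tuple of start positions per structured occurrence.
    """
    if not motifs or len(gaps) != len(motifs) - 1:
        raise ValueError("need at least one motif and exactly len(motifs) - 1 gaps")

    # Occurrence lists for every simple motif, computed once.
    occurrences = [find_all(text, m) for m in motifs]
    results: List[Tuple[int, ...]] = []

    def extend(i: int, chain: List[int]) -> None:
        # chain holds the start positions chosen for motifs[0..i-1].
        if i == len(motifs):
            results.append(tuple(chain))
            return
        prev_end = chain[-1] + len(motifs[i - 1])
        min_gap, max_gap = gaps[i - 1]
        for pos in occurrences[i]:
            if min_gap <= pos - prev_end <= max_gap:
                extend(i + 1, chain + [pos])

    for pos in occurrences[0]:
        extend(1, [pos])
    return results


if __name__ == "__main__":
    dna = "ACGTTACGAACGT"
    # Two copies of ACG separated by 0 to 3 arbitrary bases.
    print(structured_motif_search(dna, ["ACG", "ACG"], [(0, 3)]))
```

A scan like this revisits the sequence for every simple motif, which is the cost an index-based approach is designed to avoid on genome-scale inputs.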
References
Zhang, Y., Zaki, M.J.: SMOTIF: efficient structured pattern and profile motif search. Algorithms for Molecular Biology, 1–22 (November 2006)
McCarthy, E., McDonald, J.: LTR_STRUC: A Novel Search and Identification Program for LTR Retrotransposons. Bioinformatics 19(3), 362–367 (2003)
Feschotte, C., Jiang, N., Wessler, S.: Plant transposable elements: where genetics meets genomics. Nature Reviews Genetics 3(5), 329–341 (2002)
Jurka, J., Kapitonov, V., Pavlicek, A., Klonowski, P., Kohany, O., Walichiewicz, J.: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110(1-4), 462–467 (2005)
Policriti, A., Vitacolonna, N., Morgante, M., Zuccolo, A.: Structured Motif Search. In: Int’l Conf. on Research in Computational Molecular Biology, pp. 133–139 (2004)
Mehldau, G., Myers, G.: A System for Pattern Matching Applications on Biosequences. Computer Applications in the Biosciences 9(3), 299–314 (1993)
Myers, E.: Approximate Matching of Network Expressions with Spacers. J. Comput. Biol. 3(1), 33–51 (1996)
Navarro, G., Raffinot, M.: Fast and Simple Character Classes and Bounded Gaps Pattern Matching, with Application to Protein Searching. J. Comput. Biol. 10(6), 903–923 (2003)
Zaki, M.J.: SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal 42(1/2), 1–31 (2001)
Zaki, M.J.: Sequence Mining in Categorical Domains: Incorporating Constraints. In: ACM Int’l Conf on Information and Knowledge Management, pp. 422–429 (2000)
Gusfield, D.: Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, Cambridge (1997)
Halachev, M., Shiri, N., Thamildurai, A.: Efficient and scalable indexing techniques for biological sequence data. In: Hochreiter, S., Wagner, R. (eds.) BIRD 2007. LNCS (LNBI), vol. 4414, pp. 464–479. Springer, Heidelberg (2007)
FASST web-interface, http://sepehr.cs.concordia.ca/
Giegerich, R., Kurtz, S., Stoye, J.: Efficient implementation of lazy suffix trees. Software – Practice and Experience 33(11), 1035–1049 (2003)
Tian, Y., Tata, S., Hankins, R.A., Patel, J.: Practical methods for constructing suffix trees. VLDB Journal 14(3), 281–299 (2005)
Human Genome Data, ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes
SMOTIF1 source code, http://www.cs.rpi.edu/~zaki/software/sMotif/
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Halachev, M., Shiri, N. (2008). Fast Structured Motif Search in DNA Sequences. In: Elloumi, M., Küng, J., Linial, M., Murphy, R.F., Schneider, K., Toma, C. (eds) Bioinformatics Research and Development. BIRD 2008. Communications in Computer and Information Science, vol 13. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70600-7_5
DOI: https://doi.org/10.1007/978-3-540-70600-7_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70598-7
Online ISBN: 978-3-540-70600-7
eBook Packages: Computer Science, Computer Science (R0)