Fast Motif Search in Protein Sequence Databases

Zheleva, Elena; Arslan, Abdullah N.

doi:10.1007/11753728_67

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3967)

Included in the following conference series: International Computer Science Symposium in Russia

Abstract

Regular expression pattern matching is widely used in computational biology. Searching a database of sequences for a motif (a simple regular expression) or its variations is an important interactive process that requires fast motif-matching algorithms. In this paper, we explore and evaluate various suffix-tree representations of the sequence database for two types of query problems for a given regular expression: 1) find the first match, and 2) find all matches. A fast answer to Problem 1 makes interactive motif exploration more effective. We propose a framework in which Problem 1 can be solved faster than with existing solutions without slowing down the solution of Problem 2. We apply several heuristics at two levels: at suffix tree creation, which results in modified tree representations, and at regular expression matching, where we search subtrees in a predefined order by simulating a deterministic finite automaton created from the given regular expression. The focus of our work is a method for faster retrieval of PROSITE motif (a restricted regular expression) matches from a protein sequence database. We show the effectiveness of our solution empirically on several real protein data sets.
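The two query problems in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It makes several simplifying assumptions: a depth-bounded, uncompressed suffix trie stands in for a suffix tree, only fixed-length PROSITE-style patterns are supported (single residues, x, [..], {..}, and a fixed repeat such as x(2)), and the automaton is therefore a simple chain with one state per pattern position. All names (parse_motif, build_trie, first_match, all_matches) are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

AMINO = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def parse_motif(pattern: str) -> List[Set[str]]:
    """Expand a fixed-length PROSITE-style pattern into one residue set per position.

    Assumed subset: variable repeats such as x(2,4) and anchors are not handled.
    """
    classes: List[Set[str]] = []
    for element in pattern.rstrip(".").split("-"):
        repeat = 1
        if element.endswith(")"):  # fixed repeat, e.g. x(2)
            element, count = element[:-1].split("(")
            repeat = int(count)
        if element == "x":
            allowed = set(AMINO)
        elif element.startswith("["):  # any of the listed residues
            allowed = set(element[1:-1])
        elif element.startswith("{"):  # any residue except those listed
            allowed = AMINO - set(element[1:-1])
        else:
            allowed = {element}
        classes.extend([allowed] * repeat)
    return classes


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        # (sequence index, offset) of every suffix whose prefix reaches this node
        self.starts: List[Tuple[int, int]] = []


def build_trie(sequences: List[str], depth: int) -> Node:
    """Insert the first `depth` characters of every suffix of every sequence."""
    root = Node()
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:start + depth]:
                node = node.children.setdefault(ch, Node())
                node.starts.append((seq_index, start))
    return root


def matches(root: Node, classes: List[Set[str]]) -> Iterator[Tuple[int, int]]:
    """Depth-first walk of the trie; the automaton state is the pattern position.

    The trie depth must be at least len(classes) for matches to be found.
    """
    stack = [(root, 0)]
    while stack:
        node, state = stack.pop()
        if state == len(classes):  # accepting state: every suffix here matches
            yield from node.starts
            continue
        for ch in sorted(node.children, reverse=True):
            if ch in classes[state]:
                stack.append((node.children[ch], state + 1))


def first_match(root: Node, classes: List[Set[str]]) -> Optional[Tuple[int, int]]:
    """Problem 1: stop the traversal as soon as one match is found."""
    return next(matches(root, classes), None)


def all_matches(root: Node, classes: List[Set[str]]) -> List[Tuple[int, int]]:
    """Problem 2: exhaust the traversal and report every match."""
    return sorted(matches(root, classes))


sequences = ["MKCAACGH", "ACDCGGC"]
trie = build_trie(sequences, depth=8)
motif = parse_motif("C-x(2)-C")
print(first_match(trie, motif))  # (0, 2)
print(all_matches(trie, motif))  # [(0, 2), (1, 3)]
```

Because matches is a generator, first_match does only as much traversal as is needed to reach one accepting node, while all_matches reuses the same walk to completion. The order in which children are pushed onto the stack is the point where a subtree-ordering heuristic of the kind the abstract mentions could be plugged in; the alphabetical order used here is an arbitrary placeholder.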

Author information

Authors and Affiliations

Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
Elena Zheleva
Department of Computer Science, University of Vermont, Burlington, VT, 05405, USA
Abdullah N. Arslan

Editor information

Editors and Affiliations

IRMAR, Université de Rennes, Campus de Beaulieu, 35042, Rennes Cedex, France
Dima Grigoriev
Intel Corporation, JF1-13, 2111 NE 25th Avenue, 97124, Hillsboro, OR, USA
John Harrison
Steklov Institute of Mathematics at St. Petersburg, 27 Fontanka, 191023, St. Petersburg, Russia
Edward A. Hirsch

About this paper

Cite this paper

Zheleva, E., Arslan, A.N. (2006). Fast Motif Search in Protein Sequence Databases. In: Grigoriev, D., Harrison, J., Hirsch, E.A. (eds) Computer Science – Theory and Applications. CSR 2006. Lecture Notes in Computer Science, vol 3967. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11753728_67

DOI: https://doi.org/10.1007/11753728_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34166-6
Online ISBN: 978-3-540-34168-0
eBook Packages: Computer Science, Computer Science (R0)
