Abstract
Genomic sequence databases have been widely used by molecular biologists for homology searching. However, as amino acid and nucleotide databases grow at an alarming rate, the traditional brute-force approach of comparing a query sequence against every database sequence is becoming prohibitively expensive. In this paper, we re-examine the problem of searching for homology in large protein databases. We propose a novel filter-and-refine approach to speed up the search process. The scheme operates in two phases. In the filtering phase, a small set of candidate sequences (small relative to the whole database) is quickly identified using a signature-based scheme. In the refinement phase, the query sequence is matched against the sequences in the candidate set using any local alignment strategy. Our preliminary experimental results show that, compared with FASTA, the proposed method yields significant savings in computation without sacrificing the accuracy of the answers.
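The two phases can be illustrated with a minimal Python sketch. The abstract does not say how signatures are built or which alignment method is used, so everything concrete here is an assumption made for illustration: the signature is a 400-bit bitmap with one bit per amino-acid 2-mer, the filter keeps sequences that share at least `min_shared` bits with the query, and the refinement step is a plain Smith-Waterman score with hypothetical match, mismatch, and gap values. The names `signature`, `filter_candidates`, `smith_waterman`, and `search` are likewise hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
K = 2  # assumed k-mer length; gives 20**2 = 400 signature bits


def signature(seq: str, k: int = K) -> int:
    """Bitmap with one bit set for each distinct k-mer in seq (assumed scheme)."""
    sig = 0
    for i in range(len(seq) - k + 1):
        code = 0
        for ch in seq[i:i + k]:
            if ch not in INDEX:
                break  # skip k-mers containing non-standard residues
            code = code * len(AMINO_ACIDS) + INDEX[ch]
        else:
            sig |= 1 << code
    return sig


def filter_candidates(query_sig: int, index: Dict[str, int],
                      min_shared: int) -> List[str]:
    """Filtering phase: keep ids whose signature shares enough bits with the query."""
    return [seq_id for seq_id, sig in index.items()
            if bin(query_sig & sig).count("1") >= min_shared]


def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -2) -> int:
    """Refinement phase: best local alignment score with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query: str, db: Dict[str, str],
           min_shared: int = 4) -> List[Tuple[str, int]]:
    """Filter by signature, then align only the surviving candidates."""
    # In practice the signature index would be built once, ahead of time.
    index = {seq_id: signature(seq) for seq_id, seq in db.items()}
    candidates = filter_candidates(signature(query), index, min_shared)
    scored = [(seq_id, smith_waterman(query, db[seq_id])) for seq_id in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


db = {
    "p1": "MKTAYIAKQRQISFVKSHFSRQ",
    "p2": "GGGGGGGGGGGG",
    "p3": "AKQRQISFVK",
}
hits = search("YIAKQRQISF", db)
```

In this toy database, `p2` shares no 2-mers with the query and is discarded by the filter, so the quadratic-time alignment runs on only two of the three sequences. The threshold `min_shared` trades speed against the risk of filtering out a true homolog. How the paper's scheme handles that trade-off is not stated in the abstract.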
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ong, TH., Tan, KL., Wang, H. (2002). Indexing Genomic Databases for Fast Homology Searching. In: Hameurlain, A., Cicchetti, R., Traunmüller, R. (eds) Database and Expert Systems Applications. DEXA 2002. Lecture Notes in Computer Science, vol 2453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46146-9_86
DOI: https://doi.org/10.1007/3-540-46146-9_86
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44126-7
Online ISBN: 978-3-540-46146-3
eBook Packages: Springer Book Archive