Definition
Biological sequence databases are mainly composed of DNA, RNA, and protein sequences. DNA and RNA sequences are polymers of nucleotides, whereas proteins are polymers of amino acids. A database of biological sequences contains a set of biological sequences of the same type. The length of each sequence varies from fewer than a hundred to several hundred million bases. An index structure on a database of biological sequences helps to quickly identify the sequences in that database that are similar to a given query sequence. The definition of similarity depends on two orthogonal parameters: the similarity function and the length of the similarity of interest.
The simplest similarity function is the edit distance, which measures the number of substitutions, insertions, and deletions needed to transform one sequence into the other. More complex functions involve variable gap penalties and substitution scores based on how frequently substitutions are observed in nature. The length of the...
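As an illustration of the edit distance mentioned above, the following is a minimal sketch of the standard dynamic-programming computation with unit costs. The function name `edit_distance` and the example sequences are assumptions made for illustration; they do not come from the entry, and the sketch does not model the variable gap penalties or substitution scores of the more complex functions.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions
    needed to transform sequence a into sequence b (unit costs)."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))
```

Here `edit_distance("ACGT", "AGT")` evaluates to 1, since deleting the single base `C` is enough. The computation takes time proportional to the product of the two sequence lengths, which is one reason an index structure is useful for avoiding a full comparison against every sequence in a large database.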
© 2009 Springer Science+Business Media, LLC
Kahveci, T. (2009). Index Structures for Biological Sequences. In: LIU, L., ÖZSU, M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-39940-9_1434
DOI: https://doi.org/10.1007/978-0-387-39940-9_1434
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-35544-3
Online ISBN: 978-0-387-39940-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering