Abstract
As biological databases grow larger, efficient querying of the biological sequences in these databases has become an increasingly important issue for researchers. Few systems currently offer fast access to very large biological sequences. In this paper, we propose a new approach to similarity querying of biological sequences in databases. The general idea is to first transform the biological sequences into vectors and then into 2-d points in planes; then to index these points with a spatial index and self-organizing maps (SOM); and finally to perform a single efficient similarity query (with multiple simultaneous input sequences) using a fast algorithm, the multi-point range query (MPRQ) algorithm. This approach works well because multiple sequence similarity queries can be answered, and their results returned, with just one MPRQ query, which saves substantial query time. We applied our method to DNA and protein sequences in databases, and the results show that our algorithm is efficient in time and that its accuracy is satisfactory.
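The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: sequences are encoded as normalised k-mer frequency vectors, a small hand-written SOM maps each vector to the 2-d grid position of its best-matching node, a plain dictionary of occupied grid cells stands in for the spatial index, and `multi_point_range_query` is a simple linear scan that returns every sequence within a given radius of any query point. The MPRQ algorithm itself is not reproduced here; all function names are hypothetical.

```python
import itertools
import math
import random
from collections import defaultdict


def kmer_vector(seq, k=2, alphabet="ACGT"):
    """Normalised k-mer frequency vector (an assumed encoding, not the paper's)."""
    kmers = ["".join(p) for p in itertools.product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values()) or 1  # avoid division by zero for short input
    return [counts[m] / total for m in kmers]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences of numbers."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_som(vectors, rows=4, cols=4, epochs=50, seed=0):
    """Train a tiny SOM; returns a dict mapping grid node (row, col) to its weights."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1 - epoch / epochs                # decays from 1 towards 0
        lr = 0.5 * frac + 0.01                   # learning rate
        sigma = max(rows, cols) / 2 * frac + 0.5  # neighbourhood width
        for v in vectors:
            bmu = min(weights, key=lambda n: sq_dist(weights[n], v))
            for node, w in weights.items():
                h = lr * math.exp(-sq_dist(node, bmu) / (2 * sigma ** 2))
                for i in range(dim):
                    w[i] += h * (v[i] - w[i])
    return weights


def project(weights, vector):
    """Map a vector to the 2-d grid position of its best-matching SOM node."""
    return min(weights, key=lambda n: sq_dist(weights[n], vector))


def build_index(sequences, weights):
    """Stand-in spatial index: 2-d point -> ids of the sequences mapped there."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        index[project(weights, kmer_vector(seq))].append(seq_id)
    return index


def multi_point_range_query(index, query_points, radius):
    """Ids of all sequences within `radius` of ANY query point, in one scan."""
    hits = set()
    for point, ids in index.items():
        if any(sq_dist(point, q) <= radius ** 2 for q in query_points):
            hits.update(ids)
    return hits


if __name__ == "__main__":
    database = {"s1": "ACGTACGTACGT", "s2": "AAAAAAAAAAAA", "s3": "GGGGCCCCGGGG"}
    som = train_som([kmer_vector(s) for s in database.values()])
    idx = build_index(database, som)
    queries = [project(som, kmer_vector(q)) for q in ("ACGTACGAACGT", "AAAAAAAAAAAT")]
    print(multi_point_range_query(idx, queries, radius=1.0))
```

The point of the sketch is the last function: several query sequences are projected to 2-d points and answered together in one pass over the index, rather than with one range query per sequence. A real system would replace the dictionary scan with a proper spatial index.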
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Ng, H.K., Ning, K., Leong, H.W. (2007). A New Approach for Similarity Queries of Biological Sequences in Databases. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_79
DOI: https://doi.org/10.1007/978-3-540-71701-0_79
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71700-3
Online ISBN: 978-3-540-71701-0
eBook Packages: Computer Science, Computer Science (R0)