
A New Approach for Similarity Queries of Biological Sequences in Databases

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4426)

Abstract

As biological databases grow larger, querying the biological sequences in these databases effectively has become an increasingly important issue for researchers. Few systems currently offer fast access to very large biological sequences. In this paper, we propose a new approach to similarity querying of biological sequences in databases. The general idea is to first transform the biological sequences into vectors and then map them onto 2-D points in planes; next, to index these points with a spatial index using self-organizing maps (SOM); and finally to perform a single efficient similarity query (with multiple simultaneous input sequences) using a fast algorithm, the multi-point range query (MPRQ) algorithm. This approach works well because multiple sequence similarity queries can be performed, and their results returned, with just one MPRQ query, which yields large savings in query time. We applied our method to DNA and protein sequences in databases; the results show that our algorithm is efficient in time and that its accuracy is satisfactory.
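
The pipeline described above can be illustrated with a short Python sketch. It is a toy under stated assumptions, not the authors' implementation. The abstract does not say how sequences are turned into vectors, so the k-mer frequency encoding in `kmer_vector` is an assumption. `train_som` is a deliberately tiny self-organizing map, and `multi_point_range_query` is a brute-force stand-in that scans a dictionary of map cells once for all query points; it is not the MPRQ algorithm itself. All function names and parameters are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict
from itertools import product


def kmer_vector(seq, k=2, alphabet="ACGT"):
    """Normalized k-mer frequency vector (an assumed encoding, not from the paper)."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(1, sum(counts[m] for m in kmers))
    return [counts[m] / total for m in kmers]


def best_unit(som, vec):
    """Return the 2-D grid cell whose weight vector is closest to vec."""
    return min(som, key=lambda cell: math.dist(som[cell], vec))


def train_som(vectors, rows=4, cols=4, epochs=30, seed=0):
    """Train a very small SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    som = {(r, c): [rng.random() for _ in range(dim)]
           for r in range(rows) for c in range(cols)}
    for t in range(epochs):
        decay = 1 - t / epochs
        rate = 0.5 * decay
        radius = max(1.0, max(rows, cols) / 2 * decay)
        for vec in vectors:
            bmu = best_unit(som, vec)
            for cell, weights in som.items():
                d = math.dist(cell, bmu)
                if d <= radius:
                    # Pull nearby units toward the input, weighted by grid distance.
                    h = math.exp(-d * d / (2 * radius * radius))
                    for i in range(dim):
                        weights[i] += rate * h * (vec[i] - weights[i])
    return som


def build_index(som, named_vectors):
    """Bucket sequence names by the 2-D cell they map to (stand-in for a spatial index)."""
    index = defaultdict(list)
    for name, vec in named_vectors.items():
        index[best_unit(som, vec)].append(name)
    return index


def multi_point_range_query(index, query_cells, radius):
    """One pass over the index: names within `radius` of ANY query cell."""
    hits = set()
    for cell, names in index.items():
        if any(math.dist(cell, q) <= radius for q in query_cells):
            hits.update(names)
    return hits
```

To use the sketch, encode every database sequence with `kmer_vector`, train the map on those vectors, build the index, and pass the map cells of all query sequences to `multi_point_range_query` in a single call. The point of the design is that several query sequences share one traversal of the index instead of triggering one traversal each.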



Editor information

Zhi-Hua Zhou, Hang Li, Qiang Yang

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Ng, H.K., Ning, K., Leong, H.W. (2007). A New Approach for Similarity Queries of Biological Sequences in Databases. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_79

  • DOI: https://doi.org/10.1007/978-3-540-71701-0_79

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71700-3

  • Online ISBN: 978-3-540-71701-0

  • eBook Packages: Computer Science, Computer Science (R0)
