Piers: an efficient model for similarity search in DNA sequence databases

Published: 1 June 2004

Abstract

Growing interest in genomic research has led to the creation of huge biological sequence databases. In this paper, we present a hash-based pier model for efficient homology search in large DNA sequence databases. In our model, only certain segments of the database, called 'piers', need to be accessed during a search, as opposed to other approaches, which require a full scan of the biological sequence database. To further improve search efficiency, the piers are stored in a specially designed hash table, which helps to avoid expensive alignment operations. The hash table is small enough to reside in main memory, hence avoiding I/O during the search steps. We show theoretically and empirically that the proposed approach can efficiently detect biological sequences that are similar to a query sequence with very high sensitivity.
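
The abstract does not say how piers are chosen or how the hash table is organised, so the following is only a rough sketch of the general idea (index a subset of database segments in an in-memory hash table and probe it with windows of the query), not the paper's method. The names `build_pier_index` and `candidate_hits`, the fixed segment length `pier_len`, the regular sampling `stride`, and the use of exact matching are all assumptions made for illustration.

```python
from collections import defaultdict


def build_pier_index(sequences, pier_len=4, stride=4):
    """Index sampled segments of each database sequence.

    Assumption: a "pier" is taken here to be a fixed-length segment
    sampled every `stride` positions; the paper may select piers differently.
    """
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for start in range(0, len(seq) - pier_len + 1, stride):
            index[seq[start:start + pier_len]].append((seq_id, start))
    return index


def candidate_hits(index, query, pier_len=4):
    """Count, per database sequence, how many indexed segments match a query window.

    Only the in-memory index is consulted; the database itself is not scanned.
    Assumption: exact matching, whereas real homology search tolerates mismatches.
    """
    hits = defaultdict(int)
    for q in range(len(query) - pier_len + 1):
        window = query[q:q + pier_len]
        for seq_id, _offset in index.get(window, ()):
            hits[seq_id] += 1
    return dict(hits)
```

For example, with a toy database `{"s1": "ACGTACGTTTGA", "s2": "GGGGCCCCAAAA"}`, calling `candidate_hits(build_pier_index(db), "TTACGTAA")` reports only `s1` as a candidate, which a full alignment step could then verify.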

Published in

ACM SIGMOD Record, Volume 33, Issue 2, June 2004, 126 pages
ISSN: 0163-5808
DOI: 10.1145/1024694

Copyright © 2004 Authors

Publisher: Association for Computing Machinery, New York, NY, United States