Abstract
Growing interest in genomic research has resulted in the creation of huge biological sequence databases. In this paper, we present a hash-based pier model for efficient homology search in large DNA sequence databases. In our model, only certain segments in the databases, called 'piers', need to be accessed during searches, as opposed to other approaches, which require a full scan of the biological sequence database. To further improve search efficiency, the piers are stored in a specially designed hash table, which helps to avoid expensive alignment operations. The hash table is small enough to reside in main memory, hence avoiding I/O in the search steps. We show theoretically and empirically that the proposed approach can efficiently detect biological sequences that are similar to a query sequence with very high sensitivity.
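The abstract does not say how piers are chosen or how the hash table is laid out, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes that piers are fixed-length substrings sampled at regular intervals and that lookup is by exact match; the names `build_pier_index` and `candidate_hits` and the parameters `pier_len` and `spacing` are hypothetical.

```python
from collections import defaultdict


def build_pier_index(sequences, pier_len=4, spacing=4):
    """Index only sampled segments ("piers") of each database sequence.

    Assumption: a pier is a fixed-length substring taken every `spacing`
    bases. The paper's actual pier selection may differ.
    """
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(0, len(seq) - pier_len + 1, spacing):
            pier = seq[pos:pos + pier_len]
            index[pier].append((seq_id, pos))
    return index


def candidate_hits(index, query, pier_len=4):
    """Slide a window over the query and look each window up in the table.

    Returns (seq_id, pier_position, query_position) triples. Only these
    candidate regions would need closer inspection, so the database
    itself is never scanned in full.
    """
    hits = set()
    for q_pos in range(len(query) - pier_len + 1):
        window = query[q_pos:q_pos + pier_len]
        for seq_id, pos in index.get(window, ()):
            hits.add((seq_id, pos, q_pos))
    return hits


if __name__ == "__main__":
    db = {"s1": "ACGTACGTTTGACCA", "s2": "GGGGCCCCAAAATTTT"}
    idx = build_pier_index(db, pier_len=4, spacing=4)
    print(candidate_hits(idx, "CCTTGACC", pier_len=4))
```

Because only one segment per `spacing` bases is stored, the table in this sketch is a fraction of the database size, which is the property that would let it stay in main memory. Exact-match lookup is a simplification: a homology search would also need to tolerate mismatches between the query and a pier.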