article

Piers: an efficient model for similarity search in DNA sequence databases

Authors:
Xia Cao, National University of Singapore, Singapore
Shuai Cheng Li, National University of Singapore, Singapore
Beng Chin Ooi, National University of Singapore, Singapore
Anthony K. H. Tung, National University of Singapore, Singapore

ACM SIGMOD Record, Volume 33, Issue 2, June 2004, pp 39–44, https://doi.org/10.1145/1024694.1024701

Published: 01 June 2004


Abstract

Growing interest in genomic research has led to the creation of huge biological sequence databases. In this paper, we present a hash-based pier model for efficient homology search in large DNA sequence databases. In our model, only certain segments of the database, called 'piers', need to be accessed during a search, as opposed to other approaches, which require a full scan of the biological sequence database. To further improve search efficiency, the piers are stored in a specially designed hash table, which helps to avoid expensive alignment operations. The hash table is small enough to reside in main memory, hence avoiding I/O during the search steps. We show theoretically and empirically that the proposed approach can efficiently detect biological sequences that are similar to a query sequence with very high sensitivity.
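
The abstract does not say how piers are selected or how the hash table is laid out, so the following is only a rough illustration of the general idea of probing an in-memory hash table of sampled database segments instead of scanning every sequence. It is not the authors' method. It assumes, purely for illustration, that a pier is a fixed-length substring taken at a regular stride; the names `build_pier_index`, `candidate_hits`, `pier_len` and `stride` are hypothetical.

```python
from collections import defaultdict


def build_pier_index(sequences, pier_len=4, stride=4):
    """Map each sampled segment ("pier") to the places it occurs.

    Assumption: piers are fixed-length substrings taken every `stride`
    positions. The abstract does not define how piers are chosen.
    """
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(0, len(seq) - pier_len + 1, stride):
            index[seq[pos:pos + pier_len]].append((seq_id, pos))
    return index


def candidate_hits(index, query, pier_len=4):
    """Look up every length-`pier_len` window of the query in the index.

    Returns (seq_id, db_pos, query_pos) triples. These are only candidate
    anchors; a real search would still have to extend and score them.
    """
    hits = []
    for qpos in range(len(query) - pier_len + 1):
        window = query[qpos:qpos + pier_len]
        # .get() avoids creating empty entries in the defaultdict.
        for seq_id, dpos in index.get(window, ()):
            hits.append((seq_id, dpos, qpos))
    return hits


# Toy database: only the sampled piers are stored, not the full sequences.
db = {"s1": "ACGTACGTTTGA", "s2": "GGGGCCCCAAAA"}
index = build_pier_index(db)
hits = candidate_hits(index, "TTACGTAA")
```

In this toy setup the query is compared only against the stored piers, so the cost of a lookup depends on the number of query windows rather than on the total database length. The sensitivity of such a scheme would depend on how piers are chosen, which is the part this sketch leaves as an assumption.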



Published in

ACM SIGMOD Record, Volume 33, Issue 2
June 2004
126 pages
ISSN: 0163-5808
DOI: 10.1145/1024694

Copyright © 2004 Authors

Publisher
Association for Computing Machinery
New York, NY, United States