A hash trie filter method for approximate string matching in genomic databases

Chang, Ye-In; Chen, Jiun-Rung; Hsu, Min-Tze

doi:10.1007/s10489-010-0233-4

Published: 19 May 2010

Applied Intelligence, Volume 33, pages 21–38 (2010)

Abstract

In genomic databases, approximate string matching with k errors is often applied when searching genomic sequences, where the k errors may be substitutions, insertions, or deletions. In this paper, we propose a new method, the hash trie filter, to efficiently support approximate string matching in genomic databases. First, we build a hash trie in advance to index the genomic sequence stored in the database. Then, we use an efficient technique to find the ordered subpatterns in the sequence, which reduces the number of candidates by pruning matching positions that cannot lead to a valid match. Moreover, our method dynamically decides the number of ordered matching grams, which increases the precision. The simulation results show that the hash trie filter outperforms the well-known (k+s) q-samples filter in terms of response time, number of verified candidates, and precision, for different query pattern lengths and different error levels.
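
To make the filter-then-verify idea concrete, here is a minimal Python sketch of a generic q-gram filter. It is not the hash trie filter proposed in the paper and not the (k+s) q-samples filter; it swaps in the simpler, classical pigeonhole argument: if k+1 disjoint q-grams are taken from the pattern, at most k of them can be damaged by k errors, so every true match contains at least one of them exactly. A plain dictionary stands in for the index, and candidates are verified with the standard dynamic-programming recurrence for edit distance. All function names and parameters (build_qgram_index, min_errors_in_window, approx_search, q) are illustrative assumptions.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every q-gram of text to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def min_errors_in_window(pattern, window):
    """Smallest edit distance between pattern and any substring of window."""
    prev = list(range(len(pattern) + 1))  # column for an empty window
    best = prev[-1]
    for ch in window:
        cur = [0]  # a match may start at any position of the window
        for i, pc in enumerate(pattern, 1):
            cur.append(min(prev[i - 1] + (pc != ch),  # match or substitution
                           prev[i] + 1,               # extra character in window
                           cur[i - 1] + 1))           # missing pattern character
        best = min(best, cur[-1])
        prev = cur
    return best


def approx_search(text, pattern, k, q):
    """Return sorted (start, end) windows of text that contain a match
    of pattern with at most k errors (assumed filter, see lead-in)."""
    m = len(pattern)
    if m < (k + 1) * q:
        raise ValueError("pattern too short for k+1 disjoint q-grams")
    index = build_qgram_index(text, q)
    candidates = set()
    # Filtering step: look up k+1 disjoint q-grams of the pattern.
    for j in range(k + 1):
        offset = j * q
        piece = pattern[offset:offset + q]
        for pos in index.get(piece, ()):
            # Any match containing this hit lies inside this window.
            start = max(0, pos - offset - k)
            end = min(len(text), pos - offset + m + k)
            candidates.add((start, end))
    # Verification step: keep only windows that really contain a match.
    return sorted(w for w in candidates
                  if min_errors_in_window(pattern, text[w[0]:w[1]]) <= k)
```

For example, calling approx_search("GGGGACGTTTGCAGGGG", "ACGTATGCA", 1, 3) yields a single window around the one-substitution occurrence ACGTTTGCA, while the same call with k = 0 yields no window. The sketch rebuilds the index on every query for brevity; a database setting like the one described in the abstract would build it once in advance.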

Author information

Authors and Affiliations

Dept. of Computer Science and Engineering, National Sun Yat-Sen University, No. 70, Lienhai Rd., Kaohsiung, 80424, Taiwan
Ye-In Chang, Jiun-Rung Chen & Min-Tze Hsu

Corresponding author

Correspondence to Jiun-Rung Chen.

About this article

Cite this article

Chang, YI., Chen, JR. & Hsu, MT. A hash trie filter method for approximate string matching in genomic databases. Appl Intell 33, 21–38 (2010). https://doi.org/10.1007/s10489-010-0233-4

Issue Date: August 2010
DOI: https://doi.org/10.1007/s10489-010-0233-4
