Abstract
Approximate string matching with k errors is widely used for searching genomic sequences in genomic databases, where each error is a substitution, insertion, or deletion. In this paper, we propose a new method, the hash trie filter, to support approximate string matching in genomic databases efficiently. First, we build a hash trie in advance to index the genomic sequence stored in the database. Then, we use an efficient technique to find the ordered subpatterns in the sequence, which reduces the number of candidates by pruning implausible matching positions. Moreover, our method dynamically decides the number of ordered matching grams, which increases precision. Simulation results show that the hash trie filter outperforms the well-known (k+s) q-samples filter in response time, number of verified candidates, and precision, across different query pattern lengths and error levels.
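To make the filter-then-verify idea concrete, the following is a minimal sketch of a generic q-gram partition filter. It is not the hash trie filter proposed in the paper, whose details the abstract does not give. It rests on a standard pigeonhole argument: if a pattern is cut into k+1 disjoint pieces, any occurrence with at most k errors must contain at least one piece unchanged. A plain dictionary stands in for the index, and all function names (`build_qgram_index`, `best_match_errors`, `filter_and_verify`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every q-gram of text to the list of its start positions (built once, in advance)."""
    index = defaultdict(list)
    for pos in range(len(text) - q + 1):
        index[text[pos:pos + q]].append(pos)
    return index


def best_match_errors(pattern, window):
    """Smallest edit distance between pattern and any substring of window (Sellers-style DP)."""
    prev = [0] * (len(window) + 1)  # row 0 is all zeros: a match may start anywhere
    for i, pc in enumerate(pattern, start=1):
        cur = [i] + [0] * len(window)
        for j, wc in enumerate(window, start=1):
            cur[j] = min(prev[j - 1] + (pc != wc),  # match or substitution
                         prev[j] + 1,               # pattern character left unmatched
                         cur[j - 1] + 1)            # extra character in the text
        prev = cur
    return min(prev)  # a match may also end anywhere


def filter_and_verify(text, pattern, k, index, q):
    """Return (candidate windows, verified windows) for matches with at most k errors."""
    m = len(pattern)
    if q * (k + 1) > m:
        raise ValueError("pattern too short for k+1 disjoint q-grams")
    candidates = set()
    for piece in range(k + 1):
        offset = piece * q
        # Filtering: only positions where some piece occurs exactly can host a match.
        for pos in index.get(pattern[offset:offset + q], ()):
            start = max(0, pos - offset - k)
            end = min(len(text), pos - offset + m + k)
            candidates.add((start, end))
    # Verification: run the expensive DP only on the surviving windows.
    hits = sorted(w for w in candidates
                  if best_match_errors(pattern, text[w[0]:w[1]]) <= k)
    return candidates, hits


text = "GATTACAGATTTCA"
index = build_qgram_index(text, q=3)
candidates, hits = filter_and_verify(text, "GATTACA", k=1, index=index, q=3)
print(len(candidates), hits)
```

The size of `candidates` corresponds to the "number of verified candidates" measure mentioned above: the fewer windows that survive filtering, the less dynamic programming is needed. This sketch ignores the relative order of matching pieces, which is the kind of information the abstract says the proposed method exploits for extra pruning.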
Cite this article
Chang, YI., Chen, JR. & Hsu, MT. A hash trie filter method for approximate string matching in genomic databases. Appl Intell 33, 21–38 (2010). https://doi.org/10.1007/s10489-010-0233-4