A hash trie filter method for approximate string matching in genomic databases

Applied Intelligence

Abstract

In genomic databases, approximate string matching with k errors is often used to search genomic sequences, where each error is a substitution, insertion, or deletion. In this paper, we propose a new method, the hash trie filter, to efficiently support approximate string matching in genomic databases. First, we build a hash trie in advance to index the genomic sequence stored in the database. Then, we use an efficient technique to find the ordered subpatterns in the sequence, which reduces the number of candidates by pruning unreasonable matching positions. Moreover, our method dynamically decides the number of ordered matching grams, which increases precision. Simulation results show that the hash trie filter outperforms the well-known (k+s) q-samples filter in response time, number of verified candidates, and precision, across different query pattern lengths and error levels.
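
To make the filter-and-verify idea behind k-errors matching concrete, here is a minimal Python sketch. It is not the hash trie filter described in the abstract, and it is not the (k+s) q-samples filter either: it assumes a plain dictionary of q-grams as a stand-in index, uses the classic pigeonhole argument (a pattern cut into k+1 disjoint pieces must keep at least one piece intact under k errors) as the filter, and verifies each candidate window with Sellers' dynamic programming. All function names and the example sequence are hypothetical.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every q-gram of the text to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def best_match_distance(pattern, window):
    """Sellers' DP: smallest edit distance between the pattern and any
    substring of the window (substitutions, insertions, deletions)."""
    m = len(pattern)
    col = list(range(m + 1))      # column for the empty text prefix
    best = col[m]
    for ch in window:
        prev_diag = col[0]
        col[0] = 0                # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(prev_diag + (pattern[i - 1] != ch),  # match/substitute
                         old + 1,                             # extra text char
                         col[i - 1] + 1)                      # missing text char
            prev_diag = old
        best = min(best, col[m])
    return best


def search(text, pattern, k, index, q):
    """Return the sorted (start, end) text windows that contain an
    occurrence of the pattern with at most k errors."""
    m = len(pattern)
    if q * (k + 1) > m:
        raise ValueError("need k + 1 disjoint pieces of length q in the pattern")
    windows = set()
    # Filter: any occurrence with <= k errors keeps one of the pieces exact.
    for piece in range(k + 1):
        offset = piece * q
        for pos in index.get(pattern[offset:offset + q], ()):
            start = max(0, pos - offset - k)
            end = min(len(text), pos - offset + m + k)
            windows.add((start, end))
    # Verify: run the DP only inside the surviving candidate windows.
    return sorted(w for w in windows
                  if best_match_distance(pattern, text[w[0]:w[1]]) <= k)
```

For example, with `text = "GATTACAGATTTCA"` and `index = build_qgram_index(text, 3)`, calling `search(text, "GATTACA", 1, index, 3)` verifies only the windows around exact hits of the pieces `GAT` and `TAC` instead of scanning the whole sequence. The fewer windows the filter lets through, the less verification work remains, which is the quantity the abstract reports as the number of verified candidates.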

Author information

Corresponding author

Correspondence to Jiun-Rung Chen.

About this article

Cite this article

Chang, YI., Chen, JR. & Hsu, MT. A hash trie filter method for approximate string matching in genomic databases. Appl Intell 33, 21–38 (2010). https://doi.org/10.1007/s10489-010-0233-4

  • DOI: https://doi.org/10.1007/s10489-010-0233-4
