Skip to main content
Log in

An efficient pruning strategy for approximate string matching over suffix tree

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Approximate string matching over suffix tree with depth-first search (ASM_ST_DFS), a classical algorithm in the field of approximate string matching, was originally proposed by Ricardo A. Baeza-Yates and Gaston H. Gonnet in 1990. The algorithm is one of the most excellent algorithms for approximate string matching if combined with other indexing techniques. However, its time complexity is sensitive to the length of pattern string because it searches \(m+k\) characters on each path from the root before backtracking. In this paper, we propose an efficient pruning strategy to solve this problem. We prove its correctness and efficiency in theory. Particularly, we proved that if the pruning strategy is adopted, it averagely searches O(k) characters on each path before backtracking instead of O(m). Considering each internal node of suffix tree has multiple branches, the pruning strategy should work very well. We also experimentally show that when k is much smaller than m, the efficiency improves hundreds of times, and when k is not much smaller than m, it is still several times faster. This is the first paper that tries to solve the backtracking problem of ASM_ST_DFS in both theory and practice.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3

Similar content being viewed by others

References

  1. Baeza-Yates RA, Gonnet GH (1999) A fast algorithm on average for all-against-all sequence matching. In: SPIRE/CRIWG. pp 16–23

  2. Baeza-Yates RA, Navarro G (1999) Faster approximate string matching. Algorithmica 23(2):127–158

    Article  MathSciNet  MATH  Google Scholar 

  3. Boytsov L (2011) Indexing methods for approximate dictionary searching: comparative analysis. ACM J Exp Algorithmics 16:1.1:1.1–1.1:1.91. doi:10.1145/1963190.1963191

  4. Bunke H, Bühler U (1993) Applications of approximate string matching to 2d shape recognition. Pattern Recognit 26(12):1797–1812

    Article  Google Scholar 

  5. Chang WI, Lampe J (1992) Theoretical and empirical comparisons of approximate string matching algorithms. In: Combinatorial pattern matching, third annual symposium, CPM 92, Tucson, Arizona, USA, April 29–May 1 1992, Proceedings, pp 175–184

  6. Danek A, Deorowicz S, Grabowski S (2014) Indexing large genome collections on a PC. In: CoRR. arXiv:1403.7481

  7. French JC, Powell AL, Schulman E (1997) Applications of approximate word matching in information retrieval. In: Proceedings of the sixth international conference on information and knowledge management (CIKM’97), Las Vegas, Nevada, November 10–14 1997, pp 9–15

  8. Hamming RW (1950) Error detecting and error correcting codes. Bell Syst Tech J 29(2):147–160

    Article  MathSciNet  Google Scholar 

  9. Hyyrö H, Fredriksson K, Navarro G (2005) Increased bit-parallelism for approximate and multiple string matching. J Exp Algorithmics 10:2.6. doi:10.1145/1064546.1180617

  10. Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10:707–710

    MathSciNet  MATH  Google Scholar 

  11. Mitra S, Acharya T, Luo J (2006) Data mining: multimedia, soft computing, and bioinformatics. J. Electron Imaging 15(1):019901

    Article  Google Scholar 

  12. Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88

    Article  Google Scholar 

  13. Navarro G, Baeza-Yates R (2000) A hybrid indexing method for approximate string matching. J Discrete Algorithms 1(1):205–239

    MathSciNet  Google Scholar 

  14. Navarro G, Raffinot M (2002) Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge

    Book  MATH  Google Scholar 

  15. Sellers PH (1980) The theory and computation of evolutionary distances: pattern recognition. J Algorithms 1(4):359–373

    Article  MathSciNet  MATH  Google Scholar 

  16. Typke R, Wiering F, Veltkamp RC (2005) A survey of music information retrieval systems. In: ISMIR 2005, 6th international conference on music information retrieval, London, UK, 11–15 Sept 2005, Proceedings, pp 153–160

  17. Ukkonen E (1985) Finding approximate patterns in strings. J Algorithms 6(1):132–137

    Article  MathSciNet  MATH  Google Scholar 

  18. Ukkonen E (1993) Approximate string-matching over suffix trees. In: Combinatorial pattern matching, 4th annual symposium, CPM 93, Padova, Italy, 2–4 June 1993, Proceedings, pp 228–242

  19. Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249–260

    Article  MathSciNet  MATH  Google Scholar 

  20. Wandelt S, Starlinger J, Bux M, Leser U (2013) RCSI: scalable similarity search in thousand(s) of genomes. PVLDB 6(13):1534–1545

    Google Scholar 

  21. Wu S, Manber U (1992) Fast text searching allowing errors. Commun ACM 35(10):83–91

    Article  Google Scholar 

  22. Yang X, Wang B, Li C, Wang J, Xie X (2013) Efficient direct search on compressed genomic data. In: 29th IEEE international conference on data engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013, pp 961–972

  23. Zhang Q, Chamberlain RD, Indeck RS, West BM, White J (2004) Massively parallel data mining using reconfigurable hardware: approximate string matching. In: 18th International parallel and distributed processing symposium (IPDPS 2004) CD-ROM/Abstracts proceedings, 26–30 April 2004. Santa Fe, New Mexico, USA

  24. Zobel J, Dart PW (1996) Phonetic string matching: lessons from information retrieval. In: Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR’96, 18–22 August 1996, Zurich, Switzerland (Special Issue of the SIGIR Forum), pp 166–172

Download references

Acknowledgments

This paper was partially supported by NGFR 973 Grant 2012CB316200, NSFC Grant 61472099, 61133002 and National Sci-Tech Support Plan 2015BAH10F00. Partial Experiments are conducted on the Xing cloud of the DEKE Lab, Renmin University of China.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hongzhi Wang.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Hu, H., Wang, H., Li, J. et al. An efficient pruning strategy for approximate string matching over suffix tree. Knowl Inf Syst 49, 121–141 (2016). https://doi.org/10.1007/s10115-015-0896-6

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-015-0896-6

Keywords

Navigation