Multi-pattern matching with variable-length wildcards using suffix tree

  • Industrial and Commercial Application
Pattern Analysis and Applications

Abstract

Multi-pattern matching with variable-length wildcards is an important problem in bioinformatics, information retrieval and other domains. Most previously developed multi-pattern matching methods, such as the well-known Aho–Corasick and Wu–Manber algorithms, were designed for classical string matching problems and are not efficient for patterns with flexible wildcards or do-not-care characters. In this paper, we propose two efficient suffix-tree-based algorithms for multi-pattern matching with variable-length wildcards, MMST-L and MMST-S, named according to the length of the exact characters in a pattern. Experimental results show that, in most cases, the two MMST algorithms outperform the various versions of the algorithms they are compared against.
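
To make the problem statement concrete, the following is a minimal Python sketch of what matching a set of patterns with variable-length wildcards means. It is a naive regular-expression baseline, not the MMST-L or MMST-S algorithm, and it uses no suffix tree. The pattern representation (a list of exact segments plus one `(min, max)` gap between each pair of neighbouring segments) and the names `compile_pattern` and `match_all` are assumptions made for illustration only.

```python
import re

def compile_pattern(segments, gaps):
    """Build a regex from exact segments and (min, max) wildcard gaps.

    segments: exact substrings, e.g. ["AC", "GT"]   (hypothetical format)
    gaps:     one (min, max) pair between each two neighbouring segments
    """
    if len(gaps) != len(segments) - 1:
        raise ValueError("need exactly one gap between each pair of segments")
    parts = [re.escape(segments[0])]
    for (lo, hi), seg in zip(gaps, segments[1:]):
        parts.append(".{%d,%d}" % (lo, hi))  # lo..hi do-not-care characters
        parts.append(re.escape(seg))
    # A lookahead reports every start position, including overlapping ones.
    return re.compile("(?=" + "".join(parts) + ")", re.DOTALL)

def match_all(text, patterns):
    """Return {pattern name: [start positions]} for every pattern in the set."""
    return {
        name: [m.start() for m in compile_pattern(segs, gaps).finditer(text)]
        for name, (segs, gaps) in patterns.items()
    }

if __name__ == "__main__":
    patterns = {
        "p1": (["AC", "GT"], [(0, 2)]),  # "AC", then 0 to 2 wildcards, then "GT"
        "p2": (["T", "T"], [(1, 1)]),    # "T", exactly one wildcard, "T"
    }
    print(match_all("ACGTACTTGT", patterns))
    # {'p1': [0, 4], 'p2': [7]}
```

This baseline scans the text once per pattern and reports only start positions, not every distinct choice of gap lengths. It is useful as a reference for checking the output of a faster method on small inputs.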

Acknowledgements

This research is supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000900) and the National Natural Science Foundation of China (NSFC) (Grant Nos. 61503116 and 61229301).

Author information

Corresponding author

Correspondence to Na Liu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOCX 26 kb)

About this article

Cite this article

Liu, N., Xie, F. & Wu, X. Multi-pattern matching with variable-length wildcards using suffix tree. Pattern Anal Applic 21, 1151–1165 (2018). https://doi.org/10.1007/s10044-018-0733-0

  • DOI: https://doi.org/10.1007/s10044-018-0733-0
