Abstract
Multi-pattern matching with variable-length wildcards is an important problem in bioinformatics, information retrieval, and other domains. Most previously developed multi-pattern matching methods, such as the well-known Aho–Corasick and Wu–Manber algorithms, target classical string matching problems and are not efficient for patterns with flexible wildcards or don't-care characters. In this paper, we propose two efficient suffix-tree-based algorithms for multi-pattern matching with variable-length wildcards, called MMST-L and MMST-S according to the length of the exact characters in a pattern. Experimental results show that, in most cases, the two MMST algorithms outperform the algorithms they are compared against.
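To make the problem statement concrete, the following is a minimal brute-force sketch of multi-pattern matching with variable-length wildcards. It is not MMST-L or MMST-S and uses no suffix tree; it only illustrates what an occurrence is. The pattern representation (a list of exact segments plus a `(min_gap, max_gap)` range between consecutive segments) and the names `match_starts` and `match_all` are assumptions introduced here for illustration.

```python
from typing import Dict, List, Tuple

# Assumed representation (not taken from the paper): a pattern is a list of
# non-empty exact segments and, between consecutive segments, a
# (min_gap, max_gap) range giving how many wildcard characters may be skipped.
Pattern = Tuple[List[str], List[Tuple[int, int]]]


def match_starts(text: str, pattern: Pattern) -> List[int]:
    """Return every text index at which some occurrence of the pattern begins."""
    segments, gaps = pattern
    if len(gaps) != len(segments) - 1:
        raise ValueError("need exactly one gap range between consecutive segments")

    def matches_from(pos: int, k: int) -> bool:
        # Segment k must appear exactly at position pos.
        if not text.startswith(segments[k], pos):
            return False
        end = pos + len(segments[k])
        if k == len(segments) - 1:
            return True
        lo, hi = gaps[k]
        # Try every allowed wildcard length before the next segment.
        return any(matches_from(end + g, k + 1) for g in range(lo, hi + 1))

    return [i for i in range(len(text)) if matches_from(i, 0)]


def match_all(text: str, patterns: Dict[str, Pattern]) -> Dict[str, List[int]]:
    """Naive multi-pattern search: run the single-pattern check per pattern."""
    return {name: match_starts(text, p) for name, p in patterns.items()}


if __name__ == "__main__":
    patterns = {
        "AC[1,2]T": (["AC", "T"], [(1, 2)]),  # "AC", then 1 to 2 wildcards, then "T"
        "G[0,1]T": (["G", "T"], [(0, 1)]),    # "G", then 0 to 1 wildcards, then "T"
    }
    print(match_all("ACGTTACGGT", patterns))
```

This baseline rescans the text once per pattern and tries every gap length at every position, so its cost grows with the number of patterns and the width of the gap ranges. That is the kind of overhead that index-based approaches aim to reduce.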
Acknowledgements
This research is supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000900) and the National Natural Science Foundation of China (NSFC) (Grant Nos. 61503116 and 61229301).
Cite this article
Liu, N., Xie, F. & Wu, X. Multi-pattern matching with variable-length wildcards using suffix tree. Pattern Anal Applic 21, 1151–1165 (2018). https://doi.org/10.1007/s10044-018-0733-0
DOI: https://doi.org/10.1007/s10044-018-0733-0