Abstract
This paper defines a challenging pattern matching problem between a pattern P and a text T, with wildcards and length constraints, and designs an efficient algorithm that returns each pattern occurrence in an online manner. In this problem, the user can specify constraints on the number of wildcards between every two consecutive letters of P and constraints on the length of each matching substring in T. We design a complete algorithm, SAIL, that returns each matching substring of P in T as soon as it appears in T, in O(n + klmg) time with O(lm) space overhead, where n is the length of T, k is the frequency of P's last letter in T, l is the user-specified maximum length of each matching substring, m is the length of P, and g is the maximum difference between the user-specified maximum and minimum numbers of wildcards allowed between two consecutive letters in P.
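To make the problem statement concrete, the following is a minimal brute-force sketch of the matching problem as the abstract describes it. It is not the SAIL algorithm and does not achieve its time or space bounds. The function name `occurrences`, the representation of the wildcard constraints as one `(lo, hi)` pair per pair of consecutive pattern letters, and the `min_len`/`max_len` parameters are assumptions made for illustration; the sketch also assumes that a wildcard matches any single character of T.

```python
from typing import Iterator, List, Sequence, Tuple


def occurrences(
    text: str,
    pattern: str,
    gaps: Sequence[Tuple[int, int]],
    min_len: int,
    max_len: int,
) -> Iterator[List[int]]:
    """Yield each occurrence as the list of 0-based text positions it uses.

    gaps[i] = (lo, hi) bounds the number of wildcard characters allowed
    between pattern[i] and pattern[i + 1] (hypothetical representation).
    min_len and max_len bound the length of the matching substring,
    measured from the first matched letter to the last, inclusive.
    """
    if not pattern:
        return
    if len(gaps) != len(pattern) - 1:
        raise ValueError("need one gap constraint per consecutive letter pair")

    def extend(positions: List[int]) -> Iterator[List[int]]:
        i = len(positions)  # index of the next pattern letter to place
        if i == len(pattern):
            span = positions[-1] - positions[0] + 1
            if min_len <= span <= max_len:
                yield list(positions)
            return
        lo, hi = gaps[i - 1]
        prev = positions[-1]
        # The next letter sits lo..hi wildcard characters after the previous one.
        last = min(prev + hi + 1, len(text) - 1)
        for nxt in range(prev + lo + 1, last + 1):
            if text[nxt] == pattern[i]:
                positions.append(nxt)
                yield from extend(positions)
                positions.pop()

    # Try every text position that holds the first pattern letter.
    for start, ch in enumerate(text):
        if ch == pattern[0]:
            yield from extend([start])


if __name__ == "__main__":
    # Pattern "ab" with 0 to 2 wildcards between 'a' and 'b', length 2 to 4.
    for occ in occurrences("abcab", "ab", [(0, 2)], 2, 4):
        print(occ)
```

This enumeration groups occurrences by their starting position and can take exponential time in the worst case, whereas the abstract states that SAIL reports each occurrence as soon as it appears in T.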
Additional information
SAIL stands for string matching with wildcards and length constraints.
Gong Chen received the B.Eng. degree from the Beijing University of Technology, China, and the M.Sc. degree from the University of Vermont, USA, both in computer science. He is currently a graduate student in the Department of Statistics at the University of California, Los Angeles, USA. His research interests include data mining, statistical learning, machine learning, algorithm analysis and design, and database management.
Xindong Wu is a professor and the chair of the Department of Computer Science at the University of Vermont. He holds a Ph.D. in Artificial Intelligence from the University of Edinburgh, Britain. His research interests include data mining, knowledge-based systems, and Web information exploration. He has published extensively in these areas in various journals and conferences, including IEEE TKDE, TPAMI, ACM TOIS, IJCAI, AAAI, ICML, KDD, ICDM and WWW, as well as 12 books and conference proceedings. Dr. Wu is the Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering (by the IEEE Computer Society), the founder and current Steering Committee Chair of the IEEE International Conference on Data Mining (ICDM), an Honorary Editor-in-Chief of Knowledge and Information Systems (by Springer), and a Series Editor of the Springer Book Series on Advanced Information and Knowledge Processing (AI&KP). He is the 2004 ACM SIGKDD Service Award winner.
Xingquan Zhu received his Ph.D. degree in Computer Science from Fudan University, Shanghai, China, in 2001. He spent 4 months with Microsoft Research Asia, Beijing, China, where he worked on content-based image retrieval with relevance feedback. From 2001 to 2002, he was a postdoctoral associate in the Department of Computer Science at Purdue University, West Lafayette, IN. He is currently a research assistant professor in the Department of Computer Science at the University of Vermont, Burlington, VT. His research interests include data mining, machine learning, data quality, multimedia computing, and information retrieval. Since 2000, Dr. Zhu has published extensively, including over 50 refereed papers in various journals and conference proceedings.
Abdullah N. Arslan received his Ph.D. degree in Computer Science in 2002 from the University of California at Santa Barbara. Upon graduation he joined the Department of Computer Science at the University of Vermont as an assistant professor and has been on the computer science faculty there since then. Dr. Arslan's main research interests are in algorithms on strings, computational biology, and bioinformatics. He earned his Master's degree in Computer Science in 1996 from the University of North Texas, Denton, Texas, and his Bachelor's degree in Computer Engineering in 1990 from the Middle East Technical University, Ankara, Turkey. He worked as a programmer for the Central Bank of Turkey between 1991 and 1994.
Yu He received her B.E. degree in Information Engineering from Zhejiang University, China, in 2001. She is currently a graduate student in the Department of Computer Science at the University of Vermont. Her research interests include data mining, bioinformatics and pattern recognition.
About this article
Cite this article
Chen, G., Wu, X., Zhu, X. et al. Efficient string matching with wildcards and length constraints. Knowl Inf Syst 10, 399–419 (2006). https://doi.org/10.1007/s10115-006-0016-8
DOI: https://doi.org/10.1007/s10115-006-0016-8