Efficient string matching with wildcards and length constraints

  • Regular Paper
Knowledge and Information Systems

Abstract

This paper defines a challenging pattern matching problem between a pattern P and a text T, with wildcards and length constraints, and designs an efficient algorithm that returns each pattern occurrence in an online manner. In this problem, the user can specify constraints on the number of wildcards between every two consecutive letters of P and constraints on the length of each matching substring in T. We design a complete algorithm, SAIL, that returns each matching substring of P in T as soon as it appears in T, in O(n + klmg) time with O(lm) space overhead, where n is the length of T, k is the number of occurrences of P's last letter in T, l is the user-specified maximum length of a matching substring, m is the length of P, and g is the maximum difference between the user-specified maximum and minimum numbers of wildcards allowed between two consecutive letters in P.
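To make the problem statement concrete, the following is a minimal Python sketch of a naive backtracking matcher for this kind of constrained pattern. It is not SAIL, and it runs offline rather than online; it only illustrates what an occurrence looks like under the two kinds of constraints. The function name `find_occurrences` and the parameters `gaps`, `min_len`, and `max_len` are hypothetical, and the sketch assumes that the length of a match is the span from the first to the last matched letter, inclusive.

```python
from typing import Iterator, List, Sequence, Tuple


def find_occurrences(
    text: str,
    pattern: str,
    gaps: Sequence[Tuple[int, int]],
    min_len: int,
    max_len: int,
) -> Iterator[Tuple[int, ...]]:
    """Yield the text positions of each occurrence of `pattern`.

    gaps[i] = (lo, hi) bounds the number of wildcards allowed between
    pattern[i] and pattern[i + 1]. The span from the first to the last
    matched position (inclusive) must lie in [min_len, max_len].
    Naive illustration of the problem definition, not the SAIL algorithm.
    """
    m = len(pattern)
    if m == 0 or len(gaps) != m - 1:
        raise ValueError("need a non-empty pattern and len(pattern) - 1 gaps")

    def extend(j: int, positions: List[int]) -> Iterator[Tuple[int, ...]]:
        if j == m:
            span = positions[-1] - positions[0] + 1
            if min_len <= span <= max_len:
                yield tuple(positions)
            return
        lo, hi = gaps[j - 1]
        prev = positions[-1]
        last = min(prev + 1 + hi, len(text) - 1)
        for nxt in range(prev + 1 + lo, last + 1):
            # Stop early once the span already exceeds the maximum length.
            if nxt - positions[0] + 1 > max_len:
                break
            if text[nxt] == pattern[j]:
                positions.append(nxt)
                yield from extend(j + 1, positions)
                positions.pop()

    for start, ch in enumerate(text):
        if ch == pattern[0]:
            yield from extend(1, [start])


# Pattern "ab" with 0 to 2 wildcards between 'a' and 'b', span in [2, 4].
print(list(find_occurrences("abcab", "ab", [(0, 2)], 2, 4)))
```

Each result is a tuple of 0-based text positions, one per pattern letter. This sketch enumerates occurrences by starting position and may take exponential time on adversarial inputs, whereas the abstract states that SAIL reports each occurrence as soon as it appears in T within the stated time and space bounds.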

Author information

Corresponding author

Correspondence to Gong Chen.

Additional information

SAIL stands for string matching with wildcards and length constraints.

Gong Chen received the B.Eng. degree from the Beijing University of Technology, China, and the M.Sc. degree from the University of Vermont, USA, both in computer science. He is currently a graduate student in the Department of Statistics at the University of California, Los Angeles, USA. His research interests include data mining, statistical learning, machine learning, algorithm analysis and design, and database management.

Xindong Wu is a professor and the chair of the Department of Computer Science at the University of Vermont. He holds a Ph.D. in Artificial Intelligence from the University of Edinburgh, Britain. His research interests include data mining, knowledge-based systems, and Web information exploration. He has published extensively in these areas in various journals and conferences, including IEEE TKDE, TPAMI, ACM TOIS, IJCAI, AAAI, ICML, KDD, ICDM and WWW, as well as 12 books and conference proceedings. Dr. Wu is the Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering (by the IEEE Computer Society), the founder and current Steering Committee Chair of the IEEE International Conference on Data Mining (ICDM), an Honorary Editor-in-Chief of Knowledge and Information Systems (by Springer), and a Series Editor of the Springer Book Series on Advanced Information and Knowledge Processing (AI&KP). He is the 2004 ACM SIGKDD Service Award winner.

Xingquan Zhu received his Ph.D. degree in Computer Science from Fudan University, Shanghai, China, in 2001. He spent 4 months with Microsoft Research Asia, Beijing, China, where he worked on content-based image retrieval with relevance feedback. From 2001 to 2002, he was a postdoctoral associate in the Department of Computer Science at Purdue University, West Lafayette, IN. He is currently a research assistant professor in the Department of Computer Science at the University of Vermont, Burlington, VT. His research interests include data mining, machine learning, data quality, multimedia computing, and information retrieval. Since 2000, Dr. Zhu has published extensively, including over 50 refereed papers in various journals and conference proceedings.

Abdullah N. Arslan received his Ph.D. degree in Computer Science in 2002 from the University of California at Santa Barbara. Upon graduation he joined the Department of Computer Science at the University of Vermont as an assistant professor, and he has been on the computer science faculty there ever since. Dr. Arslan's main research interests are in algorithms on strings, computational biology, and bioinformatics. Dr. Arslan earned his Master's degree in Computer Science in 1996 from the University of North Texas, Denton, Texas, and his Bachelor's degree in Computer Engineering in 1990 from the Middle East Technical University, Ankara, Turkey. He worked as a programmer for the Central Bank of Turkey between 1991 and 1994.

Yu He received her B.E. degree in Information Engineering from Zhejiang University, China, in 2001. She is currently a graduate student in the Department of Computer Science at the University of Vermont. Her research interests include data mining, bioinformatics and pattern recognition.

About this article

Cite this article

Chen, G., Wu, X., Zhu, X. et al. Efficient string matching with wildcards and length constraints. Knowl Inf Syst 10, 399–419 (2006). https://doi.org/10.1007/s10115-006-0016-8

  • DOI: https://doi.org/10.1007/s10115-006-0016-8
