A Matching Algorithm in PMWL based on CluTree

New Generation Computing

Abstract

Pattern matching with wildcards and length constraints (PMWL) is a complex problem with important applications in bioinformatics, network security, and information retrieval. Existing algorithms use the traditional left-most strategy when selecting among multiple candidate matching positions, which leads to incomplete final matching results. This paper presents a new data structure, CluTree, and a new matching algorithm, RBCT*1, based on it. A CluTree is a cluster of trees with red and black nodes built from a pattern P and a text T. The RBCT algorithm uses the sharing degree, correlation degree, and mixed information entropy of each node in the CluTree for path selection and dynamic pruning. It traverses the CluTree and finds more occurrences than existing algorithms under the one-off condition, at a linear time cost. Theoretical analysis and experimental results show that the RBCT algorithm outperforms its peers in retrieval precision and matching efficiency.
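
The claim that a left-most selection can leave the result incomplete can be illustrated with a small Python sketch. This is not the RBCT algorithm and it builds no CluTree. The function names (`occurrences`, `leftmost_one_off`, `best_one_off`), the toy text, and the gap notation are assumptions introduced here for illustration only. The one-off condition is assumed to mean that each text position is used by at most one occurrence, and the "left-most strategy" is modelled as a simple greedy scan, which may differ from the existing algorithms the abstract refers to.

```python
from itertools import combinations


def occurrences(text, chars, gaps):
    """List every occurrence as a tuple of text positions.

    chars[i] must match at the i-th position of the tuple, and the number
    of wildcards between consecutive matched positions must lie in
    gaps[i] = (lo, hi). Results come out in lexicographic order.
    """
    results = []

    def extend(prefix):
        k = len(prefix)
        if k == len(chars):
            results.append(tuple(prefix))
            return
        if k == 0:
            candidates = range(len(text))
        else:
            lo, hi = gaps[k - 1]
            first = prefix[-1] + 1 + lo
            last = min(prefix[-1] + 1 + hi, len(text) - 1)
            candidates = range(first, last + 1)
        for pos in candidates:
            if text[pos] == chars[k]:
                extend(prefix + [pos])

    extend([])
    return results


def leftmost_one_off(occs):
    """Greedy stand-in for a left-most strategy: scan occurrences in
    left-to-right order and keep each one whose positions are all unused."""
    used, chosen = set(), []
    for occ in occs:
        if used.isdisjoint(occ):
            chosen.append(occ)
            used.update(occ)
    return chosen


def best_one_off(occs):
    """Brute-force the largest set of position-disjoint occurrences.
    Exponential in len(occs); only meant for toy inputs."""
    for size in range(len(occs), 0, -1):
        for combo in combinations(occs, size):
            positions = [p for occ in combo for p in occ]
            if len(positions) == len(set(positions)):
                return list(combo)
    return []


# Toy instance: pattern "a", then 1 to 2 wildcards, then "a".
occs = occurrences("axaaa", "aa", [(1, 2)])
print(leftmost_one_off(occs))  # [(0, 2)]
print(best_one_off(occs))      # [(0, 3), (2, 4)]
```

On this toy input the greedy scan commits position 2 to the first occurrence and then finds nothing further, while pairing position 0 with position 3 leaves room for a second occurrence at (2, 4). The brute-force search is only a reference point for small inputs and says nothing about how RBCT reaches its linear time cost.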

Corresponding author

Correspondence to Xindong Wu.

Additional information

*1 RBCT stands for pattern matching with wildcards in a cluster of trees with Red and Black nodes; the cluster of trees is called CluTree.

About this article

Cite this article

Liu, Y., Wu, X., Hu, Xg. et al. A Matching Algorithm in PMWL based on CluTree. New Gener. Comput. 32, 95–122 (2014). https://doi.org/10.1007/s00354-014-0201-3

  • DOI: https://doi.org/10.1007/s00354-014-0201-3
