Automatic generation of regular expressions for the Regex Golf challenge using a local search algorithm

Genetic Programming and Evolvable Machines

Abstract

Regular expressions are widely used in software development to extract textual data, validate the structure of textual documents, and format data. Regex Golf is a challenge that consists of finding the shortest possible regular expression that matches every string in one set and none of the strings in another set. An algorithm capable of meeting the Regex Golf requirements is a relevant contribution to the area of data extraction from semi-structured documents. In this paper, we propose a heuristic algorithm based on local search, combined with a regular expression shrinker, to find valid results for Regex Golf problems. We conducted an experimental study to compare the proposed technique with an exact algorithm and with a genetic programming algorithm designed for the Regex Golf challenge. The proposed local search outperformed both competing algorithms in six out of fifteen problem instances and tied in another three. On the other hand, none of the algorithms can yet outperform human software developers in designing regular expressions for the challenge.
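To make the problem statement concrete, the following Python sketch checks whether a candidate expression is a valid Regex Golf answer and then applies a naive greedy shrinking pass. It is an illustration only and not the local search or the shrinker proposed in the paper, whose details are not given in the abstract. The function names `is_valid` and `greedy_shrink`, the toy match and unmatch lists, and the single-character deletion strategy are all assumptions made for this example.

```python
import re


def is_valid(pattern, match, unmatch):
    """Return True if `pattern` compiles, is found in every string of
    `match`, and is found in no string of `unmatch`."""
    try:
        rx = re.compile(pattern)
    except re.error:
        # Not a well-formed regular expression.
        return False
    return (all(rx.search(s) for s in match)
            and not any(rx.search(s) for s in unmatch))


def greedy_shrink(pattern, match, unmatch):
    """Hypothetical shrinker: delete one character at a time for as long
    as the shorter pattern is still a valid answer. This is a plain greedy
    pass chosen for illustration, not the shrinker used by the authors."""
    improved = True
    while improved:
        improved = False
        for i in range(len(pattern)):
            candidate = pattern[:i] + pattern[i + 1:]
            if candidate and is_valid(candidate, match, unmatch):
                pattern = candidate
                improved = True
                break  # restart the scan from the shorter pattern
    return pattern


# Toy instance (invented for this example, not taken from the paper).
match = ["foobar", "food", "barfoo"]
unmatch = ["bar", "baz", "qux"]

# A trivially valid starting point: the alternation of all match strings.
start = "^(foobar|food|barfoo)$"
best = greedy_shrink(start, match, unmatch)
print(best, len(best))
```

Starting from the alternation of all strings to match guarantees a valid initial answer whenever the two lists are disjoint, so any search only has to make that answer shorter. A greedy pass like this one can stall in a local minimum, which is the kind of limitation a more elaborate search strategy is meant to address.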

Notes

  1. By instance we mean a description of a Regex Golf challenge, consisting of a match list and an unmatch list. We collected 15 such descriptions from the website for experimental evaluation.

  2. https://versus.com/en/intel-core-i7-9750h-vs-intel-xeon-e5-2440.

  3. https://gist.github.com/jpsim/8057500.

  4. https://gist.github.com/Davidebyzero/9221685.

Acknowledgements

The authors would like to thank CNPq (Conselho Nacional de Pesquisa) and CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) for their support for this research project. The authors also acknowledge the contributions given by the peer reviewers to improve this paper.

Author information

Corresponding author

Correspondence to Márcio de Oliveira Barros.

About this article

Cite this article

de Almeida Farzat, A., de Oliveira Barros, M. Automatic generation of regular expressions for the Regex Golf challenge using a local search algorithm. Genet Program Evolvable Mach 23, 105–131 (2022). https://doi.org/10.1007/s10710-021-09411-x

  • DOI: https://doi.org/10.1007/s10710-021-09411-x
