Abstract
Regular expressions (regexes) are patterns used in many applications to extract words or tokens from text. However, even hand-crafted regexes may fail to match all the intended words. In this paper, we propose a novel way to generalize a given regex so that it also matches a set of missing (previously unmatched) words. Our method finds an approximate match between the missing words and the regex, and adds disjunctions for the unmatched parts. We show that this method can not only improve the precision and recall of the regex, but also generate much shorter regexes than baselines and competitors on various datasets.
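To make the idea of "adding disjunctions for the unmatched parts" concrete, here is a minimal sketch in Python. It is not the algorithm of the paper: it assumes the given regex is a plain literal string, and it uses the standard-library `difflib.SequenceMatcher` as a stand-in for approximate regex matching. The function name `generalize_literal` is hypothetical and introduced only for this illustration.

```python
import re
from difflib import SequenceMatcher


def generalize_literal(literal: str, missing: str) -> str:
    """Build a regex that matches both `literal` and `missing`.

    Assumption: the original "regex" is a plain literal string.
    The two strings are aligned, and a disjunction is inserted
    only where they differ.
    """
    parts = []
    matcher = SequenceMatcher(None, literal, missing, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old, new = literal[i1:i2], missing[j1:j2]
        if tag == "equal":
            # Shared part: keep it as an escaped literal.
            parts.append(re.escape(old))
        else:
            # Unmatched part: allow either the old or the new text.
            # One side may be empty (pure insertion or deletion).
            parts.append("(?:%s|%s)" % (re.escape(old), re.escape(new)))
    return "".join(parts)


pattern = generalize_literal("color", "colour")
print(pattern)
print(bool(re.fullmatch(pattern, "color")))
print(bool(re.fullmatch(pattern, "colour")))
```

The simplest alternative would be to append each missing word as a top-level alternative, as in `color|colour`. Aligning the word with the existing pattern first keeps the shared parts in one place, which is the intuition behind producing shorter regexes. Note that this sketch can also accept strings that mix the old and new alternatives when the two words differ in several places.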
Acknowledgments
This research was partially supported by Labex DigiCosme (project ANR-11-LABEX-0045-DIGICOSME) operated by ANR as part of the program “Investissement d’Avenir” Idex Paris-Saclay (ANR-11-IDEX-0003-02).
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Rebele, T., Tzompanaki, K., Suchanek, F.M. (2018). Adding Missing Words to Regular Expressions. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10938. Springer, Cham. https://doi.org/10.1007/978-3-319-93037-4_6
DOI: https://doi.org/10.1007/978-3-319-93037-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93036-7
Online ISBN: 978-3-319-93037-4
eBook Packages: Computer Science, Computer Science (R0)