Abstract
We present a comparative analysis of the leading rule-based information extraction systems from research and industry, focusing on their main characteristics and their performance. We evaluated the tools on a dataset of text documents describing financial products, taken from a real-world application scenario. We show that, while the considered tools are similar in the expressiveness of their extractors and produce results of comparable quality, the implementation choices of their engines have a substantial impact on overall execution time. Moreover, some of the considered tools offer seamless support for writing extraction rules, which addresses one of the common challenges of rule-based approaches.
Notes
- 1. PRIIPs Regulation n. 1286/2014.
- 2. Note that in various cases we had to implement more than one extractor for a single field.
Acknowledgments
Scafoglieri’s research was entirely and exclusively supported by PNRR MUR project PE0000013-FAIR. Lembo’s research was supported by EU ICT-48 2020 project TAILOR (No. 952215), EU ERA-NET Cofund ICT-AGRI-FOOD project ADCATER (No. 40705), and PNRR MUR project PE0000013-FAIR.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lembo, D., Scafoglieri, F.M. (2023). Comparing State of the Art Rule-Based Tools for Information Extraction. In: Fensel, A., Ozaki, A., Roman, D., Soylu, A. (eds) Rules and Reasoning. RuleML+RR 2023. Lecture Notes in Computer Science, vol 14244. Springer, Cham. https://doi.org/10.1007/978-3-031-45072-3_11
DOI: https://doi.org/10.1007/978-3-031-45072-3_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45071-6
Online ISBN: 978-3-031-45072-3
eBook Packages: Computer Science, Computer Science (R0)