Comparing State of the Art Rule-Based Tools for Information Extraction

  • Conference paper
Rules and Reasoning (RuleML+RR 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14244))

Abstract

In this paper, we present a comparative analysis of the leading rule-based information extraction systems in both research and industry, focusing on their main characteristics and their performance. We performed our evaluation on a dataset of text documents containing financial product descriptions, taken from a real-world application scenario. We show that, while the considered tools are similar in the expressiveness of their extractors and produce results of comparable quality, the implementation choices of their engines have a substantial impact on overall execution time. Moreover, we point out that some of the considered tools offer seamless support for writing extraction rules, which addresses one of the common challenges of rule-based approaches.

Notes

  1. PRIIPs Regulation n. 1286/2014.

  2. Note that in various cases we had to realize more than one extractor for a single field.

Acknowledgments

Scafoglieri’s research was entirely and exclusively supported by PNRR MUR project PE0000013-FAIR. Lembo’s research was supported by EU ICT-48 2020 project TAILOR (No. 952215), EU ERA-NET Cofund ICT-AGRI-FOOD project ADCATER (No. 40705), and PNRR MUR project PE0000013-FAIR.

Author information

Corresponding author

Correspondence to Federico Maria Scafoglieri.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lembo, D., Scafoglieri, F.M. (2023). Comparing State of the Art Rule-Based Tools for Information Extraction. In: Fensel, A., Ozaki, A., Roman, D., Soylu, A. (eds) Rules and Reasoning. RuleML+RR 2023. Lecture Notes in Computer Science, vol 14244. Springer, Cham. https://doi.org/10.1007/978-3-031-45072-3_11

  • DOI: https://doi.org/10.1007/978-3-031-45072-3_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45071-6

  • Online ISBN: 978-3-031-45072-3

  • eBook Packages: Computer Science, Computer Science (R0)
