A Benchmark of Nested Named Entity Recognition Approaches in Historical Structured Documents

  • Conference paper
Document Analysis and Recognition - ICDAR 2023 (ICDAR 2023)

Abstract

Named Entity Recognition (NER) is a key step in the creation of structured data from digitised historical documents. Traditional NER approaches deal with flat named entities, whereas entities are often nested. For example, a postal address might contain a street name and a number. This work compares three nested NER approaches, including two state-of-the-art approaches using Transformer-based architectures. We introduce a new Transformer-based approach based on joint labelling and semantic weighting of errors, evaluated on a collection of 19th-century Paris trade directories. We evaluate the approaches with respect to the impact of supervised fine-tuning, unsupervised pre-training with noisy texts, and variation of IOB tagging formats. Our results show that while nested NER approaches enable extracting structured data directly, they do not benefit from the extra knowledge provided during training and reach a performance similar to the base approach on flat entities. Even though all three approaches perform well in terms of F1-scores, joint labelling is most suitable for hierarchically structured data. Finally, our experiments reveal the superiority of the IO tagging format on such data.
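To make the terms "joint labelling" and "IO versus IOB tagging format" concrete, the following is a minimal illustrative sketch in Python. It is not the authors' implementation: the example entry, the label names (PER, ACT, SPAT, LOC, CARDINAL), the "+" separator and the helper functions tag_level and joint_labels are all assumptions chosen for illustration. It encodes two nesting levels of a directory-like entry as one tag sequence per level, then merges them into a single joint label per token.

```python
from typing import List, Tuple

# A span is (start, end_exclusive, label) over token indices.
Span = Tuple[int, int, str]


def tag_level(n_tokens: int, spans: List[Span], scheme: str = "IOB2") -> List[str]:
    """Tag one nesting level in the IOB2 or IO format (hypothetical helper)."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            if scheme == "IO":
                # IO: every token inside an entity gets the same I- tag.
                tags[i] = f"I-{label}"
            else:
                # IOB2: the first token of an entity is marked with B-.
                tags[i] = ("B-" if i == start else "I-") + label
    return tags


def joint_labels(level1: List[str], level2: List[str]) -> List[str]:
    """Merge per-level tags into one joint label per token (assumed '+' separator)."""
    return [f"{a}+{b}" for a, b in zip(level1, level2)]


# Made-up entry in the style of a trade directory.
tokens = ["Dupont", ",", "boulanger", ",", "rue", "de", "Rivoli", ",", "12"]
# Level 1: top-level entities; the address spans tokens 4..8.
top: List[Span] = [(0, 1, "PER"), (2, 3, "ACT"), (4, 9, "SPAT")]
# Level 2: entities nested inside the address.
nested: List[Span] = [(4, 7, "LOC"), (8, 9, "CARDINAL")]

n = len(tokens)
joint_iob2 = joint_labels(tag_level(n, top, "IOB2"), tag_level(n, nested, "IOB2"))
joint_io = joint_labels(tag_level(n, top, "IO"), tag_level(n, nested, "IO"))

for token, a, b in zip(tokens, joint_iob2, joint_io):
    print(f"{token:10s} {a:22s} {b}")
```

In this sketch a single token classifier would predict one joint label per token, so both levels are recovered in one pass. The IO variant yields a smaller set of distinct joint labels than IOB2 on the same entry, at the cost of no longer marking entity starts explicitly.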

Acknowledgments

This work is supported by the French National Research Agency (ANR), as part of the SODUCO project (grant ANR-18-CE38-0013) and by the French Ministry of the Armed Forces - Defence Innovation Agency (AID).

Our datasets (images and their associated text transcription), code and models are available on Zenodo (https://doi.org/10.5281/zenodo.7864174, https://doi.org/10.5281/zenodo.7867008) and HuggingFace (https://huggingface.co/nlpso).

Author information

Corresponding author

Correspondence to Solenn Tual.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tual, S., Abadie, N., Chazalon, J., Duménieu, B., Carlinet, E. (2023). A Benchmark of Nested Named Entity Recognition Approaches in Historical Structured Documents. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14189. Springer, Cham. https://doi.org/10.1007/978-3-031-41682-8_8

  • DOI: https://doi.org/10.1007/978-3-031-41682-8_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41681-1

  • Online ISBN: 978-3-031-41682-8

  • eBook Packages: Computer Science, Computer Science (R0)
