A Benchmark of Nested Named Entity Recognition Approaches in Historical Structured Documents

Tual, Solenn; Abadie, Nathalie; Chazalon, Joseph; Duménieu, Bertrand; Carlinet, Edwin

doi:10.1007/978-3-031-41682-8_8

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14189)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

Named Entity Recognition (NER) is a key step in the creation of structured data from digitised historical documents. Traditional NER approaches deal with flat named entities, whereas entities are often nested. For example, a postal address might contain a street name and a number. This work compares three nested NER approaches, including two state-of-the-art approaches using Transformer-based architectures. We introduce a new Transformer-based approach based on joint labelling and semantic weighting of errors, evaluated on a collection of 19th-century Paris trade directories. We evaluate the approaches with respect to the impact of supervised fine-tuning, unsupervised pre-training with noisy texts, and variation of IOB tagging formats. Our results show that while nested NER approaches enable extracting structured data directly, they do not benefit from the extra knowledge provided during training and reach a performance similar to the base approach on flat entities. Even though all three approaches perform well in terms of F1-scores, joint labelling is most suitable for hierarchically structured data. Finally, our experiments reveal the superiority of the IO tagging format on such data.
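
To make the notions of nested entities, joint labelling and tagging formats more concrete, the following minimal Python sketch tags a two-level example under both the IO and IOB2 formats and then merges the two levels into one joint label per token. It is an illustration under stated assumptions, not the authors' implementation: the directory entry, the label names (PER, ADDR, STREET, NUM), the "+" separator and the helper functions `tag` and `joint_labels` are all hypothetical.

```python
from typing import List, Tuple

# A span is (start token index, end token index exclusive, label).
Span = Tuple[int, int, str]


def tag(n_tokens: int, spans: List[Span], scheme: str = "IOB2") -> List[str]:
    """Tag one level of non-overlapping spans with the IO or IOB2 format."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            if scheme == "IO":
                # IO marks membership only; entity starts are not distinguished.
                tags[i] = f"I-{label}"
            else:
                # IOB2 marks the first token of every entity with B-.
                tags[i] = ("B-" if i == start else "I-") + label
    return tags


def joint_labels(level1: List[str], level2: List[str]) -> List[str]:
    """Concatenate the tags of both levels into a single label per token."""
    return [f"{outer}+{inner}" for outer, inner in zip(level1, level2)]


# Hypothetical directory entry and hypothetical label set.
tokens = ["Dupont", ",", "rue", "de", "Rivoli", ",", "12"]
outer = [(0, 1, "PER"), (2, 7, "ADDR")]    # top-level entities
inner = [(2, 5, "STREET"), (6, 7, "NUM")]  # entities nested in the address

n = len(tokens)
io_joint = joint_labels(tag(n, outer, "IO"), tag(n, inner, "IO"))
iob_joint = joint_labels(tag(n, outer, "IOB2"), tag(n, inner, "IOB2"))

for token, io, iob in zip(tokens, io_joint, iob_joint):
    print(f"{token:8} {io:18} {iob}")
```

With joint labels of this kind, a single flat token classifier can predict both levels at once. In this toy example the IO format produces fewer distinct joint labels than IOB2, because it does not mark entity beginnings.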


Acknowledgments

This work is supported by the French National Research Agency (ANR), as part of the SODUCO project (grant ANR-18-CE38-0013) and by the French Ministry of the Armed Forces - Defence Innovation Agency (AID).

Our datasets (images and their associated text transcription), code and models are available on Zenodo (https://doi.org/10.5281/zenodo.7864174, https://doi.org/10.5281/zenodo.7867008) and HuggingFace (https://huggingface.co/nlpso).

Author information

Authors and Affiliations

LASTIG, Univ. Gustave Eiffel, IGN-ENSG, 94160, Saint-Mandé, France
Solenn Tual & Nathalie Abadie
EPITA Research Laboratory (LRE), Le Kremlin-Bicêtre, France
Joseph Chazalon & Edwin Carlinet
CRH-EHESS, Paris, France
Bertrand Duménieu

Corresponding author

Correspondence to Solenn Tual.

Editor information

Editors and Affiliations

TU Dortmund University, Dortmund, Germany
Gernot A. Fink
Adobe, College Park, MD, USA
Rajiv Jain
Osaka Metropolitan University, Osaka, Japan
Koichi Kise
Rochester Institute of Technology, Rochester, NY, USA
Richard Zanibbi

About this paper

Cite this paper

Tual, S., Abadie, N., Chazalon, J., Duménieu, B., Carlinet, E. (2023). A Benchmark of Nested Named Entity Recognition Approaches in Historical Structured Documents. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14189. Springer, Cham. https://doi.org/10.1007/978-3-031-41682-8_8

DOI: https://doi.org/10.1007/978-3-031-41682-8_8
Published: 19 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41681-1
Online ISBN: 978-3-031-41682-8
eBook Packages: Computer Science, Computer Science (R0)
