Abstract
Named Entity Recognition (NER) is a key step in the creation of structured data from digitised historical documents. Traditional NER approaches deal with flat named entities, whereas entities are often nested. For example, a postal address might contain a street name and a number. This work compares three nested NER approaches, including two state-of-the-art approaches using Transformer-based architectures. We introduce a new Transformer-based approach based on joint labelling and semantic weighting of errors, evaluated on a collection of 19th-century Paris trade directories. We evaluate the approaches with respect to the impact of supervised fine-tuning, unsupervised pre-training with noisy texts, and variation of IOB tagging formats. Our results show that while nested NER approaches enable extracting structured data directly, they do not benefit from the extra knowledge provided during training and reach a performance similar to the base approach on flat entities. Even though all three approaches perform well in terms of F1-scores, joint labelling is most suitable for hierarchically structured data. Finally, our experiments reveal the superiority of the IO tagging format on such data.
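To make the idea of joint labelling and the difference between tagging formats more concrete, the following minimal sketch builds one combined label per token from two nesting levels, under both the IO and the IOB2 schemes. It is an illustration only, not the implementation evaluated in the paper: the entity labels (PER, ADDR, STREET, NUM), the "+" separator, the sample tokens, and the function names are all hypothetical, and IOB2 is assumed here as the representative IOB variant.

```python
from typing import List, Tuple

# A span is (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def tag_level(n_tokens: int, spans: List[Span], scheme: str) -> List[str]:
    """Tag a single nesting level with the IO or IOB2 scheme."""
    if scheme not in ("IO", "IOB2"):
        raise ValueError(f"unsupported scheme: {scheme}")
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            # IOB2 marks the first token of every entity with B-;
            # IO uses I- for every token inside an entity.
            prefix = "B" if scheme == "IOB2" and i == start else "I"
            tags[i] = f"{prefix}-{label}"
    return tags


def joint_labels(
    n_tokens: int,
    level1: List[Span],
    level2: List[Span],
    scheme: str = "IO",
    sep: str = "+",  # hypothetical separator between the two levels
) -> List[str]:
    """Concatenate the per-level tags into one joint label per token."""
    top = tag_level(n_tokens, level1, scheme)
    nested = tag_level(n_tokens, level2, scheme)
    return [f"{a}{sep}{b}" for a, b in zip(top, nested)]


# Hypothetical directory entry: a person followed by an address that
# itself contains a street name and a number.
tokens = ["Dupont", ",", "rue", "Vivienne", ",", "12"]
level1 = [(0, 1, "PER"), (2, 6, "ADDR")]    # top-level entities
level2 = [(2, 4, "STREET"), (5, 6, "NUM")]  # entities nested in ADDR

io_tags = joint_labels(len(tokens), level1, level2, scheme="IO")
iob2_tags = joint_labels(len(tokens), level1, level2, scheme="IOB2")

if __name__ == "__main__":
    for token, io_tag, iob2_tag in zip(tokens, io_tags, iob2_tags):
        print(f"{token:10s} {io_tag:20s} {iob2_tag}")
```

With joint labels, a standard flat token classifier can predict both levels at once, at the cost of a larger label set. The IO scheme keeps that label set smaller than IOB2 because it has no separate B- tags, although it cannot separate two adjacent entities of the same type.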
Acknowledgments
This work is supported by the French National Research Agency (ANR), as part of the SODUCO project (grant ANR-18-CE38-0013) and by the French Ministry of the Armed Forces - Defence Innovation Agency (AID).
Our datasets (images and their associated text transcription), code and models are available on Zenodo (https://doi.org/10.5281/zenodo.7864174, https://doi.org/10.5281/zenodo.7867008) and HuggingFace (https://huggingface.co/nlpso).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tual, S., Abadie, N., Chazalon, J., Duménieu, B., Carlinet, E. (2023). A Benchmark of Nested Named Entity Recognition Approaches in Historical Structured Documents. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14189. Springer, Cham. https://doi.org/10.1007/978-3-031-41682-8_8
DOI: https://doi.org/10.1007/978-3-031-41682-8_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41681-1
Online ISBN: 978-3-031-41682-8
eBook Packages: Computer Science, Computer Science (R0)