Abstract
Recent developments in Named Entity Recognition (NER) have demonstrated good results on grammatically correct texts, even in low-resource settings. However, when a NER model faces ungrammatical text, its performance often degrades sharply. In this study, we analyze NER performance on datasets containing errors typical of user-generated texts in the Latvian language. We explore three strategies for increasing the robustness of named entity recognition: injecting errors into grammatically correct texts, augmenting grammatically correct texts with erroneous texts, and augmenting grammatically correct texts with erroneous texts containing specific error types. We demonstrate that in low-resource settings, the most noise-robust model is obtained by augmenting the training data with datasets containing different error types. Our best model achieves an average F1 score of 83.5 (vs. 84.1 for the baseline) on grammatically correct text, while maintaining good performance on noisy texts (F1 of 79 vs. 66 for the baseline).
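For illustration, the error-injection strategy can be sketched as a token-level perturbation of clean training text. The snippet below is a minimal sketch, not the authors' implementation: the specific noise types (dropped Latvian diacritics, lowercasing, adjacent-character swaps), the perturbation rate p, and the function name inject_noise are all assumptions chosen to mirror errors typical of user-generated text.

```python
import random

# Illustrative noise-injection sketch (not the authors' implementation):
# perturb clean training tokens with error types typical of Latvian
# user-generated text, e.g. dropped diacritics, lowercasing, char swaps.

DIACRITIC_MAP = str.maketrans("āčēģīķļņšūž", "acegiklnsuz")

def inject_noise(tokens, p=0.1, seed=None):
    """Return a copy of `tokens` with roughly a fraction `p` perturbed."""
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        if rng.random() < p and len(tok) > 1:
            error = rng.choice(["diacritics", "lowercase", "swap"])
            if error == "diacritics":    # strip Latvian diacritics
                tok = tok.translate(DIACRITIC_MAP)
            elif error == "lowercase":   # capitalization error
                tok = tok.lower()
            else:                        # transpose two adjacent characters
                i = rng.randrange(len(tok) - 1)
                tok = tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]
        noisy.append(tok)
    return noisy

print(inject_noise(["Rīga", "ir", "Latvijas", "galvaspilsēta"], p=0.5, seed=1))
```

Because these perturbations keep token boundaries intact, the original BIO entity labels can be carried over unchanged to the noisy copy of each training sentence.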
Acknowledgements
This research has been supported by the European Regional Development Fund within the joint research project of SIA Tilde and the University of Latvia, “Multilingual Artificial Intelligence Based Human Computer Interaction” (No. 1.1.1.1/18/A/148).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Vīksna, R., Skadiņa, I. (2021). Robustness of Named Entity Recognition: Case of Latvian. In: Espinosa-Anke, L., Martín-Vide, C., Spasić, I. (eds.) Statistical Language and Speech Processing. SLSP 2021. Lecture Notes in Computer Science, vol. 13062. Springer, Cham. https://doi.org/10.1007/978-3-030-89579-2_5
DOI: https://doi.org/10.1007/978-3-030-89579-2_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89578-5
Online ISBN: 978-3-030-89579-2
eBook Packages: Computer Science, Computer Science (R0)