Analyzing the Impact of Tokenization on Multilingual Epidemic Surveillance in Low-Resource Languages

  • Conference paper
  • In: Document Analysis and Recognition - ICDAR 2023 (ICDAR 2023)

Abstract

Pre-trained language models have been widely successful, particularly in settings with sufficient training data. However, achieving similar results in low-resource multilingual settings and specialized domains, such as epidemic surveillance, remains challenging. In this paper, we propose hypotheses regarding the factors that could impact the performance of an epidemic event extraction system in a multilingual low-resource scenario: the type of pre-trained language model, the quality of the pre-trained tokenizer, and the characteristics of the entities to be extracted. We perform an exhaustive analysis of these factors and observe a strong correlation between them and model performance on a low-resource multilingual epidemic surveillance task. Consequently, we believe that language-specific adaptation of multilingual tokenizers, together with their extension with domain-specific entities, is beneficial to multilingual epidemic event extraction in low-resource settings.

Notes

  1. The corpus is freely and publicly available at https://daniel.greyc.fr/public/index.php?a=corpus.

  2. All models can be found on the Hugging Face website: https://huggingface.co.

  3. In all experiments, we use AdamW [19] with a learning rate of \(10^{-5}\), fine-tuning for 20 epochs. Following [1], we use a maximum sentence length of 164 (see the configuration sketch after these notes).

  4. The Hugging Face Transformers library (https://huggingface.co/docs/transformers/) provides a function for adding new entities to the existing vocabulary of a tokenizer. This function discards tokens in the extension vocabulary that already appear in the original pre-trained vocabulary, ensuring that the extension vocabulary is an absolute complement of the original vocabulary. The size of the extension vocabulary varies depending on the language and pre-trained model (see the second sketch after these notes).
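
Since note 3 describes a concrete configuration, a short illustration may help. Below is a minimal sketch of that fine-tuning setup in Python, using PyTorch and the Hugging Face Transformers library; the model name, label count, and example sentence are assumptions for illustration, not taken from the paper.

    # Minimal sketch of the fine-tuning configuration in note 3: AdamW,
    # learning rate 1e-5, 20 epochs, maximum sentence length 164.
    import torch
    from torch.optim import AdamW
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    MODEL_NAME = "bert-base-multilingual-cased"  # assumed; any pre-trained model
    NUM_LABELS = 5        # assumed label count for epidemic event tags
    MAX_LENGTH = 164      # maximum sentence length from note 3
    NUM_EPOCHS = 20
    LEARNING_RATE = 1e-5

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_NAME, num_labels=NUM_LABELS
    )
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

    # Sentences are truncated or padded to the fixed maximum length.
    batch = tokenizer(
        ["An outbreak of cholera has been reported in the coastal region."],
        truncation=True, padding="max_length", max_length=MAX_LENGTH,
        return_tensors="pt",
    )

    # One illustrative optimization step; actual training iterates over the
    # corpus for NUM_EPOCHS epochs. The all-zero labels are placeholders.
    labels = torch.zeros_like(batch["input_ids"])
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()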
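Similarly, the vocabulary-extension mechanism in note 4 can be sketched as follows. The add_tokens method is the Transformers function that discards tokens already present in the pre-trained vocabulary; the entity list below is purely illustrative.

    # Sketch of tokenizer vocabulary extension (note 4). `add_tokens` skips
    # tokens already in the pre-trained vocabulary, so the extension is an
    # absolute complement of the original vocabulary.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=5
    )

    # Domain-specific entities (e.g., disease names) harvested from the
    # target-language corpus; the list here is illustrative only. The number
    # actually added varies per language and pre-trained model.
    domain_entities = ["chikungunya", "leptospirosis", "marburg"]
    num_added = tokenizer.add_tokens(domain_entities)

    # Resize the embedding matrix so each new token receives a (randomly
    # initialized) embedding row before fine-tuning.
    model.resize_token_embeddings(len(tokenizer))
    print(f"{num_added} tokens added; vocabulary size is now {len(tokenizer)}")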

References

  1. Adelani, D.I., et al.: MasakhaNER: named entity recognition for African languages. Trans. Assoc. Comput. Linguist. 9, 1116–1131 (2021). https://doi.org/10.1162/tacl_a_00416, https://aclanthology.org/2021.tacl-1.66

  2. Balajee, S.A., Salyer, S.J., Greene-Cramer, B., Sadek, M., Mounts, A.W.: The practice of event-based surveillance: concept and methods. Global Secur. Health Sci. Policy 6(1), 1–9 (2021)

  3. Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3615–3620. Association for Computational Linguistics, Hong Kong, China, November 2019. https://doi.org/10.18653/v1/D19-1371, https://aclanthology.org/D19-1371

  4. Brownstein, J.S., Freifeld, C.C., Reis, B.Y., Mandl, K.D.: Surveillance Sans Frontières: Internet-based emerging infectious disease intelligence and the HealthMap project. PLoS Med. 5(7), e151 (2008)

  5. Chen, X., Wang, S., Fu, B., Long, M., Wang, J.: Catastrophic forgetting meets negative transfer: batch spectral shrinkage for safe transfer learning. Adv. Neural Inf. Process. Syst. 32 (2019)

  6. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, 5-10 July 2020, pp. 8440–8451. Association for Computational Linguistics (2020). https://www.aclweb.org/anthology/2020.acl-main.747/

  7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota, June 2019. https://doi.org/10.18653/v1/N19-1423, https://aclanthology.org/N19-1423

  8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota, June 2019. https://doi.org/10.18653/v1/N19-1423

  9. Dórea, F.C., Revie, C.W.: Data-driven surveillance: effective collection, integration and interpretation of data to support decision-making. Front. Vet. Sci. 8, 225 (2021)

  10. Faruqui, M., Kumar, S.: Multilingual open relation extraction using cross-lingual projection. arXiv preprint arXiv:1503.06450 (2015)

  11. Feijo, D.D.V., Moreira, V.P.: Mono vs multilingual transformer-based models: a comparison across several language tasks. arXiv preprint arXiv:2007.09757 (2020)

  12. Gage, P.: A new algorithm for data compression. C Users J. 12(2), 23–38 (1994)

  13. Garneau, N., Leboeuf, J.S., Lamontagne, L.: Predicting and interpreting embeddings for out of vocabulary words in downstream tasks. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 331–333. Association for Computational Linguistics, Brussels, Belgium, November 2018. https://doi.org/10.18653/v1/W18-5439, https://aclanthology.org/W18-5439

  14. Gu, Y., et al.: Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthcare (HEALTH) 3(1), 1–23 (2021)

  15. Hedderich, M.A., Lange, L., Adel, H., Strötgen, J., Klakow, D.: A survey on recent approaches for natural language processing in low-resource scenarios. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2545–2568. Association for Computational Linguistics, June 2021. https://doi.org/10.18653/v1/2021.naacl-main.201, https://aclanthology.org/2021.naacl-main.201

  16. Hong, J., Kim, T., Lim, H., Choo, J.: AVocaDo: strategy for adapting vocabulary to downstream domain. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 4692–4700. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021. https://doi.org/10.18653/v1/2021.emnlp-main.385, https://aclanthology.org/2021.emnlp-main.385

  17. Huang, K., Altosaar, J., Ranganath, R.: ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342 (2019)

  18. Johnson, A.E., et al.: MIMIC-III, a freely accessible critical care database. Sci. Data 3(1), 1–9 (2016)

  19. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  20. Kirkpatrick, J., et al.: Overcoming catastrophic forgetting in neural networks. Proc. Nat. Acad. Sci. 114(13), 3521–3526 (2017)

  21. Klein, S., Tsarfaty, R.: Getting the ##life out of living: how adequate are word-pieces for modelling complex morphology? In: Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp. 204–209 (2020)

  22. Lauscher, A., Ravishankar, V., Vulić, I., Glavaš, G.: From zero to hero: on the limitations of zero-shot cross-lingual transfer with multilingual transformers. arXiv preprint arXiv:2005.00633 (2020)

  23. Lee, J., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)

  24. Lejeune, G., Brixtel, R., Doucet, A., Lucas, N.: Multilingual event extraction for epidemic detection. Artif. Intell. Med. 65(2), 131–143 (2015)

  25. Lejeune, G., Brixtel, R., Lecluze, C., Doucet, A., Lucas, N.: Added-value of automatic multilingual text analysis for epidemic surveillance. In: Peek, N., Marín Morales, R., Peleg, M. (eds.) AIME 2013. LNCS (LNAI), vol. 7885, pp. 284–294. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38326-7_40

  26. Lin, Y., Liu, Z., Sun, M.: Neural relation extraction with multi-lingual attention. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 34–43 (2017)

  27. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Bengio, Y., LeCun, Y. (eds.) 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, 2-4 May 2013, Workshop Track Proceedings (2013). http://arxiv.org/abs/1301.3781

  28. Mutuvi, S., Boros, E., Doucet, A., Lejeune, G., Jatowt, A., Odeo, M.: Multilingual epidemiological text classification: a comparative study. In: COLING, International Conference on Computational Linguistics (2020)

  29. Neves, M., Leser, U.: A survey on annotation tools for the biomedical literature. Briefings Bioinf. 15(2), 327–340 (2014)

  30. Poerner, N., Waltinger, U., Schütze, H.: Inexpensive domain adaptation of pretrained language models: case studies on biomedical NER and COVID-19 QA. In: Findings of the Association for Computational Linguistics: EMNLP 2020, November 2020

  31. Rust, P., Pfeiffer, J., Vulić, I., Ruder, S., Gurevych, I.: How good is your tokenizer? on the monolingual performance of multilingual language models. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3118–3135. Association for Computational Linguistics, August 2021. https://doi.org/10.18653/v1/2021.acl-long.243, https://aclanthology.org/2021.acl-long.243

  32. Schick, T., Schütze, H.: Rare words: a major problem for contextualized embeddings and how to fix it by attentive mimicking. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 8766–8774 (2020)

  33. Schuster, M., Nakajima, K.: Japanese and Korean voice search. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152. IEEE (2012)

  34. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725. Association for Computational Linguistics, Berlin, Germany, August 2016. https://doi.org/10.18653/v1/P16-1162, https://aclanthology.org/P16-1162

  35. Tai, W., Kung, H., Dong, X.L., Comiter, M., Kuo, C.F.: exBERT: extending pre-trained models with domain-specific vocabulary under constrained training resources. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 1433–1439 (2020)

  36. Tian, L., Zhang, X., Lau, J.H.: Rumour detection via zero-shot cross-lingual transfer learning. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds.) ECML PKDD 2021. LNCS (LNAI), vol. 12975, pp. 603–618. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86486-6_37

  37. Wang, X., Han, X., Lin, Y., Liu, Z., Sun, M.: Adversarial multi-lingual neural relation extraction. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1156–1166 (2018)

  38. Wang, Z., Mayhew, S., Roth, D., et al.: Cross-lingual ability of multilingual BERT: an empirical study. arXiv preprint arXiv:1912.07840 (2019)

  39. Wu, S., Dredze, M.: Beto, bentz, becas: the surprising cross-lingual effectiveness of BERT. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 833–844. Association for Computational Linguistics, Hong Kong, China, November 2019. https://doi.org/10.18653/v1/D19-1077, https://aclanthology.org/D19-1077

  40. Wu, S., Dredze, M.: Are all languages created equal in multilingual BERT? In: Proceedings of the 5th Workshop on Representation Learning for NLP. Association for Computational Linguistics, July 2020. https://doi.org/10.18653/v1/2020.repl4nlp-1.16, https://aclanthology.org/2020.repl4nlp-1.16

  41. Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)

  42. Yangarber, R., Best, C., Von Etter, P., Fuart, F., Horby, D., Steinberger, R.: Combining information about epidemic threats from multiple sources. In: Proceedings of the MMIES Workshop, International Conference on Recent Advances in Natural Language Processing (RANLP 2007), Citeseer (2007)

  43. Yangarber, R., Jokipii, L., Rauramo, A., Huttunen, S.: Extracting information about outbreaks of infectious epidemics. In: Proceedings of HLT/EMNLP 2005 Interactive Demonstrations, pp. 22–23 (2005)

  44. Zou, B., Xu, Z., Hong, Y., Zhou, G.: Adversarial feature adaptation for cross-lingual relation classification. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 437–448 (2018)

Acknowledgements

This work has been supported by the ANNA (2019-1R40226), TERMITRAD (2020-2019-8510010) and PYPA (2021-2021-12263410) projects funded by the Nouvelle-Aquitaine Region, France. It has also been supported by the French Embassy in Kenya and the French Foreign Ministry.

Author information

Corresponding author

Correspondence to Stephen Mutuvi.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mutuvi, S., Boros, E., Doucet, A., Lejeune, G., Jatowt, A., Odeo, M. (2023). Analyzing the Impact of Tokenization on Multilingual Epidemic Surveillance in Low-Resource Languages. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14189. Springer, Cham. https://doi.org/10.1007/978-3-031-41682-8_2

  • DOI: https://doi.org/10.1007/978-3-031-41682-8_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41681-1

  • Online ISBN: 978-3-031-41682-8

  • eBook Packages: Computer Science, Computer Science (R0)
