Abstract
Contextualized language models have become ubiquitous in Natural Language Processing (NLP), and their representation-learning capabilities yield leading results on almost all downstream NLP tasks. The main challenge for low-resource languages is the lack of language-specific models, since pre-training them requires substantial computing power and large amounts of textual data. This paper describes our efforts to pre-train MACEDONIZER, the first contextual language model for Macedonian, on a 6.5 GB corpus of Macedonian text crawled from public web domains and Wikipedia. We then evaluate the pre-trained model on three downstream tasks: Sentiment Analysis (SA), Natural Language Inference (NLI) and Named Entity Recognition (NER), and compare the results against the cross-lingual variant of RoBERTa, XLM-RoBERTa. MACEDONIZER achieves state-of-the-art results on all three tasks. Finally, the pre-trained MACEDONIZER model is made freely available for use and further task-specific fine-tuning via HuggingFace.
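Since the abstract states that the pre-trained checkpoint is released for free use and further task-specific fine-tuning via HuggingFace, the sketch below shows how such a checkpoint would typically be loaded and probed with the transformers library. The repository identifier macedonizer/mk-roberta-base, the example Macedonian sentence, and the two-label sentiment head are assumptions for illustration; they are not given in this excerpt, so the published model id on the HuggingFace Hub should be checked before use.

```python
# Minimal sketch: load a released MACEDONIZER-style checkpoint from the
# HuggingFace Hub, probe it with the masked-language-model objective, and
# attach a classification head for downstream fine-tuning.
# NOTE: "macedonizer/mk-roberta-base" is a placeholder repository id assumed
# for this example; it is not confirmed by the paper excerpt.
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    pipeline,
)

MODEL_ID = "macedonizer/mk-roberta-base"  # assumed repo id

# Masked-language-model probe: the pre-training objective of a RoBERTa-style model.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
fill = pipeline("fill-mask", model=mlm, tokenizer=tokenizer)
print(fill(f"Скопје е {tokenizer.mask_token} на Македонија."))

# For a downstream task such as sentiment analysis, the same checkpoint can be
# re-loaded with a (randomly initialized) classification head and fine-tuned,
# e.g. with the Trainer API on a labeled Macedonian dataset.
clf = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)
```

The same pattern applies to the NLI and NER evaluations described above, by swapping in AutoModelForSequenceClassification with three labels or AutoModelForTokenClassification, respectively.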
References
Al-Garadi, M.A., et al.: Text classification models for the automatic detection of nonmedical prescription medication use from social media. BMC Med. Inform. Decis. Mak. 21(1), 1–13 (2021)
Antoun, W., Baly, F., Hajj, H.: AraBERT: transformer-based model for Arabic language understanding. In: Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, Marseille, France, pp. 9–15. European Language Resource Association (2020). https://aclanthology.org/2020.osact-1.2
Araci, D.: FinBERT: financial sentiment analysis with pre-trained language models. CoRR abs/1908.10063 (2019). http://arxiv.org/abs/1908.10063
Arkhipov, M., Trofimova, M., Kuratov, Y., Sorokin, A.: Tuning multilingual transformers for language-specific named entity recognition. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pp. 89–93 (2019)
Brown, T.B., et al.: Language models are few-shot learners. CoRR abs/2005.14165 (2020). https://arxiv.org/abs/2005.14165
Carmo, D., Piau, M., Campiotti, I., Nogueira, R., de Alencar Lotufo, R.: PTT5: pretraining and validating the T5 model on Brazilian Portuguese data. CoRR abs/2008.09144 (2020). https://arxiv.org/abs/2008.09144
Chung, H.W., Garrette, D., Tan, K.C., Riesa, J.: Improving multilingual models with language-clustered vocabularies. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4536–4546. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.emnlp-main.367. https://aclanthology.org/2020.emnlp-main.367
Clark, K., Luong, M., Le, Q.V., Manning, C.D.: ELECTRA: pre-training text encoders as discriminators rather than generators. CoRR abs/2003.10555 (2020). https://arxiv.org/abs/2003.10555
Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. CoRR abs/1911.02116 (2019). http://arxiv.org/abs/1911.02116
Dadas, S., Perełkiewicz, M., Poświata, R.: Pre-training polish transformer-based language models at scale. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds.) ICAISC 2020. LNCS (LNAI), vol. 12416, pp. 301–314. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61534-5_27
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018). http://arxiv.org/abs/1810.04805
Farahani, M., Gharachorloo, M., Farahani, M., Manthouri, M.: ParsBERT: transformer-based model for Persian language understanding. CoRR abs/2005.12515 (2020). https://arxiv.org/abs/2005.12515
He, P., Liu, X., Gao, J., Chen, W.: DeBERTa: decoding-enhanced BERT with disentangled attention. CoRR abs/2006.03654 (2020). https://arxiv.org/abs/2006.03654
Lample, G., Conneau, A.: Cross-lingual language model pretraining. CoRR abs/1901.07291 (2019). http://arxiv.org/abs/1901.07291
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. CoRR abs/1909.11942 (2019). http://arxiv.org/abs/1909.11942
Le, H., et al.: FlauBERT: unsupervised language model pre-training for French. CoRR abs/1912.05372 (2019). http://arxiv.org/abs/1912.05372
Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR abs/1910.13461 (2019). http://arxiv.org/abs/1910.13461
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach (2019). https://doi.org/10.48550/ARXIV.1907.11692. https://arxiv.org/abs/1907.11692
Livinska, H.V., Makarevych, O.: Feasibility of improving BERT for linguistic prediction on Ukrainian corpus. In: COLINS (2020)
Ljubešić, N., Lauc, D.: Bertić-the transformer language model for Bosnian, Croatian, Montenegrin and Serbian. In: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, pp. 37–42 (2021)
MacCartney, B.: Natural Language Inference. Stanford University (2009)
Martin, L., et al.: CamemBERT: a tasty French language model. CoRR abs/1911.03894 (2019). http://arxiv.org/abs/1911.03894
Medhat, W., Hassan, A., Korashy, H.: Sentiment analysis algorithms and applications: a survey. Ain Shams Eng. J. 5(4), 1093–1113 (2014)
Mishev, K., Gjorgjevikj, A., Vodenska, I., Chitkushev, L.T., Trajanov, D.: Evaluation of sentiment analysis in finance: from lexicons to transformers. IEEE Access 8, 131662–131682 (2020)
Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), 3–26 (2007)
Pikuliak, M., et al.: SlovakBERT: Slovak masked language model. CoRR abs/2109.15254 (2021). https://arxiv.org/abs/2109.15254
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR abs/1910.10683 (2019). http://arxiv.org/abs/1910.10683
Rasmy, L., Xiang, Y., Xie, Z., Tao, C., Zhi, D.: Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit. Med. 4(1), 1–13 (2021)
Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108 (2019). http://arxiv.org/abs/1910.01108
Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Cerri, R., Prati, R.C. (eds.) BRACIS 2020. LNCS (LNAI), vol. 12319, pp. 403–417. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61377-8_28
Ulčar, M., Robnik-Šikonja, M.: Finest BERT and crosloengual BERT: less is more in multilingual models. CoRR abs/2006.07890 (2020). https://arxiv.org/abs/2006.07890
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
de Vries, W., Nissim, M.: As good as new. How to successfully recycle English GPT-2 to make models for other languages. CoRR abs/2012.05628 (2020). https://arxiv.org/abs/2012.05628
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Zhu, Y., et al.: Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27 (2015)
Acknowledgement
The work in this paper was partially financed by the Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dobreva, J. et al. (2022). MACEDONIZER - The Macedonian Transformer Language Model. In: Zdravkova, K., Basnarkov, L. (eds) ICT Innovations 2022. Reshaping the Future Towards a New Normal. ICT Innovations 2022. Communications in Computer and Information Science, vol 1740. Springer, Cham. https://doi.org/10.1007/978-3-031-22792-9_5
DOI: https://doi.org/10.1007/978-3-031-22792-9_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22791-2
Online ISBN: 978-3-031-22792-9
eBook Packages: Computer Science, Computer Science (R0)