MACEDONIZER - The Macedonian Transformer Language Model

  • Conference paper
  • In: ICT Innovations 2022. Reshaping the Future Towards a New Normal (ICT Innovations 2022)

Abstract

Contextualized language models are becoming omnipresent in the field of Natural Language Processing (NLP). Their representation-learning capabilities yield dominant results on almost all downstream NLP tasks. The main challenge low-resource languages face is the lack of language-specific language models, since the pre-training process requires substantial computing capacity and large volumes of textual data. This paper describes our efforts to pre-train MACEDONIZER, the first contextual language model for Macedonian, on a 6.5 GB corpus of Macedonian texts crawled from public web domains and Wikipedia. We then evaluate the pre-trained model on three downstream tasks: Sentiment Analysis (SA), Natural Language Inference (NLI) and Named Entity Recognition (NER). The evaluation results are compared to those of the cross-lingual version of the RoBERTa model, XLM-RoBERTa. The results show that MACEDONIZER achieves state-of-the-art results on all downstream tasks. Finally, the pre-trained MACEDONIZER model is made freely available for use and further task-specific fine-tuning via HuggingFace.
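Since the pre-trained checkpoint is published on HuggingFace (see Note 3 below), a minimal loading sketch with the transformers library might look as follows. The model ID comes from the paper's footnote; the masked-language-modeling head and the Macedonian example prompt are illustrative assumptions, and a task-specific head (e.g. sequence classification for SA) would be loaded analogously.

```python
# Minimal sketch: load the published Macedonian RoBERTa checkpoint
# (model ID taken from the paper's footnote) and query it with fill-mask.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

MODEL_ID = "macedonizer/mk-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# Illustrative prompt: "Skopje is the <mask> of Macedonia."
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask(f"Скопје е {tokenizer.mask_token} на Македонија."):
    print(prediction["token_str"], prediction["score"])
```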

Notes

  1. https://huggingface.co/datasets/wikiann (loaded in the sketch after this list).

  2. https://huggingface.co/datasets/snli (loaded in the sketch after this list).

  3. https://huggingface.co/macedonizer/mk-roberta-base.
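
For reference, the two evaluation datasets linked in Notes 1 and 2 can be pulled with the HuggingFace datasets library, as in the sketch below. The "mk" configuration name for the Macedonian WikiAnn split is an assumption based on the hub's language codes and is not stated on this page.

```python
# Sketch: fetch the evaluation datasets referenced in the footnotes above.
from datasets import load_dataset

# NER: WikiAnn; the Macedonian split is assumed to live under the "mk" config.
wikiann_mk = load_dataset("wikiann", "mk")

# NLI: SNLI, as linked in Note 2 (the paper's exact NLI setup is not shown here).
snli = load_dataset("snli")

print(wikiann_mk)           # train/validation/test splits with NER tags
print(snli["train"][0])     # premise, hypothesis, label
```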

Acknowledgement

The work in this paper was partially financed by the Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje.

Author information

Corresponding author

Correspondence to Jovana Dobreva.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Dobreva, J. et al. (2022). MACEDONIZER - The Macedonian Transformer Language Model. In: Zdravkova, K., Basnarkov, L. (eds) ICT Innovations 2022. Reshaping the Future Towards a New Normal. ICT Innovations 2022. Communications in Computer and Information Science, vol 1740. Springer, Cham. https://doi.org/10.1007/978-3-031-22792-9_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-22792-9_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-22791-2

  • Online ISBN: 978-3-031-22792-9

  • eBook Packages: Computer Science, Computer Science (R0)
