MACEDONIZER - The Macedonian Transformer Language Model

  • Conference paper
  • In: ICT Innovations 2022. Reshaping the Future Towards a New Normal (ICT Innovations 2022)

Abstract

Contextualized language models are becoming omnipresent in the field of Natural Language Processing (NLP). Their representation-learning capabilities yield dominant results on almost all downstream NLP tasks. The main challenge low-resource languages face is the lack of language-specific language models, since the pre-training process requires substantial computing capacity and large volumes of textual data. This paper describes our efforts to pre-train MACEDONIZER, the first contextual language model for Macedonian, on a 6.5 GB corpus of Macedonian texts crawled from public web domains and Wikipedia. We then evaluate the pre-trained model on three downstream tasks: Sentiment Analysis (SA), Natural Language Inference (NLI) and Named Entity Recognition (NER). The evaluation results are compared to those of the cross-lingual version of the RoBERTa model, XLM-RoBERTa. The results show that MACEDONIZER achieves state-of-the-art results on all downstream tasks. Finally, the pre-trained MACEDONIZER model is made freely available for use and further task-specific fine-tuning via HuggingFace.
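Since the pre-trained checkpoint is published on HuggingFace (see Note 3 below), a minimal loading sketch with the transformers library might look as follows. The model ID comes from the paper's footnote; the masked-language-modeling head and the Macedonian example prompt are illustrative assumptions, and a task-specific head (e.g. sequence classification for SA) would be loaded analogously.

```python
# Minimal sketch: load the published Macedonian RoBERTa checkpoint
# (model ID taken from the paper's footnote) and query it with fill-mask.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

MODEL_ID = "macedonizer/mk-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# Illustrative prompt: "Skopje is the <mask> of Macedonia."
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask(f"Скопје е {tokenizer.mask_token} на Македонија."):
    print(prediction["token_str"], prediction["score"])
```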

Notes

  1. https://huggingface.co/datasets/wikiann (loaded in the sketch after this list).

  2. https://huggingface.co/datasets/snli (loaded in the sketch after this list).

  3. https://huggingface.co/macedonizer/mk-roberta-base.
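
For reference, the two evaluation datasets linked in Notes 1 and 2 can be pulled with the HuggingFace datasets library, as in the sketch below. The "mk" configuration name for the Macedonian WikiAnn split is an assumption based on the hub's language codes and is not stated on this page.

```python
# Sketch: fetch the evaluation datasets referenced in the footnotes above.
from datasets import load_dataset

# NER: WikiAnn; the Macedonian split is assumed to live under the "mk" config.
wikiann_mk = load_dataset("wikiann", "mk")

# NLI: SNLI, as linked in Note 2 (the paper's exact NLI setup is not shown here).
snli = load_dataset("snli")

print(wikiann_mk)           # train/validation/test splits with NER tags
print(snli["train"][0])     # premise, hypothesis, label
```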

Acknowledgement

The work in this paper was partially financed by the Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje.

Author information

Corresponding author

Correspondence to Jovana Dobreva.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Dobreva, J. et al. (2022). MACEDONIZER - The Macedonian Transformer Language Model. In: Zdravkova, K., Basnarkov, L. (eds) ICT Innovations 2022. Reshaping the Future Towards a New Normal. ICT Innovations 2022. Communications in Computer and Information Science, vol 1740. Springer, Cham. https://doi.org/10.1007/978-3-031-22792-9_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-22792-9_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-22791-2

  • Online ISBN: 978-3-031-22792-9

  • eBook Packages: Computer Science, Computer Science (R0)
