Research article, DOI: 10.1145/3477314.3507066

Named-entity recognition for a low-resource language using pre-trained language model

Published: 06 May 2022

Abstract

This paper proposes a method for Named-Entity Recognition (NER) in a low-resource language, Tigrinya, using a pre-trained language model. Tigrinya is a morphologically rich language, yet it remains underrepresented in NLP, mainly because of the limited amount of annotated data available. To address this problem, we introduce the first publicly available NER dataset for Tigrinya. The dataset contains 69,309 tokens that were manually annotated following the CoNLL 2003 Beginning, Inside, and Outside (BIO) tagging scheme. We also develop a new pre-trained language model for Tigrinya based on RoBERTa, which we refer to as TigRoBERTa. First, it is trained on an unlabeled Tigrinya corpus using Masked Language Modeling (MLM). Then, we validate TigRoBERTa by fine-tuning it on two downstream tasks: NER and Part-of-Speech (POS) tagging. The experimental results show that the method achieves an F1-score of 81.05% for NER and an accuracy of 92% for POS tagging, which is better than or comparable to the baseline CNN-BiLSTM-CRF model.
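To make the BIO scheme and the reported NER metric concrete, the following is a minimal sketch in Python. It assumes CoNLL-style exact-match, entity-level F1; the abstract does not state how the F1-score was computed, so this convention is an assumption. The function names (`bio_to_spans`, `entity_f1`) and the toy tag sequences are hypothetical and do not come from the paper or its dataset.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence.

    A span opens at a B- tag and continues over I- tags of the same type.
    Stray I- tags with no open span are ignored (a strict reading of BIO).
    """
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Exact-match entity-level F1 over a list of tagged sentences."""
    gold_spans, pred_spans = set(), set()
    for sent_id, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= {(sent_id, *s) for s in bio_to_spans(g)}
        pred_spans |= {(sent_id, *s) for s in bio_to_spans(p)}
    true_pos = len(gold_spans & pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_spans)
    recall = true_pos / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction finds the person span but misses the location.
gold_tags = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_tags = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold_tags, pred_tags)
```

In this toy case precision is 1.0 and recall is 0.5, so the score is penalized for the missed entity even though three of the four token tags agree. That gap between token-level agreement and entity-level F1 is why the two metrics in the abstract (F1 for NER, accuracy for POS tagging) are not directly comparable.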

    Published In

    SAC '22: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing
    April 2022
    2099 pages
    ISBN:9781450387132
    DOI:10.1145/3477314
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. POS tagging
    2. RoBERTa
    3. low-resource language
    4. named entity recognition
    5. pre-trained language model

    Acceptance Rates

    Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

    Cited By

    • (2024) A Large Language Model Approach to Detect Hate Speech in Political Discourse Using Multiple Language Corpora. Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing (1461-1468). DOI: 10.1145/3605098.3635964. Online publication date: 8-Apr-2024
    • (2024) Give us the Facts: Enhancing Large Language Models With Knowledge Graphs for Fact-Aware Language Modeling. IEEE Transactions on Knowledge and Data Engineering 36:7 (3091-3110). DOI: 10.1109/TKDE.2024.3360454. Online publication date: 31-Jan-2024
    • (2024) CAMELON: A System for Crime Metadata Extraction and Spatiotemporal Visualization From Online News Articles. IEEE Access 12 (22778-22802). DOI: 10.1109/ACCESS.2024.3363879. Online publication date: 2024
    • (2024) Large language models for medicine: a survey. International Journal of Machine Learning and Cybernetics 16:2 (1015-1040). DOI: 10.1007/s13042-024-02318-w. Online publication date: 19-Aug-2024
    • (2024) A transformer-based approach to Nigerian Pidgin text generation. International Journal of Speech Technology 27:4 (1027-1037). DOI: 10.1007/s10772-024-10136-2. Online publication date: 1-Dec-2024
    • (2024) Sentiment Analysis on Service Quality of an Online Healthcare Mobile Platform Using VADER and Roberta Pretrained Model. Proceedings of the 4th International Conference on Electronics, Biomedical Engineering, and Health Informatics (383-394). DOI: 10.1007/978-981-97-1463-6_26. Online publication date: 28-Apr-2024
    • (2024) Semi-supervised Named Entity Recognition for Low-Resource Languages Using Dual PLMs. Natural Language Processing and Information Systems (166-180). DOI: 10.1007/978-3-031-70239-6_12. Online publication date: 25-Jun-2024
    • (2023) Self-Attention-based Data Augmentation Method for Text Classification. Proceedings of the 2023 15th International Conference on Machine Learning and Computing (239-244). DOI: 10.1145/3587716.3587779. Online publication date: 17-Feb-2023
    • (2023) Recent Progress on Named Entity Recognition Based on Pre-trained Language Models. 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI) (799-804). DOI: 10.1109/ICTAI59109.2023.00122. Online publication date: 6-Nov-2023
    • (2023) Long Text Classification Using Pre-trained Language Model for a Low-Resource Language. 2023 6th International Conference on Information and Computer Technologies (ICICT) (115-120). DOI: 10.1109/ICICT58900.2023.00026. Online publication date: Mar-2023