research-article

Named-entity recognition for a low-resource language using pre-trained language model

Authors: Hailemariam Mehari Yohannes, Toshiyuki Amagasa

SAC '22: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing

Pages 837 - 844

https://doi.org/10.1145/3477314.3507066

Published: 06 May 2022

Abstract

This paper proposes a method for Named-Entity Recognition (NER) for a low-resource language, Tigrinya, using a pre-trained language model. Tigrinya is a morphologically rich language, yet it remains underrepresented in the field of NLP, mainly because of the limited amount of annotated data available. To address this problem, we introduce the first publicly available NER dataset for Tigrinya. The dataset contains 69,309 tokens that were manually annotated following the CoNLL 2003 Beginning, Inside, and Outside (BIO) tagging scheme. We also develop a new pre-trained language model for Tigrinya based on RoBERTa, which we refer to as TigRoBERTa. First, it is pre-trained on an unlabeled Tigrinya corpus using Masked Language Modeling (MLM). Then, we validate TigRoBERTa by fine-tuning it on two downstream tasks, namely NER and Part-of-Speech (POS) tagging. The experimental results show that the method achieves an F1-score of 81.05% for NER and an accuracy of 92% for POS tagging, which is better than or comparable to the baseline method based on the CNN-BiLSTM-CRF model.
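To make the BIO scheme and the reported F1-score more concrete, the following is a minimal Python sketch, not the authors' code. It assumes entity-level, exact-match scoring in the style of the CoNLL evaluation (the abstract does not say how F1 was computed), and the label names `PER` and `LOC`, the function names, and the toy tag sequences are all hypothetical.

```python
from typing import List, Optional, Set, Tuple

# (entity type, start index, end index exclusive); illustrative type alias
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Assumption: a stray I- tag with no matching B- opens a new span.
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: a predicted span counts only if type and bounds match."""
    g, p = bio_to_spans(gold), bio_to_spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


# Toy example: four tokens, two gold entities, one of them predicted.
gold_tags = ["B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "I-PER", "O", "O"]
print(bio_to_spans(gold_tags))
print(round(span_f1(gold_tags, pred_tags), 4))
```

In this toy case precision is 1.0 and recall is 0.5, so the F1-score is 2/3. Scoring at the span level rather than per token means a partially recovered entity earns no credit, which is the stricter convention usually associated with CoNLL-style NER evaluation.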

Index Terms

Named-entity recognition for a low-resource language using pre-trained language model
1. Information systems
  1. Information retrieval
    1. Document representation

Published In

SAC '22: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing

April 2022

2099 pages

ISBN: 9781450387132

DOI: 10.1145/3477314

Conference Chairs: Jiman Hong (Soongsil University), Miroslav Bures (Czech Technical University, Czechia)

Program Chairs: Juw Won Park (University of Louisville), Tomas Cerny (Baylor University)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SAC '22: The 37th ACM/SIGAPP Symposium on Applied Computing

April 25 - 29, 2022

Virtual Event

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
