
Enriching Urdu NER with BERT Embedding, Data Augmentation, and Hybrid Encoder-CNN Architecture

Published: 15 April 2024

Abstract

Named Entity Recognition (NER) is an indispensable component of Natural Language Processing (NLP), which aims to identify and classify entities within text data. While Deep Learning (DL) models have excelled in NER for well-resourced languages such as English, Spanish, and Chinese, they face significant hurdles when dealing with low-resource languages such as Urdu. These challenges stem from the intricate linguistic characteristics of Urdu, including morphological diversity, a context-dependent lexicon, and the scarcity of training data. This study addresses these issues by focusing on Urdu Named Entity Recognition (U-NER) and introducing three key contributions. First, various pre-trained embedding methods are employed, encompassing Word2vec (W2V), GloVe, FastText, Bidirectional Encoder Representations from Transformers (BERT), and Embeddings from Language Models (ELMo). In particular, fine-tuning is performed on BERT-Base and ELMo using Urdu Wikipedia and news articles. Second, a novel generative Data Augmentation (DA) technique replaces Named Entities (NEs) with mask tokens, employing pre-trained masked language models to predict masked tokens, effectively expanding the training dataset. Finally, the study introduces a novel hybrid model combining a Transformer Encoder with a Convolutional Neural Network (CNN) to capture the intricate morphology of Urdu. These modules enable the model to handle polysemy, extract short- and long-range dependencies, and enhance learning capacity. Empirical experiments demonstrate that the proposed model, incorporating BERT embeddings and an innovative DA approach, attains the highest F1-score of 93.99%, highlighting its efficacy for the U-NER task.
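The mask-and-predict data augmentation described above can be illustrated with a minimal sketch. The function names and the toy predictor below are hypothetical, not taken from the paper; in the authors' pipeline the predictor would be a pre-trained masked language model (e.g., the fine-tuned Urdu BERT) proposing fill-ins for each masked entity token, while the original BIO labels are kept for the new sentence.

```python
from typing import Callable, List, Tuple

# A token is a (word, BIO tag) pair, e.g. ("Lahore", "B-LOC").
Token = Tuple[str, str]

def mask_entities(sentence: List[Token], mask_token: str = "[MASK]") -> List[Token]:
    """Replace every entity token (tag != 'O') with the mask token."""
    return [(mask_token, tag) if tag != "O" else (word, tag)
            for word, tag in sentence]

def augment(sentence: List[Token],
            predict: Callable[[List[str], int], str]) -> List[Token]:
    """Build one augmented sentence: fill each masked position with the
    predictor's proposed token; the BIO tags are carried over unchanged."""
    masked = mask_entities(sentence)
    words = [w for w, _ in masked]
    out: List[Token] = []
    for i, (word, tag) in enumerate(masked):
        out.append((predict(words, i), tag) if word == "[MASK]" else (word, tag))
    return out
```

With a real masked language model, `predict` would return the model's top-ranked token for position `i` given the masked context, so each original labeled sentence yields several label-consistent variants.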


Cited By

  • (2025) Semantic relationship extraction of English long sentences and quality optimization of machine translation based on BERT model. Journal of Computational Methods in Sciences and Engineering. DOI: 10.1177/14727978251322656. Online publication date: 4 March 2025.
  • (2024) Leveraging Hybrid Adaptive Sine Cosine Algorithm with Deep Learning for Arabic Poem Meter Detection. ACM Transactions on Asian and Low-Resource Language Information Processing. DOI: 10.1145/3676963. Online publication date: 10 July 2024.


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 4
    April 2024, 221 pages
    EISSN: 2375-4702
    DOI: 10.1145/3613577

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 April 2024
    Online AM: 15 February 2024
    Accepted: 02 February 2024
    Revised: 01 February 2024
    Received: 01 November 2023
    Published in TALLIP Volume 23, Issue 4


    Author Tags

    1. Urdu
    2. Named Entity Recognition
    3. low-resource languages
    4. Asian languages

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Program of China
    • Key Research and Development Program of Yunnan Province

