Abstract
Named Entity Recognition (NER) is a sequence labelling task in Natural Language Processing (NLP) that assigns words to a pre-established list of named entity classes. Although a variety of NER models have been proposed to date, researchers have yet to find a viable solution for poorly resourced, inflectional languages such as Bengali and Hindi. Additionally, many existing NER methods for low-resource Indian languages rely heavily on handcrafted features. This paper uses a multilingual neural language model called Multilingual Representations for Indian Languages (MuRIL) to extract deep features and Conditional Random Fields (CRF) for named entity tagging. We test the proposed model on two distinct Bengali NER datasets and one Hindi dataset. On Bengali dataset-1, the proposed model achieves a Message Understanding Conference (MUC) F1 score of 75.02%, which is approximately 3% higher than the MUC F1 score obtained by a recently published Bengali NER model. On Bengali dataset-2, it achieves a MUC F1 score of 84.19%. On the Hindi dataset, it achieves a MUC F1 score of 89.79%, outperforming some existing deep learning-based Hindi NER models. We also evaluated the proposed model using the tag-level F1 score; on both the Bengali and Hindi datasets, it outperforms the existing models on this metric as well.
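To make the tagging step concrete, the sketch below shows generic Viterbi decoding for a linear-chain model of the kind a CRF layer uses at inference time. It is an illustration under stated assumptions, not the authors' implementation: the tag set, the emission scores (which in a MuRIL-based system would presumably come from a projection of the encoder outputs) and the transition scores are all invented for the example.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][tag]    : per-token score for `tag` at position t
    transitions[a][b]    : score for moving from tag `a` to tag `b`
    """
    if not emissions:
        return []

    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []  # back[t-1][tag] = best previous tag

    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy inputs (not taken from the paper).
TAGS = ["O", "B-PER", "I-PER"]
TRANSITIONS = {a: {b: 0.0 for b in TAGS} for a in TAGS}
TRANSITIONS["O"]["I-PER"] = -10.0  # discourage I-PER directly after O

EMISSIONS = [
    {"O": 1.0, "B-PER": 0.9, "I-PER": 0.0},
    {"O": 0.5, "B-PER": 0.4, "I-PER": 2.0},
]

print(viterbi_decode(EMISSIONS, TRANSITIONS, TAGS))  # ['B-PER', 'I-PER']
```

Picking the best tag for each token independently would give `O` followed by `I-PER` here, which is not a valid span. Scoring whole sequences with transition terms is what lets a CRF-style decoder prefer `B-PER`, `I-PER` instead.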











Availability of Data and Materials
We used three datasets, which are available at the following links: https://github.com/MISabic/NER-Bangla-Dataset, https://github.com/Rifat1493/Bengali-NER, and https://ltrc.iiit.ac.in/icon/2013/nlptools/.
Code Availability
The proposed model was implemented with custom code.
Funding
No funding was received to assist with the preparation of this manuscript.
Author information
Contributions
Conceptualization: Kamal Sarkar. Methodology: Kaushik Bose, Kamal Sarkar. Software: Kaushik Bose. Validation: Kaushik Bose, Kamal Sarkar. Formal analysis: Kamal Sarkar, Kaushik Bose. Investigation: Kaushik Bose, Kamal Sarkar. Resources: Kaushik Bose. Data Curation: Kaushik Bose. Writing—Original Draft: Kaushik Bose, Kamal Sarkar. Writing—Review and Editing: Kaushik Bose, Kamal Sarkar. Visualization: Kaushik Bose. Supervision: Kamal Sarkar. Project administration: Kamal Sarkar.
Ethics declarations
Conflict of Interest
The authors have no conflict of interest.
Ethics Approval
Not applicable.
Research Involving Human Participants and/or Animals
During research, no human participants or animals were involved.
Informed Consent
This article does not involve any studies with human participants or animals performed by any of the authors.
Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
About this article
Cite this article
Bose, K., Sarkar, K. Named Entity Recognition in Bengali and Hindi Using MuRIL and Conditional Random Fields. SN COMPUT. SCI. 5, 856 (2024). https://doi.org/10.1007/s42979-024-03211-7