Named Entity Recognition in Bengali and Hindi Using MuRIL and Conditional Random Fields

  • Original Research
  • Published in SN Computer Science

Abstract

Named Entity Recognition (NER) is a sequence labelling task in Natural Language Processing (NLP) that assigns words to a pre-established list of named entity classes. Although a variety of NER models have been proposed to date, researchers have yet to find a viable solution for poorly resourced, inflectional languages such as Bengali and Hindi. Additionally, many existing NER methods for low-resource Indian languages rely heavily on handcrafted features. This paper uses a multilingual neural language model called Multilingual Representations for Indian Languages (MuRIL) to extract deep features and Conditional Random Fields (CRF) for named entity tagging. We test the proposed model on two distinct Bengali NER datasets and a Hindi dataset. For Bengali dataset-1, the proposed model achieves a Message Understanding Conference (MUC) F1 score of 75.02%, which is approximately 3% higher than the MUC F1 score obtained by a recently published Bengali NER model. For Bengali dataset-2, the proposed model achieves a MUC F1 score of 84.19%. For the Hindi dataset, the proposed model achieves a MUC F1 score of 89.79% and outperforms some existing deep learning-based Hindi NER models. We have also evaluated the proposed model using the tag-level F1 score. For both the Bengali and Hindi datasets, the proposed model also outperforms the existing models in terms of the tag-level F1 score.
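
To make the tagging step concrete, the following is a minimal sketch of Viterbi decoding for a linear-chain CRF in plain Python. It is an illustration only and not the authors' implementation: the tag set, the emission scores (which in a system like the one described would come from an encoder such as MuRIL), and the transition scores are all invented for this example.

```python
from typing import List, Sequence

# Hypothetical tag set for illustration; the paper's datasets use their own tags.
TAGS = ["O", "B-PER", "I-PER"]


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag index sequence for a linear-chain CRF.

    emissions[t][j]   : score of tag j at token t (e.g. from an encoder).
    transitions[i][j] : score of moving from tag i to tag j.
    """
    n_tags = len(transitions)
    score = list(emissions[0])      # best score of any path ending in each tag
    backpointers = []               # best previous tag for each (token, tag)
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
        score = new_score
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Made-up scores for a three-token sentence (rows: tokens, columns: TAGS).
emissions = [
    [1.0, 0.8, 0.0],
    [0.2, 0.0, 1.5],
    [2.0, 0.0, 0.1],
]
# Made-up transition scores; O -> I-PER is heavily penalised as invalid.
transitions = [
    [0.0, 0.0, -1e4],   # from O
    [0.0, 0.0, 0.0],    # from B-PER
    [0.0, 0.0, 0.0],    # from I-PER
]

print([TAGS[i] for i in viterbi_decode(emissions, transitions)])
# prints ['B-PER', 'I-PER', 'O']
```

With these scores, picking the best tag for each token independently would give O, I-PER, O, which is not a valid tag sequence. Decoding over the whole sequence lets the transition scores rule that out, which is the usual motivation for placing a CRF layer on top of per-token features.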

Availability of Data and Materials

We used three datasets, which are available at the following links: https://github.com/MISabic/NER-Bangla-Dataset, https://github.com/Rifat1493/Bengali-NER, and https://ltrc.iiit.ac.in/icon/2013/nlptools/.

Code Availability

The proposed model was implemented with custom code.

Notes

  1. https://pypi.org/project/tf2crf.

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Contributions

Conceptualization: Kamal Sarkar. Methodology: Kaushik Bose, Kamal Sarkar. Software: Kaushik Bose. Validation: Kaushik Bose, Kamal Sarkar. Formal analysis: Kamal Sarkar, Kaushik Bose. Investigation: Kaushik Bose, Kamal Sarkar. Resources: Kaushik Bose. Data Curation: Kaushik Bose. Writing—Original Draft: Kaushik Bose, Kamal Sarkar. Writing—Review and Editing: Kaushik Bose, Kamal Sarkar. Visualization: Kaushik Bose. Supervision: Kamal Sarkar. Project administration: Kamal Sarkar.

Corresponding author

Correspondence to Kamal Sarkar.

Ethics declarations

Conflict of Interest

The authors have no conflict of interest.

Ethics Approval

Not applicable.

Research Involving Human Participants and/or Animals

During research, no human participants or animals were involved.

Informed Consent

This article does not involve any studies with human participants or animals performed by any of the authors.

Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bose, K., Sarkar, K. Named Entity Recognition in Bengali and Hindi Using MuRIL and Conditional Random Fields. SN COMPUT. SCI. 5, 856 (2024). https://doi.org/10.1007/s42979-024-03211-7

  • DOI: https://doi.org/10.1007/s42979-024-03211-7