research-article

Deep Learning-based POS Tagger and Chunker for Odia Language Using Pre-trained Transformers

Published: 08 February 2024

Abstract

Developing effective natural language processing (NLP) tools for low-resource languages poses significant challenges. This article focuses on part-of-speech (POS) tagging and chunking, which identify and categorize linguistic units within sentences. POS tagging and chunking have produced strong results for English and other European languages. For Indian languages, and Odia in particular, they remain underexplored because of the lack of supporting tools and resources and the language's complex morphology. This study presents a manually annotated dataset for the Odia phrase chunking task and a deep learning-based model tailored to the distinctive properties of the language. The Odia chunking corpus was annotated with inside-outside-begin (IOB) labels drawn from a chunking tagset designed for Odia. We use the constructed dataset to build an Odia chunker based on deep learning techniques, employing state-of-the-art architectures. Several techniques, including recurrent neural networks, convolutional neural networks, and transformer-based models, are investigated to determine the most effective approach for Odia POS tagging and chunking. In addition, we experiment with diverse input representations, including Odia word embeddings, character-level representations, and sub-word units, to capture the complex linguistic characteristics of Odia. We evaluate the Odia POS tagger and chunker in numerous experiments, using standard evaluation metrics and comparing against existing approaches. The results demonstrate that our transformer-based tagger and chunker achieve superior accuracy and robustness in identifying and categorizing POS tags and chunks within Odia sentences. They outperform existing work and exhibit consistent performance across diverse linguistic contexts and sentence structures. The developed Odia POS tagger and chunker have considerable potential for a variety of NLP applications for the low-resource Odia language, including information extraction, syntactic parsing, and machine translation. This work contributes to developing NLP tools and technologies for low-resource languages, thereby facilitating enhanced language processing capabilities in various linguistic contexts.
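
The inside-outside-begin labeling mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' annotation pipeline: the chunk labels `NP` and `VP`, the English example sentence, and the helper names `chunks_to_iob` and `iob_to_chunks` are assumptions chosen for illustration, and the actual Odia chunking tagset is not reproduced here.

```python
from typing import List, Tuple

# A chunk is (label, tokens). The label "O" marks a token outside any chunk.
# The labels "NP" and "VP" below are illustrative, not the paper's tagset.
Chunk = Tuple[str, List[str]]


def chunks_to_iob(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Flatten labeled chunks into (token, tag) pairs using B-/I-/O tags."""
    tagged = []
    for label, tokens in chunks:
        for i, token in enumerate(tokens):
            if label == "O":
                tagged.append((token, "O"))
            else:
                # The first token of a chunk gets B-, the rest get I-.
                prefix = "B" if i == 0 else "I"
                tagged.append((token, f"{prefix}-{label}"))
    return tagged


def iob_to_chunks(tagged: List[Tuple[str, str]]) -> List[Chunk]:
    """Group (token, tag) pairs back into chunks; each O token stands alone."""
    chunks: List[Chunk] = []
    for token, tag in tagged:
        if tag == "O":
            chunks.append(("O", [token]))
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "I" and chunks and chunks[-1][0] == label:
            # Continue the chunk that is currently open.
            chunks[-1][1].append(token)
        else:
            # B- always opens a new chunk; a stray I- is treated the same way.
            chunks.append((label, [token]))
    return chunks


# Hypothetical example sentence, in English for readability.
sentence = [
    ("NP", ["the", "small", "boy"]),
    ("VP", ["is", "reading"]),
    ("NP", ["a", "book"]),
    ("O", ["."]),
]
tagged = chunks_to_iob(sentence)
for token, tag in tagged:
    print(f"{token}\t{tag}")
```

With this encoding, chunking becomes a per-token sequence labeling problem, so the same kind of tagging model used for POS tagging can in principle be applied to it.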

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 2
February 2024
340 pages
EISSN: 2375-4702
DOI: 10.1145/3613556

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

Published: 08 February 2024
Online AM: 15 December 2023
Accepted: 14 December 2023
Revised: 12 December 2023
Received: 04 October 2023
Published in TALLIP Volume 23, Issue 2

Author Tags

1. Part of speech (POS)
2. chunking
3. low-resource language
4. deep learning
5. transformers
