research-article

Deep Learning-based POS Tagger and Chunker for Odia Language Using Pre-trained Transformers

Published: 08 February 2024

Abstract

Developing effective natural language processing (NLP) tools for low-resource languages poses significant challenges. This article focuses on part-of-speech (POS) tagging and chunking, which identify and categorize linguistic units within sentences. POS tagging and chunking have produced strong results for English and other European languages. For Indian languages, and Odia in particular, they remain underexplored because of the lack of supporting tools and resources and the language's complex morphology. This study presents a manually annotated dataset for Odia phrase chunking and a deep learning-based model tailored to the distinctive properties of the language. The Odia chunking corpus was annotated with inside-outside-begin (IOB) labels drawn from a purpose-designed Odia chunking tagset. We use this dataset to build an Odia chunker based on state-of-the-art deep learning architectures. We investigate Recurrent Neural Networks, Convolutional Neural Networks, and transformer-based models to determine the most effective approach for Odia POS tagging and chunking. In addition, we experiment with diverse input representations, including Odia word embeddings, character-level representations, and sub-word units, to capture the complex linguistic characteristics of Odia. We evaluate the Odia POS tagger and chunker using standard evaluation metrics and compare them with existing approaches. The results show that our transformer-based tagger and chunker achieve superior accuracy and robustness in identifying and categorizing POS tags and chunks within Odia sentences. They outperform existing work and perform consistently across diverse linguistic contexts and sentence structures. The developed Odia POS tagger and chunker have considerable potential for NLP applications such as information extraction, syntactic parsing, and machine translation for the low-resource Odia language. This work contributes to the development of NLP tools and technologies for low-resource languages, facilitating enhanced language processing capabilities in various linguistic contexts.
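
The inside-outside-begin labelling mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the common IOB2 variant, in which every chunk starts with a `B-` tag; the abstract does not say which variant the authors used. The chunk labels `NP` and `VP` and the helper names `spans_to_iob` and `iob_to_spans` are hypothetical placeholders, not the paper's Odia chunking tagset or code.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, chunk_label)


def spans_to_iob(n_tokens: int, spans: List[Span]) -> List[str]:
    """Turn chunk spans into one IOB2 tag per token (assumed variant)."""
    tags = ["O"] * n_tokens  # tokens outside any chunk stay "O"
    for start, end, label in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"span out of range: {(start, end, label)}")
        tags[start] = f"B-{label}"  # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the chunk
    return tags


def iob_to_spans(tags: List[str]) -> List[Span]:
    """Recover chunk spans from IOB2 tags; a stray I- tag opens a new chunk."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a final chunk
        # Close the open chunk when the current tag does not continue it.
        if start is not None and tag != f"I-{label}":
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


# Hypothetical five-token sentence with two chunks and one outside token.
example_spans = [(0, 2, "NP"), (3, 5, "VP")]
example_tags = spans_to_iob(5, example_spans)
print(example_tags)
print(iob_to_spans(example_tags))
```

Framing chunking as per-token tags in this way is what allows the same sequence-labelling models to be applied to both POS tagging and chunking.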


Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 2
February 2024
340 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3613556
Editor: Imed Zitouni

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 8 February 2024
• Online AM: 15 December 2023
• Accepted: 14 December 2023
• Revised: 12 December 2023
• Received: 4 October 2023