Abstract
Developing effective natural language processing (NLP) tools for low-resource languages poses significant challenges. This article focuses on part-of-speech (POS) tagging and chunking, tasks that identify and categorize linguistic units within sentences. POS tagging and chunking have already produced positive results for English and other European languages. For Indian languages, and Odia in particular, they remain underexplored because of a lack of supporting tools and resources and because of the language's complex morphology. This study presents a manually annotated dataset for the Odia phrase chunking task and a deep learning-based model tailored to the distinctive properties of the language. The Odia chunking corpus was annotated with inside-outside-begin (IOB) labels, assigned using a purpose-designed Odia chunking tagset. We use the constructed dataset to build an Odia chunker based on deep learning techniques and state-of-the-art architectures. Several techniques, including Recurrent Neural Networks, Convolutional Neural Networks, and transformer-based models, are investigated to determine the most effective approach for Odia POS tagging and chunking. In addition, we experiment with diverse input representations, including Odia word embeddings, character-level representations, and sub-word units, to capture the complex linguistic characteristics of the Odia language. We evaluate the Odia POS tagger and chunker in numerous experiments, using standard evaluation metrics and comparing against existing approaches. The results demonstrate that our transformer-based tagger and chunker achieve superior accuracy and robustness in identifying and categorizing POS tags and chunks within Odia sentences. They outperform existing work and perform consistently across diverse linguistic contexts and sentence structures. The developed Odia POS tagger and chunker have considerable potential for a variety of NLP applications tailored to the low-resource Odia language, including information extraction, syntactic parsing, and machine translation. This work contributes to the development of NLP tools and technologies for low-resource languages, thereby facilitating enhanced language processing capabilities in various linguistic contexts.
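As an illustration of the inside-outside-begin labelling mentioned in the abstract, the following is a minimal sketch of how chunk spans can be encoded as per-token IOB tags and decoded again. It is not the authors' annotation tooling. The function names, the English placeholder sentence, and the chunk labels `NP`, `VP`, and `ADVP` are assumptions made for the example; the actual Odia chunking tagset is not given here.

```python
from typing import List, Tuple

# (start index, end index exclusive, chunk label); the labels used below
# are hypothetical and do not come from the Odia chunking tagset.
Span = Tuple[int, int, str]


def spans_to_iob(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping chunk spans as one IOB tag per token."""
    tags = ["O"] * n_tokens  # tokens outside every chunk stay "O"
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the chunk
    return tags


def iob_to_spans(tags: List[str]) -> List[Span]:
    """Decode IOB tags back into chunk spans.

    An "I-X" tag that does not continue an open "X" chunk is treated
    leniently as the start of a new chunk.
    """
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last chunk
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues and label is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    return spans


# Placeholder English sentence standing in for an annotated Odia sentence.
tokens = ["the", "small", "dog", "barked", "loudly", "."]
gold_spans = [(0, 3, "NP"), (3, 4, "VP"), (4, 5, "ADVP")]
iob_tags = spans_to_iob(len(tokens), gold_spans)
```

A chunker trained on such data predicts one tag per token, and decoding with `iob_to_spans` recovers the chunks so they can be compared with the gold annotation.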
Index Terms
- Deep Learning-based POS Tagger and Chunker for Odia Language Using Pre-trained Transformers