
Multi Task Learning Based Shallow Parsing for Indian Languages

Published: 16 August 2024

Abstract

Shallow parsing is an important step for many natural language processing tasks. Although shallow parsing has a rich history for resource-rich languages, the same is not true for most Indian languages. Shallow parsing consists of POS tagging and chunking; in this study we include morph analysis as well. Our work focuses on developing shallow parsers for Indian languages.
We first consolidated the available shallow parsing corpora for seven Indian languages (Hindi, Kannada, Bangla, Malayalam, Marathi, Urdu, and Telugu) for which treebanks are publicly available. We then trained models that achieve state-of-the-art shallow parsing performance in these languages across multiple domains. Since analyzing model predictions at the sentence level is more realistic, we report the performance of these shallow parsers not only at the token level but also at the sentence level. We also present machine learning techniques for multi-task shallow parsing. Our experiments show that fine-tuned contextual embeddings with multi-task learning improve the performance of both joint and individual shallow parsing tasks across different domains. We demonstrate the transfer learning capability of these models by creating shallow parsers (POS tagging and chunking only) for Gujarati, Odia, and Punjabi, for which no treebanks are available.
As part of this work, we release the Indian Languages Shallow Linguistic (ILSL) benchmarks for 10 Indian languages, covering both major language families, Indo-Aryan and Dravidian, as common building blocks that can be used to evaluate and understand the linguistic phenomena found in Indian languages and how well newer approaches can tackle them.
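
The abstract describes fine-tuning a shared contextual encoder with multi-task learning over POS tagging and chunking. The paper's own implementation is not reproduced here; the following is a minimal, hypothetical Python sketch (assuming PyTorch and the Hugging Face transformers library, an illustrative multilingual encoder name, and illustrative tag-set sizes) of how such a multi-task token-classification model could be structured.

```python
# Hypothetical sketch of multi-task shallow parsing: a shared pretrained
# encoder with separate token-classification heads for POS tagging and
# chunking. Encoder name, label counts, and loss combination are
# illustrative assumptions, not the authors' released code.
import torch.nn as nn
from transformers import AutoModel

class MultiTaskShallowParser(nn.Module):
    def __init__(self, encoder_name="bert-base-multilingual-cased",
                 num_pos_tags=32, num_chunk_tags=11):
        super().__init__()
        # Shared contextual encoder, fine-tuned jointly for both tasks.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # One linear head per task on top of the shared token representations.
        self.pos_head = nn.Linear(hidden, num_pos_tags)
        self.chunk_head = nn.Linear(hidden, num_chunk_tags)
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, attention_mask,
                pos_labels=None, chunk_labels=None):
        # Token-level contextual representations shared by both heads.
        hidden_states = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask).last_hidden_state
        pos_logits = self.pos_head(hidden_states)
        chunk_logits = self.chunk_head(hidden_states)
        loss = None
        if pos_labels is not None and chunk_labels is not None:
            # Sum the per-task losses so both objectives update the shared encoder.
            loss = (self.loss_fn(pos_logits.view(-1, pos_logits.size(-1)),
                                 pos_labels.view(-1))
                    + self.loss_fn(chunk_logits.view(-1, chunk_logits.size(-1)),
                                   chunk_labels.view(-1)))
        return loss, pos_logits, chunk_logits
```

In a setup like this, the summed loss lets both tasks update the shared encoder, which is the multi-task effect the abstract credits for the reported improvements.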


Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 9
September 2024
186 pages
EISSN: 2375-4702
DOI: 10.1145/3613646

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 August 2024
Online AM: 11 May 2024
Accepted: 30 April 2024
Revised: 24 December 2023
Received: 30 May 2023
Published in TALLIP Volume 23, Issue 9

Author Tags

  1. Indic languages
  2. shallow parsing
  3. POS tagging
  4. chunking
  5. morph analysis
  6. natural language processing (NLP)

Qualifiers

  • Research-article

Funding Sources

  • Ministry of Electronics and Information Technology (MeitY), Government of India
