A Study on the Importance of Linguistic Suffixes in Maithili POS Tagger Development

Priyadarshi, Ankur; Saha, Sujan Kumar

doi:10.1007/978-3-030-66187-8_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11987)

Included in the following conference series:

International Conference on Mining Intelligence and Knowledge Exploration

Abstract

This paper presents our study on the effect of morphological inflections on the performance of a Maithili Part of Speech (POS) tagger. In the last few years, substantial effort has been devoted to developing morphological analyzers and POS taggers for several Indian languages, including Hindi, Bengali, Tamil, Telugu, Kannada, Punjabi and Marathi. However, we did not find any open POS tagger or morphological analyzer for Maithili, even though Maithili is one of the official languages of India, with around 50 million native speakers. We therefore worked on developing a POS tagger for Maithili. For the development, we used a manually annotated in-house Maithili corpus containing 52,190 tokens. The tagset contains 27 tags. We first trained a conditional random field (CRF) classifier with various combinations of word unigram, bigram, fixed-length suffix, and prefix features. We observed that the fixed-length suffixes did not yield the expected accuracy improvement. However, during the manual corpus annotation, we had observed that suffixes served as a helpful clue. So, instead of using fixed-length suffixes, we worked on identifying the morphological inflections in Maithili. When we used these morphological suffixes in the system, we found a noticeable performance improvement.
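
The contrast between fixed-length suffixes and morphological suffixes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the feature keys, and the suffix inventory `LINGUISTIC_SUFFIXES` are hypothetical, and the romanized strings are toy placeholders rather than the inflections identified in the paper. Each token is mapped to a feature dictionary of the kind a CRF toolkit that accepts per-token feature dictionaries could consume.

```python
from typing import Dict, Sequence

# Hypothetical placeholder inventory (romanized toy strings).
# This is NOT the suffix list identified in the paper.
LINGUISTIC_SUFFIXES = ("ak", "me", "sa", "lak")


def fixed_affixes(word: str, max_len: int = 3) -> Dict[str, str]:
    """Fixed-length prefixes and suffixes of length 1..max_len."""
    feats: Dict[str, str] = {}
    for n in range(1, max_len + 1):
        if len(word) > n:  # skip affixes that would cover the whole word
            feats[f"suf{n}"] = word[-n:]
            feats[f"pre{n}"] = word[:n]
    return feats


def linguistic_suffix(word: str,
                      suffixes: Sequence[str] = LINGUISTIC_SUFFIXES) -> str:
    """Longest listed suffix that ends the word, or "NONE"."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if len(word) > len(suf) and word.endswith(suf):
            return suf
    return "NONE"


def token_features(sent: Sequence[str], i: int,
                   use_linguistic: bool) -> Dict[str, str]:
    """Word unigram and bigram context plus one of the two suffix schemes."""
    word = sent[i]
    prev_word = sent[i - 1] if i > 0 else "<BOS>"
    next_word = sent[i + 1] if i < len(sent) - 1 else "<EOS>"
    feats = {
        "w": word,                       # current word (unigram)
        "w-1": prev_word,                # previous word
        "w+1": next_word,                # next word
        "w-1|w": f"{prev_word}|{word}",  # word bigram
    }
    if use_linguistic:
        feats["ling_suf"] = linguistic_suffix(word)
    else:
        feats.update(fixed_affixes(word))
    return feats
```

With the toy sentence `["gharak", "lok"]`, the fixed-length scheme emits `suf1`, `suf2` and `suf3` for every sufficiently long word, whether or not the final characters form an inflection. The linguistic scheme emits a single `ling_suf` feature that fires only when the ending appears in the inventory.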

Funding

This work was supported by the Science and Engineering Research Board, India [Grant No: EEQ/2016/000241].

Author information

Authors and Affiliations

Computer Science and Engineering, Birla Institute of Technology, Mesra, Ranchi, Jharkhand, India
Ankur Priyadarshi & Sujan Kumar Saha

Corresponding author

Correspondence to Sujan Kumar Saha.

Editor information

Editors and Affiliations

National Institute of Technology, Goa, India
Purushothama B. R.
National Institute of Technology, Goa, India
Veena Thenkanidiyoor
Indian Institute of Information Technology, Sri City, India
Rajendra Prasath
Indian Institute of Information Technology, Sri City, India
Odelu Vanga

About this paper

Cite this paper

Priyadarshi, A., Saha, S.K. (2020). A Study on the Importance of Linguistic Suffixes in Maithili POS Tagger Development. In: B. R., P., Thenkanidiyoor, V., Prasath, R., Vanga, O. (eds) Mining Intelligence and Knowledge Exploration. MIKE 2019. Lecture Notes in Computer Science, vol 11987. Springer, Cham. https://doi.org/10.1007/978-3-030-66187-8_2

DOI: https://doi.org/10.1007/978-3-030-66187-8_2
Published: 20 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66186-1
Online ISBN: 978-3-030-66187-8
eBook Packages: Computer Science, Computer Science (R0)
