Bridging the Gap: Towards Linguistic Resource Development for the Low-Resource Lambani Language

Dasare, Ashwini; Chowdhury, Amartya Roy; Menon, Aditya Srinivas; Anand, Konjengbam; Deepak, K. T.; Prasanna, S. R. M.

doi:10.1007/978-3-031-48312-7_10


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14339)

Included in the following conference series:

International Conference on Speech and Computer

Abstract

Language technology development is crucial for many downstream applications such as machine translation and language understanding. The lack of linguistic resources makes technology development challenging for under-resourced languages. This paper aims to develop linguistic tools for Lambani, an under-resourced tribal language of India, through corpus creation, annotation, and transfer learning from a contact language. Based on the annotated corpora, we develop the Lambani language tagset, investigate various methods for building a Part-of-Speech (POS) tagger, and create a morphology dictionary for Lambani. A total of eight tags from the BIS tagset are found to be present in the Lambani language. The experimental results reveal that the statistical approach with GMM-HMM (Gaussian Mixture Model - Hidden Markov Model) achieves a POS tagging accuracy of 96% despite the limited dataset of 6,893 sentences. This success in a low-resource setting highlights the potential of GMM-HMM for overcoming the scarcity of annotated data in under-resourced languages. The experiments not only showcase the effectiveness of the proposed methods for low-resource language processing but also shed light on their applications and open new directions for research in language revitalization and the development of digital tools for zero-resource languages.
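To make the idea of statistical POS tagging concrete, the following is a minimal sketch of a Hidden Markov Model tagger decoded with the Viterbi algorithm. It is a simplification under stated assumptions, not the method from the paper: it uses discrete word-count emissions with add-one smoothing rather than the Gaussian mixture emissions of a GMM-HMM, and the toy English sentences and the tag names (DET, NOUN, VERB) are invented for illustration and are not taken from the Lambani corpus or the BIS tagset.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start symbol

def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from lists of (word, tag) pairs."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under the counts in `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = list(emissions)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unseen words
    # best[i][tag] = (log score of the best path ending in tag at position i, backpointer)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions[START], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], tag, n_tags) + emit, p)
                for p in tags
            )
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy training data, invented for illustration only.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
]
transitions, emissions, vocab = train_hmm(toy_corpus)
predicted = viterbi(["a", "dog", "sleeps"], transitions, emissions, vocab)
print(predicted)
```

The sentence tagged at the end does not occur in the toy corpus, so the prediction relies on the learned transition and emission counts. A tagger of this kind needs only counts from annotated sentences, which is one reason HMM-style models are often considered when only a few thousand tagged sentences are available.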



Acknowledgement

We would like to thank Prashant Bannulmath, Sunita Rathod, Rajeshwari Naik and Sunil Rathod for helping us develop the Lambani POS corpus. The authors would also like to thank “Anatganak”, the high-performance computation (HPC) facility at IIT Dharwad, for enabling us to perform our experiments, and the Ministry of Electronics and Information Technology (MeitY), Govt. of India, for supporting us through the “Speech to Speech translation for tribal languages” project.

Author information

Authors and Affiliations

Indian Institute of Technology Dharwad, Dharwad, India
Amartya Roy Chowdhury, Konjengbam Anand & S. R. M. Prasanna
Indian Institute of Information Technology Dharwad, Dharwad, India
Ashwini Dasare & K. T. Deepak
Indian Institute of Information Technology Kottayam, Kottayam, India
Aditya Srinivas Menon


Corresponding authors

Correspondence to Ashwini Dasare, Amartya Roy Chowdhury, Aditya Srinivas Menon, K. T. Deepak or S. R. M. Prasanna.

Editor information

Editors and Affiliations

St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Koneru Lakshmaiah Education Foundation, Vaddeswaram, India
K. Samudravijaya
Indian Institute of Information Technology Dharwad, Dharwad, India
K. T. Deepak
Indian Institute of Technology Dharwad, Dharwad, India
Rajesh M. Hegde
KIIT Group of Colleges, Gurugram, India
Shyam S. Agrawal
Indian Institute of Technology Dharwad, Dharwad, India
S. R. Mahadeva Prasanna

About this paper

Cite this paper

Dasare, A., Chowdhury, A.R., Menon, A.S., Anand, K., Deepak, K.T., Prasanna, S.R.M. (2023). Bridging the Gap: Towards Linguistic Resource Development for the Low-Resource Lambani Language. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science, vol 14339. Springer, Cham. https://doi.org/10.1007/978-3-031-48312-7_10

DOI: https://doi.org/10.1007/978-3-031-48312-7_10
Published: 22 November 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48311-0
Online ISBN: 978-3-031-48312-7
eBook Packages: Computer Science, Computer Science (R0)
