Abstract
To broaden the inclusiveness of Natural Language Processing, this paper develops resources and builds machine learning models to identify four languages of the Northern Indo-Aryan family, also known as Pahari languages: Nepali, Garhwali, Kumaoni, and Dogri. This is the first attempt at building identification models for Pahari languages and at developing a plain-text corpus for Garhwali and Kumaoni, both of which are lesser-known, under-resourced languages/mother tongues of India. The collected corpus, which also includes Nepali and Dogri data, is statistically analyzed at the word level. We also trained traditional machine learning models for Pahari language identification on this corpus and found that a Linear Support Vector Machine with character n-gram features performed best, reaching 99.28% accuracy.
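As a rough illustration of the kind of model named above, the following minimal sketch pairs character n-gram features with a linear SVM. It is not the implementation used in this study: it is a dependency-free stand-in that fits a one-vs-rest linear SVM by Pegasos-style subgradient descent, whereas a library implementation (for example scikit-learn's LinearSVC) would normally be used. The function names, the n-gram range of 1 to 3, the hyperparameters, the labels lang_ab and lang_xy, and the toy strings are all assumptions made for illustration; none of them is corpus data.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return L2-normalised counts of character n-grams of length n_min..n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {gram: v / norm for gram, v in counts.items()}


def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(g, 0.0) * v for g, v in features.items())


def train_linear_svm(texts, labels, epochs=10, lam=0.01):
    """Fit one linear SVM per class (one-vs-rest) on the hinge loss.

    Hypothetical helper: epochs and lam are illustrative defaults.
    """
    feats = [char_ngrams(t) for t in texts]
    classes = sorted(set(labels))
    weights = {c: {} for c in classes}
    step = 0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            step += 1
            eta = 1.0 / (lam * step)  # decaying learning rate
            for c in classes:
                w = weights[c]
                target = 1.0 if c == y else -1.0
                violated = target * dot(w, x) < 1.0
                # Shrink for the L2 penalty, then step along the hinge subgradient.
                for g in w:
                    w[g] *= 1.0 - eta * lam
                if violated:
                    for g, v in x.items():
                        w[g] = w.get(g, 0.0) + eta * target * v
    return weights


def predict(weights, text):
    """Return the class whose linear score is highest for the given text."""
    x = char_ngrams(text)
    return max(sorted(weights), key=lambda c: dot(weights[c], x))


# Synthetic placeholder strings and labels, not real language data.
TOY_TEXTS = ["ababba", "baabab", "xyxyyx", "yxxyxy"]
TOY_LABELS = ["lang_ab", "lang_ab", "lang_xy", "lang_xy"]

model = train_linear_svm(TOY_TEXTS, TOY_LABELS)
print(predict(model, "abba"))
```

Character n-grams are counted over Unicode code points, so the same functions accept Devanagari text without a tokenizer. In this sketch a four-way Pahari classifier would differ only in the training texts and labels supplied.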
Notes
Because resources for other Western Pahari languages are unavailable, we consider only Dogri from this group.
Languages-of-India.
garhwali-to-be-compulsory-in-pauri-schools.
cm-releases-kumaoni-books-for-school-students.
delhi-govt-sets-up-academy-to-promote-garhwali-kumaoni-jaunsari-languages-culture.
Text within [ ] is the ITRANS romanized form of the preceding text in Devanagari.
Other popular transliteration schemes, such as WX and IAST, also ignore this combination.
Acknowledgements
The authors thank the anonymous reviewers for their comments and criticisms.
Funding
No funds, grants, or other support was received for conducting this study.
Ethics declarations
Conflicts of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Additional information
Corpus and source code available at https://github.com/rachana2010/PahariLI.
About this article
Cite this article
Gusain, R., Dash, S.R., Parida, S. et al. Automatic language identification: a case study of Pahari languages. Lang Resources & Evaluation 57, 1361–1387 (2023). https://doi.org/10.1007/s10579-023-09651-6
DOI: https://doi.org/10.1007/s10579-023-09651-6