Automatic language identification: a case study of Pahari languages

  • Special Focus: Applications of established methods to new language
Language Resources and Evaluation

Abstract

To make Natural Language Processing more inclusive, this paper develops resources and builds machine learning models to identify four languages of the Northern Indo-Aryan family, also known as Pahari languages: Nepali, Garhwali, Kumaoni, and Dogri. This is the first attempt at building identification models for Pahari languages and at developing a plain-text corpus for Garhwali and Kumaoni, both of which are lesser-known, under-resourced languages/mother tongues of India. The collected corpus, which also includes Nepali and Dogri data, is statistically analyzed at the word level. We also trained traditional machine learning models for Pahari language identification on this corpus and found that Linear Support Vector Machines with character n-gram features performed best, reaching 99.28% accuracy.
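
The character n-gram features mentioned in the abstract can be illustrated with a minimal sketch. It covers only the feature-extraction step, not the classifier, and it is not the authors' pipeline: the function name `char_ngrams`, the n-gram range, and the romanized sample string are assumptions chosen for illustration.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count overlapping character n-grams for every n in [n_min, n_max].

    Hypothetical helper for illustration; the n-gram range is an assumption.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Illustrative romanized input; real input would be Devanagari text.
features = char_ngrams("pahari", n_min=2, n_max=3)
for ngram, count in sorted(features.items()):
    print(ngram, count)
```

Python strings are indexed by Unicode code point, so in Devanagari text a dependent vowel sign counts as its own character in this sketch. Counts like these would typically be turned into a sparse vector per sentence before being passed to a linear classifier.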

Notes

  1. Because resources are not available for the other Western Pahari languages, we consider only Dogri from this group.

  2. Languages-of-India.

  3. garhwali-to-be-compulsory-in-pauri-schools.

  4. cm-releases-kumaoni-books-for-school-students.

  5. delhi-govt-sets-up-academy-to-promote-garhwali-kumaoni-jaunsari-languages-culture.

  6. Text within [ ] is the ITRANS romanized form of the preceding text in Devanagari.

  7. Other popular transliteration schemes, such as WX and IAST, also ignore this combination.

Acknowledgements

The authors thank the anonymous reviewers for their comments and criticisms.

Funding

No funds, grants, or other support was received for conducting this study.

Author information

Corresponding author

Correspondence to Rachana Gusain.

Ethics declarations

Conflicts of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The corpus and source code are available at https://github.com/rachana2010/PahariLI.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gusain, R., Dash, S.R., Parida, S. et al. Automatic language identification: a case study of Pahari languages. Lang Resources & Evaluation 57, 1361–1387 (2023). https://doi.org/10.1007/s10579-023-09651-6

  • DOI: https://doi.org/10.1007/s10579-023-09651-6
