Automatic language identification: a case study of Pahari languages

Gusain, Rachana; Dash, Satya Ranjan; Parida, Shantipriya; Jha, Girish Nath

doi:10.1007/s10579-023-09651-6

Special Focus: Applications of established methods to new language
Published: 12 May 2023

Language Resources and Evaluation, Volume 57, pages 1361–1387 (2023)

Rachana Gusain ORCID: orcid.org/0000-0002-0250-9682¹,
Satya Ranjan Dash²,
Shantipriya Parida³ &
…
Girish Nath Jha⁴

Abstract

In an attempt to expand the inclusiveness of Natural Language Processing, this paper focuses on developing resources and building machine learning models to identify four languages of the Northern Indo-Aryan family, also known as Pahari languages: Nepali, Garhwali, Kumaoni, and Dogri. This is the first attempt at building identification models for Pahari languages and at developing a plain-text corpus for Garhwali and Kumaoni, both of which are lesser-known and under-resourced languages/mother tongues of India. The collected corpus, which also includes data in Nepali and Dogri, is statistically analyzed at the word level. We also trained traditional machine learning models for Pahari language identification on this corpus and found that Linear Support Vector Machines based on character n-grams performed best, with 99.28% accuracy.
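The character n-gram approach named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names, the n-gram range (1 to 3), the learning rate, and the epoch count are all assumptions, and the training strings are toy Latin-letter placeholders rather than corpus data. The classifier is a simplified stand-in for a linear SVM, namely one-vs-rest linear scorers trained with hinge-loss subgradient steps and no regularization.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def score(weight, feats):
    """Dot product between a sparse weight vector and sparse features."""
    return sum(weight[g] * v for g, v in feats.items())


def train_ovr_hinge(samples, labels, epochs=20, lr=0.1):
    """Train one linear scorer per class (one-vs-rest) with hinge loss.

    Simplified stand-in for a linear SVM: plain subgradient updates,
    no regularization term, fixed sample order (deterministic).
    """
    classes = sorted(set(labels))
    weights = {c: Counter() for c in classes}
    feats = [char_ngrams(s) for s in samples]
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            for c in classes:
                target = 1.0 if y == c else -1.0
                # Update only when the margin constraint is violated.
                if target * score(weights[c], x) < 1.0:
                    for g, v in x.items():
                        weights[c][g] += lr * target * v
    return weights


def predict(weights, text):
    """Return the class whose linear scorer gives the highest score."""
    x = char_ngrams(text)
    return max(sorted(weights), key=lambda c: score(weights[c], x))


# Toy placeholder data (hypothetical; NOT Pahari text).
samples = ["abab ab", "ba abab", "xyxy xy", "yx xyxy"]
labels = ["A", "A", "B", "B"]
weights = train_ovr_hinge(samples, labels)

print(predict(weights, "abab"))
print(predict(weights, "xyxy"))
```

With a library such as scikit-learn, the same idea would typically be expressed as a character-level `TfidfVectorizer` feeding a `LinearSVC`. Whether the paper used exactly that configuration is not stated in the abstract.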

Notes

Owing to the lack of resources for other Western Pahari languages, we consider only the Dogri language from this group.
Languages-of-India.
garhwali-to-be-compulsory-in-pauri-schools.
cm-releases-kumaoni-books-for-school-students.
delhi-govt-sets-up-academy-to-promote-garhwali-kumaoni-jaunsari-languages-culture.
Text within [ ] is the ITRANS romanized form of the preceding text in Devanagari.
Other popular transliteration schemes, such as WX and IAST, also ignore this combination.

Acknowledgements

The authors thank the anonymous reviewers for their comments and criticisms.

Funding

No funds, grants, or other support was received for conducting this study.

Author information

Authors and Affiliations

Doon University, Dehradun, Uttarakhand, India
Rachana Gusain
KIIT University, Bhubaneswar, Odisha, India
Satya Ranjan Dash
Silo AI, Helsinki, Finland
Shantipriya Parida
Jawaharlal Nehru University, New Delhi, India
Girish Nath Jha

Corresponding author

Correspondence to Rachana Gusain.

Ethics declarations

Conflicts of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Corpus and source code available at https://github.com/rachana2010/PahariLI.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gusain, R., Dash, S.R., Parida, S. et al. Automatic language identification: a case study of Pahari languages. Lang Resources & Evaluation 57, 1361–1387 (2023). https://doi.org/10.1007/s10579-023-09651-6

Accepted: 22 February 2023
Published: 12 May 2023
Issue Date: September 2023
DOI: https://doi.org/10.1007/s10579-023-09651-6
