Abstract
With the proliferation of Nepali text documents online, researchers in Nepal and overseas have begun working on their automated analysis for quick inference, using machine learning (ML) algorithms that range from traditional ML methods to recent deep learning (DL) methods. However, researchers still lack a clear picture of recent trends in natural language processing (NLP) research for the Nepali language. In this paper, we survey NLP research works and their associated resources for the Nepali language. Furthermore, we organize the NLP approaches, techniques, and application tasks used in Nepali language processing under a comprehensive taxonomy for each. Finally, we discuss and analyze this assimilated information with a view to further improving NLP research for the Nepali language. Our survey provides researchers with detailed background and motivation, which not only opens up new potential avenues but also supports further progress in NLP research for the Nepali language.
Notes
http://www.mpp.org.np, (accessed date: 02/07/2021).
www.ltk.org.np, (accessed date: 02/07/2021).
http://www.elra.info/en/catalogues/free-resources/nepali-corpora/ (accessed date: 17/02/2021).
https://data.ldcil.org/a-gold-standard-nepali-raw-text-corpus (accessed date: 17/02/2021).
https://ieee-dataport.org/open-access/large-scale-nepali-text-corpus (accessed date: 16/02/2021).
https://github.com/sndsabin/Nepali-News-Classifier (accessed date: 17/01/2021), Information and Language Processing Research Lab, Kathmandu University, Nepal.
https://www.kaggle.com/ashokpant/nepali-news-dataset-large (accessed date: 16/02/2021).
https://ieee-dataport.org/documents/nepaliliinguistic (accessed date: 16/02/2021).
http://xixona.dlsi.ua.es/~fran/apertium2-documentation.pdf (accessed date: 13/02/2021).
https://ekantipur.com/ (accessed date: 13/02/2021).
https://nagariknews.nagariknetwork.com/ (accessed date: 13/02/2021).
Cite this article
Shahi, T.B., Sitaula, C. Natural language processing for Nepali text: a review. Artif Intell Rev 55, 3401–3429 (2022). https://doi.org/10.1007/s10462-021-10093-1
DOI: https://doi.org/10.1007/s10462-021-10093-1