Abstract
Three Indic (Indo-Aryan) languages, Bengali, Hindi, and Nepali, are examined here at the character level to identify their similarities and differences. Because they share a common root, Sanskrit, Indic languages have many characteristics in common. Computer and language scientists can therefore develop common Natural Language Processing (NLP) techniques or algorithms for them. With this idea in mind, we compare and analyze the three languages character by character. As an application of the hypothesis, we also developed a uniform sorting algorithm in two steps: first for Bengali and Nepali only, and then extended to Hindi. Our investigation with more than 30,000 words from each language suggests that the algorithm achieves full accuracy with respect to the ordering set by the respective language authorities, along with good efficiency.
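To make the idea of lexicon-based sorting concrete, here is a minimal Python sketch. It is not the authors' algorithm, which the abstract does not spell out. It assumes a per-language order table and ranks each character by its position in that table. The names `ORDER`, `build_rank`, `sort_key`, and `lexicon_sort` are hypothetical, and the abbreviated Bengali table (vowels followed by the signs anusvara, visarga, and chandrabindu) is an illustrative assumption rather than an authoritative ordering.

```python
from typing import Dict, List

# Hypothetical, abbreviated order table (an assumption for illustration only):
# the independent vowels, then anusvara, visarga, chandrabindu.
ORDER = (
    "\u0985\u0986\u0987\u0988\u0989\u098A\u098B\u098F\u0990\u0993\u0994"  # vowels
    "\u0982\u0983\u0981"  # anusvara, visarga, chandrabindu
)

def build_rank(order: str) -> Dict[str, int]:
    """Map each character of the order table to its position."""
    return {ch: i for i, ch in enumerate(order)}

def sort_key(word: str, rank: Dict[str, int]) -> List[int]:
    """Rank listed characters by table position; unlisted characters
    sort after all listed ones, in code point order."""
    base = len(rank)
    return [rank.get(ch, base + ord(ch)) for ch in word]

def lexicon_sort(words: List[str], order: str) -> List[str]:
    """Sort words by the given order table instead of raw code points."""
    rank = build_rank(order)
    return sorted(words, key=lambda w: sort_key(w, rank))

ring = "\u0986\u0982\u099F\u09BF"  # Bengali word for "ring"
law = "\u0986\u0987\u09A8"         # Bengali word for "law"

by_code_point = sorted([law, ring])
by_lexicon = lexicon_sort([ring, law], ORDER)
```

Under this assumed table the two orderings disagree: plain code point sorting places the word containing the anusvara (U+0982) first, while the table-driven key places it second. A different language would need only a different table passed as `order`, which is the sense in which one routine might serve several related scripts.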
Index Terms
- Towards Developing Uniform Lexicon Based Sorting Algorithm for Three Prominent Indo-Aryan Languages