
Towards Developing Uniform Lexicon Based Sorting Algorithm for Three Prominent Indo-Aryan Languages

Published: 13 December 2021

Abstract

Three Indic (Indo-Aryan) languages, Bengali, Hindi, and Nepali, are examined here at the character level to identify their similarities and dissimilarities. Because they share a common root, Sanskrit, Indic languages have many characteristics in common, which gives computer and language scientists the opportunity to develop common Natural Language Processing (NLP) techniques and algorithms. With this in mind, we compare and analyze the three languages character by character. As an application of the hypothesis, we also developed a uniform sorting algorithm in two steps: first for Bengali and Nepali only, and then extended to Hindi. Our investigation with more than 30,000 words from each language suggests that the algorithm fully reproduces the ordering set by the language authorities of the respective languages while maintaining good efficiency.
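The abstract does not describe the algorithm itself, so the following is only a minimal sketch of the general idea behind lexicon-based sorting: each character is ranked by its position in a dictionary-defined order instead of by its Unicode code point. The function name `make_sort_key`, the list `ILLUSTRATIVE_ORDER`, and the sample strings are all assumptions made for illustration. The rank list is a hypothetical five-character fragment, not the ordering used in the paper or prescribed by any language authority, and the sample strings are toy sequences rather than real words.

```python
def make_sort_key(lexicon_order):
    """Build a sort key that ranks characters by their position in lexicon_order.

    Characters missing from the lexicon sort after all listed characters,
    ordered among themselves by code point.
    """
    rank = {ch: i for i, ch in enumerate(lexicon_order)}
    fallback = len(rank)

    def key(word):
        return [
            (rank[ch], 0) if ch in rank else (fallback, ord(ch))
            for ch in word
        ]

    return key


# Hypothetical fragment of a lexicon order (an assumption, not from the paper):
# two vowels, then the anusvara sign, then two consonants.
ILLUSTRATIVE_ORDER = ["অ", "আ", "ং", "ক", "খ"]

sort_key = make_sort_key(ILLUSTRATIVE_ORDER)

# Toy strings, not real words.
samples = ["কং", "কঅ", "আক", "অক"]

lexicon_sorted = sorted(samples, key=sort_key)
codepoint_sorted = sorted(samples)

print(lexicon_sorted)
print(codepoint_sorted)
```

In this toy setup the two orderings differ only because the anusvara sign (U+0982) has a lower code point than the independent vowels but is ranked after them in the assumed lexicon. A real implementation would also need language-specific rules, for example for conjuncts and dependent vowel signs, which this sketch does not attempt.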



      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 3
        May 2022
        413 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3505182

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 13 December 2021
        • Revised: 1 September 2021
        • Accepted: 1 September 2021
        • Received: 1 April 2020
        Published in tallip Volume 21, Issue 3


        Qualifiers

        • research-article
        • Refereed