Towards Developing Uniform Lexicon Based Sorting Algorithm for Three Prominent Indo-Aryan Languages

Authors:
Mir Ragib Ishraq, Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh
Nitesh Khadka, Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh
Asif Mohammed Samir, Institute of Information and Communication Technology, Shahjalal University of Science and Technology, Sylhet, Bangladesh
M. Shahidur Rahman, Department of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 3, Article No. 57, pp. 1–20. https://doi.org/10.1145/3488371

Published: 13 December 2021

Abstract

This article examines three Indic (Indo-Aryan) languages, Bengali, Hindi, and Nepali, at the character level to identify their similarities and differences. Because they share a common root in Sanskrit, Indic languages have many characteristics in common, which gives computer scientists and linguists an opportunity to develop shared Natural Language Processing (NLP) techniques and algorithms. With this in mind, we compare and analyze the three languages character by character. As an application of this hypothesis, we developed a uniform sorting algorithm in two steps: first for Bengali and Nepali only, and then extended to Hindi. Our investigation, using more than 30,000 words from each language, suggests that the algorithm achieves total accuracy with respect to the ordering set by the respective language authorities, with good efficiency.
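To make the idea of lexicon-based sorting concrete, the following is a minimal sketch in Python. It is not the algorithm described in the article, whose details are not given in the abstract. It only illustrates the general technique of sorting by a collation table instead of by raw Unicode code point. The names `build_rank`, `sort_key`, `lexicon_sort`, and `TOY_ORDER` are hypothetical, and the five-character order is a toy assumption, not an ordering published by any language authority.

```python
from typing import Dict, List, Tuple

def build_rank(order: str) -> Dict[str, int]:
    """Map each character of a collation order string to its position."""
    return {ch: i for i, ch in enumerate(order)}

def sort_key(word: str, rank: Dict[str, int]) -> Tuple[Tuple[int, int], ...]:
    """Build a per-character key: listed characters sort by rank,
    unlisted characters sort after them, by code point."""
    return tuple((0, rank[ch]) if ch in rank else (1, ord(ch)) for ch in word)

def lexicon_sort(words: List[str], order: str) -> List[str]:
    """Sort words according to the given collation order."""
    rank = build_rank(order)
    return sorted(words, key=lambda w: sort_key(w, rank))

# Toy order (an assumption for illustration only):
# BENGALI LETTER A, AA, SIGN ANUSVARA, LETTER KA, LETTER KHA.
TOY_ORDER = "\u0985\u0986\u0982\u0995\u0996"

words = [
    "\u0995\u0996",  # KA KHA
    "\u0982\u0995",  # ANUSVARA KA
    "\u0986\u0995",  # AA KA
    "\u0985\u0995",  # A KA
]

by_code_point = sorted(words)               # ANUSVARA (U+0982) sorts first
by_lexicon = lexicon_sort(words, TOY_ORDER)  # ANUSVARA follows the vowels here
```

In this toy case the two results differ only in where the word beginning with the anusvara lands, which shows that code-point order and a dictionary order need not agree. A rank table of this kind could in principle be shared across scripts by giving corresponding Bengali and Devanagari characters the same rank, though whether the article does this is not stated in the abstract.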

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Sorting and searching

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 21, Issue 3
May 2022
413 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3505182
Editor: Imed Zitouni, Google, USA
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 December 2021
- Revised: 1 September 2021
- Accepted: 1 September 2021
- Received: 1 April 2020
Author Tags
Lexicon
algorithm
Indic
Indo-Aryan
collation
Unicode
Bengali
Hindi
Nepali
application
natural language processing
South-Asian
language family tree
Qualifiers
- research-article
- Refereed
