Abstract
As social media platforms become more accessible, the use of textual communication (posts, comments, chats, etc.) in local languages and scripts such as Amharic has grown rapidly. However, many users write in Latin script rather than the native script because Latin keyboards offer more convenient character input. Existing Latin-to-Amharic transliteration systems do not account for double consonants and double vowels, which causes transliteration errors. Moreover, because these systems follow different character mapping conventions, social media texts reflect a wide variety of conventions adopted by users when producing text. Current systems address neither of these gaps. In this work, we present RBLatAm (Rule-Based Latin to Amharic), a generic rule-based transliteration system that converts Amharic words written in Latin script back into native Amharic script. The system is based on mapping rules engineered from three existing transliteration systems (Microsoft, Google, SERA), supplemented by rules for double consonants and for conventions adopted on social media by Amharic speakers. When tested on transliterated Amharic non-named-entity words and on person names, the system achieves accuracies of 75.8% and 84.6%, respectively. It also correctly transliterates words reported as errors in previous studies. By making it possible to process Amharic texts that were originally produced in Latin script, the system substantially improves the basis for text mining research on Amharic.
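To make the idea of rule-based mapping with double-consonant handling concrete, the following is a minimal sketch and not the RBLatAm rule set. The tiny consonant table, the vowel-to-order mapping (loosely SERA-like), and the rule that a doubled consonant collapses to a single character are all assumptions chosen for illustration; the function name `transliterate` is hypothetical. The sketch relies only on the layout of the Unicode Ethiopic block, in which each consonant row starts at a base code point and the vowel order is an offset from that base.

```python
# Illustrative sketch only: a heavily reduced, hypothetical rule table,
# not the mapping rules used by RBLatAm.

# Base code point (first order) of a few consonant rows in the Ethiopic block.
CONSONANT_BASE = {
    "l": 0x1208, "m": 0x1218, "r": 0x1228, "s": 0x1230,
    "b": 0x1260, "t": 0x1270, "n": 0x1290, "k": 0x12A8,
}

# Assumed Latin vowel -> offset within a consonant row (SERA-like choice).
VOWEL_ORDER = {"e": 0, "u": 1, "i": 2, "a": 3, "o": 6}
BARE_ORDER = 5  # sixth order: consonant with no following vowel


def transliterate(word: str) -> str:
    """Map a Latin-script word to Ethiopic using the toy rules above."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch not in CONSONANT_BASE:
            # Standalone vowels and unknown letters pass through unchanged
            # in this sketch; a full system needs rules for them too.
            out.append(ch)
            i += 1
            continue
        # Assumed double-consonant rule: "ll" is treated as one consonant,
        # so skip the first letter of a doubled pair.
        if i + 1 < len(word) and word[i + 1] == ch:
            i += 1
            continue
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if nxt in VOWEL_ORDER:
            out.append(chr(CONSONANT_BASE[ch] + VOWEL_ORDER[nxt]))
            i += 2
        else:
            out.append(chr(CONSONANT_BASE[ch] + BARE_ORDER))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # Both spellings should yield the same Ethiopic string under these rules.
    print(transliterate("selam"), transliterate("sellam"))
```

Under these assumptions, the spellings "selam" and "sellam" produce the same output, which is the kind of user variation the abstract describes. A realistic system would also need rules for standalone vowels, double vowels, digraph consonants, and competing mapping conventions, none of which this sketch covers.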
Notes
https://play.google.com/store/apps/details?id=com.google.android.inputmethod.latin&hl=en_US&gl=US
Not all Amharic characters are shown in the table because of space limitations.
Cite this article
Abebaw, Z., Rauber, A. & Atnafu, S. Transliterating Latin to Amharic scripts using user-defined rules and character mappings. Int J Digit Libr 24, 63–75 (2023). https://doi.org/10.1007/s00799-023-00346-5