research-article

Transliteration Characteristics in Romanized Assamese Language Social Media Text and Machine Transliteration

Published: 08 February 2024

Abstract

This article aims to understand the different transliteration behaviors found in Romanized Assamese text on social media. Assamese belongs to the Indo-Aryan language family and is one of the 22 scheduled languages of India. With the increasing popularity of social media in India and the common use of the English QWERTY keyboard, Indian users on social media express themselves in their native languages but in the Roman/Latin script. Unlike some other Asian languages (for example, Chinese with Pinyin), Indian languages do not have a common standard romanization convention for writing on social media platforms. Assamese and English also have very different orthographies. Thus, considering both the orthographic and phonemic characteristics of the language, this study tries to explain how Assamese vowels, vowel diacritics, and consonants are represented in Roman transliterated form. We collected a dataset of romanized Assamese social media texts from three popular social media sites (Facebook, YouTube, and X, formerly known as Twitter) and manually labeled the texts with their native Assamese script. We also compare the transliterated Assamese social media texts with six different Assamese romanization schemes; the comparison shows that Assamese users on social media do not adhere to any fixed romanization scheme. We have built three separate character-level transliteration models from our dataset: (1) a traditional phrase-based statistical machine transliteration (PBSMT) model, (2) a BiLSTM neural seq2seq model with attention, and (3) a neural transformer model. A thorough error analysis has been performed on the transliteration results obtained from the three state-of-the-art models mentioned above, which may help to build a more robust machine transliteration system for the Assamese social media domain in the future. Finally, an attention analysis experiment is carried out using attention weight scores taken from the character-level BiLSTM neural seq2seq transliteration model built from our dataset.
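As an illustration of what "character-level" means for such models, the following minimal Python sketch shows one common way to prepare a romanized/native word pair so that each character becomes a token. It is not taken from the article: the function names, the `_` space marker, and the single example pair (the romanization "axom" for অসম) are assumptions chosen only for illustration.

```python
# Minimal sketch (assumed preprocessing, not the authors' pipeline):
# turn each word into a sequence of space-separated characters so that
# a phrase-based or neural toolkit treats every character as one token.

def to_char_tokens(text: str, space_marker: str = "_") -> str:
    """Split text into space-separated Unicode code points.

    Real spaces are replaced by `space_marker` (a hypothetical choice)
    so that word boundaries survive the segmentation.
    """
    return " ".join(space_marker if ch == " " else ch for ch in text.strip())


def from_char_tokens(tokens: str, space_marker: str = "_") -> str:
    """Invert to_char_tokens: join characters and restore real spaces."""
    return "".join(" " if tok == space_marker else tok for tok in tokens.split())


# Illustrative pair only: romanized form on the left, Assamese script on the right.
pairs = [("axom", "অসম")]

# Character-level parallel data: source and target as token sequences.
char_level_pairs = [(to_char_tokens(src), to_char_tokens(tgt)) for src, tgt in pairs]
```

Note that this splits on Unicode code points, so an Assamese vowel diacritic becomes its own token, separate from the consonant it attaches to. Whether that granularity matches the models described in the article is not stated in the abstract.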



      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 23, Issue 2
      February 2024
      340 pages
      EISSN:2375-4702
      DOI:10.1145/3613556

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 08 February 2024
      Online AM: 06 January 2024
      Accepted: 14 December 2023
      Revised: 24 May 2023
      Received: 13 September 2022
      Published in TALLIP Volume 23, Issue 2


      Author Tags

      1. Transliteration
      2. grapheme
      3. phoneme
      4. PBSMT
      5. BiLSTM
      6. attention
      7. transformer

      Qualifiers

      • Research-article

      Funding Sources

      • Ministry of Electronics & Information Technology, Government of India
