Sentiment Analysis of Code-Mixed Telugu-English Data Leveraging Syllable and Word Embeddings

Published: 13 October 2023

Abstract

Learning the inherent meaning of a word in Natural Language Processing (NLP) has motivated researchers to represent a word at various levels of abstraction, namely character-level, morpheme-level, and subword-level vector representations. Syllable-Aware Word Embeddings (SAWE) can effectively handle NLP tasks involving agglutinative and fusional languages. However, research attempts at assessing SAWE on such extrinsic NLP tasks have been scarce, especially for low-resource languages in the context of code-mixing with English. A model that learns SAWE to extract semantics from fine-grained subunits of a word is proposed in this article, and the representative ability of the embeddings is assessed through sentiment analysis of code-mixed Telugu-English review corpora. Multilingual societies and advancements in communication technologies have led to the prolific usage of code-mixed data, which renders State-of-the-Art (SOTA) sentiment analysis models developed on monolingual data ineffective. Social media users in the Indian subcontinent tend to mix English and their respective native language (written in phonetic, romanized form) when expressing opinions or sentiments. A code-mixing scenario allows borrowing words from a foreign language, shorthand notations, vowel elongation, and words that do not follow syntactic/grammatical rules, all of which make sentiment analysis of code-mixed data challenging. Deep neural architectures such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks have been shown to be effective in solving several NLP tasks, such as sequence labeling, named entity recognition, and machine translation. In this article, a framework to perform sentiment analysis on a code-mixed Telugu-English review corpus is implemented.
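To illustrate the kind of fine-grained subunits involved, a naive syllabifier for romanized (phonetic-English) tokens can be sketched as follows. The vowel-cluster heuristic and the function name are assumptions for illustration only, not the segmentation procedure used in this work:

```python
import re

def syllabify(token):
    """Split a romanized token into rough syllable-like chunks: each chunk
    is a consonant run followed by a vowel cluster; any trailing consonant
    run forms a final chunk of its own. Illustrative heuristic only."""
    parts = re.findall(r"[^aeiou]*[aeiou]+|[^aeiou]+$", token.lower())
    return parts or [token.lower()]
```

For example, `syllabify("bagundi")` yields `["ba", "gu", "ndi"]`; a real syllabifier for Telugu written in Roman script would need to handle digraphs, long vowels, and aspirated consonants.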
Both the word embedding and the SAWE are input to a unified deep neural network that contains a two-level Bidirectional Long Short-Term Memory/Gated Recurrent Unit network with Softmax as the output layer. The proposed model leverages the advantages of both word embeddings and SAWE, which enables it to outperform existing SOTA code-mixed sentiment analysis models on the Telugu-English code-mixed dataset of the International Institute of Information Technology–Hyderabad and on a dataset curated by the authors. Compared with the best-performing SOTA model, the proposed model improves F1-score by 3% and accuracy by 2% on the former dataset, and F1-score by 7% and accuracy by 5% on the latter.
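A minimal sketch of how the two representations could be combined per token before entering the recurrent layers follows. All names, dimensions, and the averaging composition are illustrative assumptions; in the actual model both embedding tables are learned:

```python
import numpy as np

rng = np.random.default_rng(0)
WORD_DIM, SYL_DIM = 8, 4  # toy dimensions; the trained model's sizes differ

# Toy lookup tables standing in for trained word embeddings and
# syllable-aware word embeddings (SAWE); real tables are learned.
word_emb = {w: rng.normal(size=WORD_DIM) for w in ["chala", "bagundi"]}
syl_emb = {s: rng.normal(size=SYL_DIM) for s in ["cha", "la", "ba", "gu", "ndi"]}

def sawe_vector(syllables):
    # One simple composition: average the vectors of a word's syllables.
    return np.mean([syl_emb[s] for s in syllables], axis=0)

def token_vector(word, syllables):
    # Concatenate the word embedding with the syllable-aware embedding;
    # the resulting sequence of per-token vectors is what feeds the
    # two-level BiLSTM/BiGRU stack topped by a Softmax classifier.
    return np.concatenate([word_emb[word], sawe_vector(syllables)])

vec = token_vector("bagundi", ["ba", "gu", "ndi"])
print(vec.shape)  # (12,)
```

Concatenation lets the recurrent layers see both the whole-word semantics and the subword signal that survives spelling variation in code-mixed text.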



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 10, October 2023, 226 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3627976

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 13 October 2023
          • Online AM: 4 September 2023
          • Accepted: 29 August 2023
          • Revised: 24 June 2023
          • Received: 19 December 2022
