Abstract
Learning the inherent meaning of a word has motivated researchers in Natural Language Processing (NLP) to represent words at various levels of abstraction, namely character-level, morpheme-level, and subword-level vector representations. Syllable-Aware Word Embeddings (SAWE) can effectively handle NLP tasks involving agglutinative and fusional languages. However, research assessing SAWE on such extrinsic NLP tasks has been scant, especially for low-resource languages in code-mixed settings with English. This article proposes a model that learns SAWE to extract semantics from the fine-grained subunits of a word, and the representative ability of the embeddings is assessed through sentiment analysis of code-mixed Telugu-English review corpora. Multilingual societies and advances in communication technologies have led to the prolific use of code-mixed data, which renders State-of-the-Art (SOTA) sentiment analysis models developed on monolingual data ineffective. Social media users in the Indian subcontinent tend to mix English with their native language (written phonetically in Roman script) when expressing opinions or sentiments. Code-mixing permits borrowing words from a foreign language, shorthand notations, vowel elongation, and usage that disregards syntactic/grammatical rules, all of which make sentiment analysis of code-mixed data challenging. Deep neural architectures such as Long Short-Term Memory and Gated Recurrent Unit networks have proven effective on several NLP tasks, including sequence labeling, named entity recognition, and machine translation. This article implements a framework for sentiment analysis of a code-mixed Telugu-English review corpus.
Both conventional word embeddings and SAWE are fed to a unified deep neural network consisting of a two-level Bidirectional Long Short-Term Memory/Gated Recurrent Unit network with a Softmax output layer. By leveraging the advantages of both word embeddings and SAWE, the proposed model outperforms existing SOTA code-mixed sentiment analysis models on the Telugu-English code-mixed dataset of the International Institute of Information Technology–Hyderabad and on a dataset curated by the authors. On these datasets, the proposed model improves on the best-performing SOTA model by 3% in F1-score and 2% in accuracy, and by 7% in F1-score and 5% in accuracy, respectively.
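The core idea of a syllable-aware word embedding, composing a word vector from its syllable subunits, can be sketched minimally as follows. The vowel-nucleus syllabifier, the randomly initialized syllable vectors, and the mean-pooling composition below are all illustrative assumptions; in the proposed model the syllable representations are learned end to end and fed, together with word embeddings, to the BiLSTM/GRU network rather than fixed.

```python
import numpy as np

def syllabify(word):
    """Naive vowel-nucleus syllabifier for romanized text (illustration only)."""
    vowels = set("aeiou")
    syllables, cur = [], ""
    for ch in word.lower():
        cur += ch
        if ch in vowels:          # close a syllable at each vowel nucleus
            syllables.append(cur)
            cur = ""
    if cur:                       # attach a trailing consonant cluster
        if syllables:
            syllables[-1] += cur
        else:
            syllables.append(cur)
    return syllables

rng = np.random.default_rng(0)
DIM = 8                           # toy embedding dimensionality
syllable_vocab = {}               # syllable -> vector, grown on demand

def syllable_vector(syl):
    if syl not in syllable_vocab:
        syllable_vocab[syl] = rng.normal(size=DIM)
    return syllable_vocab[syl]

def sawe(word):
    """Word vector as the mean of its syllable vectors (one simple composition)."""
    return np.mean([syllable_vector(s) for s in syllabify(word)], axis=0)

print(syllabify("chala"))         # ['cha', 'la']
print(sawe("bagundi").shape)      # (8,)
```

Because the vector is built from syllables, out-of-vocabulary romanized Telugu words (frequent in code-mixed social media text) still receive a representation from their known subunits.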
Index Terms
- Sentiment Analysis of Code-Mixed Telugu-English Data Leveraging Syllable and Word Embeddings