Abstract
Learning the inherent meaning of a word has motivated researchers in Natural Language Processing (NLP) to represent words at various levels of abstraction, namely character-level, morpheme-level, and subword-level vector representations. Syllable-Aware Word Embeddings (SAWE) can effectively handle NLP tasks involving agglutinative and fusional languages. However, research assessing SAWE on such extrinsic NLP tasks has been scant, especially for low-resource languages in code-mixed settings with English. This article proposes a model that learns SAWE to extract semantics from the fine-grained subunits of a word, and the representative ability of the embeddings is assessed through sentiment analysis of code-mixed Telugu-English review corpora. Multilingual societies and advances in communication technologies have led to the prolific use of code-mixed data, which renders State-of-the-Art (SOTA) sentiment analysis models developed on monolingual data ineffective. Social media users in the Indian subcontinent tend to mix English with their native language (written phonetically in Roman script) when expressing opinions or sentiments. Code-mixing permits borrowing words from a foreign language, shorthand notations, vowel elongation, and usage that disregards syntactic/grammatical rules, all of which make sentiment analysis of code-mixed data challenging. Deep neural architectures such as Long Short-Term Memory and Gated Recurrent Unit networks have proven effective on several NLP tasks, including sequence labeling, named entity recognition, and machine translation. This article implements a framework for sentiment analysis of a code-mixed Telugu-English review corpus.
Both conventional word embeddings and SAWE are fed to a unified deep neural network consisting of a two-level Bidirectional Long Short-Term Memory/Gated Recurrent Unit network with a Softmax output layer. By leveraging the advantages of both word embeddings and SAWE, the proposed model outperforms existing SOTA code-mixed sentiment analysis models on the Telugu-English code-mixed dataset of the International Institute of Information Technology–Hyderabad and on a dataset curated by the authors. On these datasets, the proposed model improves on the best-performing SOTA model by 3% in F1-score and 2% in accuracy, and by 7% in F1-score and 5% in accuracy, respectively.
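The core idea of a syllable-aware word embedding, composing a word vector from its syllable subunits, can be sketched minimally as follows. The vowel-nucleus syllabifier, the randomly initialized syllable vectors, and the mean-pooling composition below are all illustrative assumptions; in the proposed model the syllable representations are learned end to end and fed, together with word embeddings, to the BiLSTM/GRU network rather than fixed.

```python
import numpy as np

def syllabify(word):
    """Naive vowel-nucleus syllabifier for romanized text (illustration only)."""
    vowels = set("aeiou")
    syllables, cur = [], ""
    for ch in word.lower():
        cur += ch
        if ch in vowels:          # close a syllable at each vowel nucleus
            syllables.append(cur)
            cur = ""
    if cur:                       # attach a trailing consonant cluster
        if syllables:
            syllables[-1] += cur
        else:
            syllables.append(cur)
    return syllables

rng = np.random.default_rng(0)
DIM = 8                           # toy embedding dimensionality
syllable_vocab = {}               # syllable -> vector, grown on demand

def syllable_vector(syl):
    if syl not in syllable_vocab:
        syllable_vocab[syl] = rng.normal(size=DIM)
    return syllable_vocab[syl]

def sawe(word):
    """Word vector as the mean of its syllable vectors (one simple composition)."""
    return np.mean([syllable_vector(s) for s in syllabify(word)], axis=0)

print(syllabify("chala"))         # ['cha', 'la']
print(sawe("bagundi").shape)      # (8,)
```

Because the vector is built from syllables, out-of-vocabulary romanized Telugu words (frequent in code-mixed social media text) still receive a representation from their known subunits.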
Index Terms
- Sentiment Analysis of Code-Mixed Telugu-English Data Leveraging Syllable and Word Embeddings