Abstract
The rapid growth of social media platforms offers many opportunities to enhance interaction and raise awareness of current events across countries. Many people use social media to share their thoughts and opinions on societal and political issues. Nonetheless, some individuals misuse these platforms by posting toxic, hostile, and insulting comments. Detecting and controlling such content at its earliest stage is therefore crucial, since its spread can harm social relations and negatively affect a person's life. Social media text in non-English languages is increasing owing to active participation from multilingual societies. Among these languages, Hindi-English code-mixed text is especially prevalent in India. Most previous work on detecting cyber aggression concentrates on English text, which leaves considerable scope for work on other languages such as Hindi-English code-mixed text. This paper proposes a code-mixed hybrid embedding (CMHE) at the character and word levels to capture similarly spelled and contextually related words. The proposed embedding also contributes significantly to reducing out-of-vocabulary words and captures words with similar polarity. A deep learning framework based on CMHE and a self-attention mechanism is then proposed to retrieve significant features for classification. To evaluate the proposed model, experiments were performed on two publicly available datasets: the TRAC 2-2020 Hindi-English code-mixed dataset (77.54% accuracy, 77.09% weighted-average F1 score) and a hate speech dataset (75.23% accuracy, 73.34% weighted-average F1 score). The experimental results validate the effectiveness of the proposed approach against the state of the art.
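The intuition behind combining character-level and word-level information can be illustrated with a small sketch. This is not the paper's CMHE: the function names, the toy vocabulary, and the use of raw character trigram counts in place of learned embeddings are all assumptions made for illustration. It only shows why a character-level component lets a spelling variant that is missing from the word vocabulary still resemble its in-vocabulary counterpart.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n=3):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"<{word.lower()}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def hybrid_vector(word, word_vocab):
    """Build a sparse vector with character-level and word-level features.

    Hypothetical stand-in for a hybrid embedding: keys are tagged
    ("char", ...) or ("word", ...) so the two feature spaces never collide.
    """
    vec = Counter(("char", g) for g in char_ngrams(word))
    if word.lower() in word_vocab:
        # Word-level feature exists only for in-vocabulary words.
        vec[("word", word.lower())] += 1
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy vocabulary (assumed): two romanized Hindi words.
vocab = {"bahut", "accha"}

known = hybrid_vector("bahut", vocab)      # in vocabulary
variant = hybrid_vector("bahot", vocab)    # spelling variant, out of vocabulary
unrelated = hybrid_vector("accha", vocab)  # in vocabulary, different word

sim_variant = cosine(known, variant)
sim_unrelated = cosine(known, unrelated)
print(sim_variant > sim_unrelated)
```

In this toy setting the out-of-vocabulary variant "bahot" has no word-level feature, yet it shares the trigrams "<ba" and "bah" with "bahut", so the two remain closer to each other than "bahut" is to "accha". A learned model would replace the raw counts with trained character and word embeddings.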
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Ethics declarations
Competing Interests
The authors declare that they have no competing interests.
About this article
Cite this article
Mundra, S., Mittal, N. CMHE-AN: Code mixed hybrid embedding based attention network for aggression identification in hindi english code-mixed text. Multimed Tools Appl 82, 11337–11364 (2023). https://doi.org/10.1007/s11042-022-13668-4
DOI: https://doi.org/10.1007/s11042-022-13668-4