CMHE-AN: Code mixed hybrid embedding based attention network for aggression identification in hindi english code-mixed text

  • 1226: Deep-Patterns Emotion Recognition in the Wild
Multimedia Tools and Applications

Abstract

The widespread growth of social media platforms provides many opportunities to enhance interaction and raise awareness of recent events across countries. Many people use social media to share their thoughts and opinions on societal and political issues. Nonetheless, some individuals misuse these platforms by posting toxic, hostile, and insulting comments. Detecting and controlling such content at the earliest stage is therefore crucial, since its spread can harm social relations and negatively affect a person’s life. Social media text containing non-English languages is increasing because of active participation from multilingual societies. Among the non-English languages, Hindi-English code-mixed text is especially prevalent in India. Most previous work on detecting cyber aggression concentrates on English text, which leaves considerable scope for work on other languages such as Hindi-English code-mixed text. This paper proposes a code-mixed hybrid embedding (CMHE) at the character and word levels to capture similarly spelled and contextually related words. The proposed embedding also contributes significantly to reducing out-of-vocabulary words and captures words with similar polarity. A deep learning framework based on CMHE and a self-attention mechanism is then proposed to retrieve significant features for classification. To evaluate the proposed model, experiments were performed on two publicly available datasets: the TRAC 2-2020 Hindi-English code-mixed dataset (77.54% accuracy, 77.09% weighted average F1 score) and a hate speech dataset (75.23% accuracy, 73.34% weighted average F1 score). The experimental results validate the effectiveness of the proposed approach against the state of the art.
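The general idea of combining character-level and word-level representations and then pooling them with attention can be illustrated with a minimal sketch. This is not the authors' CMHE or their network, whose details are not given in the abstract. It assumes fastText-style character n-grams, deterministic hashed vectors standing in for trained embeddings, and a single dot-product attention pooling step. All function names, the dimension of 16, and the sample tokens are hypothetical.

```python
import hashlib
import math


def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def hashed_vector(key, dim=16):
    """Deterministic stand-in for a trained embedding (assumption: dim <= 32)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return [(b / 255.0) * 2 - 1 for b in digest[:dim]]


def char_level_vector(word, dim=16):
    """Average the vectors of all character n-grams of the word."""
    vecs = [hashed_vector("c:" + g, dim) for g in char_ngrams(word)]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def hybrid_embedding(word, word_table, dim=16):
    """Concatenate a word-level vector with a character-level vector.

    A word missing from word_table gets a zero word-level part but still
    receives a non-trivial character-level part, so it is not fully
    out of vocabulary.
    """
    word_vec = word_table.get(word, [0.0] * dim)
    return word_vec + char_level_vector(word, dim)


def attention_pool(vectors, query):
    """Scaled dot-product attention of one query over token vectors."""
    d = len(query)
    scores = [sum(q * v for q, v in zip(query, vec)) / math.sqrt(d)
              for vec in vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * vec[j] for w, vec in zip(weights, vectors))
              for j in range(d)]
    return pooled, weights


# Hypothetical usage: "bahot" is a spelling variant absent from the word table.
word_table = {w: hashed_vector("w:" + w) for w in ["yeh", "bahut", "bura", "hai"]}
tokens = ["yeh", "bahot", "bura", "hai"]
embedded = [hybrid_embedding(t, word_table) for t in tokens]
query = [1.0] * 32  # would be a learned parameter in a trained model
sentence_vector, attention_weights = attention_pool(embedded, query)
shared = set(char_ngrams("bahut")) & set(char_ngrams("bahot"))
```

Spelling variants such as "bahut" and "bahot" share several character n-grams (the set `shared` above), which is one plausible way a character-level component can relate similarly spelled words and reduce the impact of out-of-vocabulary tokens.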

Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Notes

  1. https://www.firstpost.com/tech/news-analysis/india-ranks-third-on-global-cyber-bullying-list-3602419.html

  2. https://www.thehindu.com/news/national/8-out-of-10-indians-have-faced-online-harassment/article19798215.ece

  3. https://feminisminindia.com/2016/11/15/cyber-violence-against-women-india-report/

  4. https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers_in_India

  5. https://en.wikipedia.org/wiki/Hinglish

  6. https://pypi.org/project/GetOldTweets3/

  7. https://github.com/libindic/indic-trans

  8. https://keras.io/

  9. https://www.tensorflow.org/

Author information

Corresponding author

Correspondence to Shikha Mundra.

Ethics declarations

Competing Interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mundra, S., Mittal, N. CMHE-AN: Code mixed hybrid embedding based attention network for aggression identification in hindi english code-mixed text. Multimed Tools Appl 82, 11337–11364 (2023). https://doi.org/10.1007/s11042-022-13668-4

  • DOI: https://doi.org/10.1007/s11042-022-13668-4
