research-article

Ensemble Classifier for Hindi Hostile Content Detection

Published: 15 January 2024

Abstract

Detecting hostile content in social media posts (Facebook, Twitter, etc.) is a demanding task in Natural Language Processing. The growth of hostile content across electronic media has opened up new challenges in language understanding, and the task is harder still in regional languages. AI-based solutions are required to identify hostile content at scale. Although a satisfactory amount of research has been carried out for English, hostile content detection in regional languages is still under development because suitable datasets and tools are unavailable. In terms of the number of speakers, Hindi ranks third in the world and first on the Indian subcontinent. The objective of this article is to design a hostile content detection system for Hindi using coarse-grained (binary) classification and fine-grained (multi-class, multi-label) classification. We note that different baseline learning methods paired with different pre-trained language models perform differently. Using the Constraint 2021 Hindi Dataset, this research proposes a Bidirectional Encoder Representations from Transformers (BERT)-based contextual embedding technique, concatenated with emoji2vec embeddings, to classify social media posts in Hindi Devanagari script as hostile or non-hostile. Additionally, for the fine-grained tasks, where hostile posts are sub-categorized as defamation, fake, hate, and offensive, we develop an ensemble classifier that varies the learning methods and embedding models. With an F1-Score of 0.9721, our proposed Indic-BERT+emoji model outperforms the baseline model and other existing models on the coarse-grained task. Our proposed ensemble method also provides better results than the existing models and the baseline model on the fine-grained tasks, with F1-Scores of 0.43, 0.82, 0.58, and 0.62 for the defamation, fake, hate, and offensive classes, respectively.
The code and the data are available at https://github.com/skarifahmed/hostile.
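
As a rough illustration of the two ideas named in the abstract, the sketch below concatenates a text embedding with an averaged emoji embedding and then combines multi-label predictions from several models. It is a minimal sketch under stated assumptions, not the authors' implementation: the helper names, the toy vectors, and the 2-dimensional emoji table are hypothetical, and the abstract does not say how the ensemble combines its members, so per-label majority voting is used here only as a stand-in.

```python
from typing import Dict, List, Sequence

# All names, dimensions, and vectors below are illustrative assumptions,
# not values taken from the article or its repository.

FINE_LABELS = ("defamation", "fake", "hate", "offensive")


def mean_emoji_vector(text: str, emoji_table: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of the emoji found in `text`; zeros if none occur."""
    found = [emoji_table[ch] for ch in text if ch in emoji_table]
    if not found:
        return [0.0] * dim
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


def concat_features(text_vec: Sequence[float], emoji_vec: Sequence[float]) -> List[float]:
    """Concatenate a contextual text embedding with an emoji embedding."""
    return list(text_vec) + list(emoji_vec)


def majority_vote(predictions: Sequence[Dict[str, int]]) -> Dict[str, int]:
    """Per-label majority vote over multi-label predictions from several models.

    A label is switched on when strictly more than half of the models predict it.
    (Assumed combination rule; the abstract does not specify one.)
    """
    n_models = len(predictions)
    return {
        label: int(sum(p[label] for p in predictions) * 2 > n_models)
        for label in FINE_LABELS
    }


if __name__ == "__main__":
    # Toy 2-dimensional emoji table standing in for emoji2vec vectors.
    emoji_table = {"😡": [1.0, 0.0], "😂": [0.0, 1.0]}
    post = "यह खबर झूठी है 😡😂"
    text_vec = [0.1, 0.2, 0.3]  # stand-in for a pooled BERT sentence embedding

    features = concat_features(text_vec, mean_emoji_vector(post, emoji_table, dim=2))
    print(len(features))  # text dimensions plus emoji dimensions

    votes = [
        {"defamation": 0, "fake": 1, "hate": 1, "offensive": 0},
        {"defamation": 0, "fake": 1, "hate": 0, "offensive": 0},
        {"defamation": 1, "fake": 1, "hate": 0, "offensive": 1},
    ]
    print(majority_vote(votes))
```

In a real pipeline the stand-in text vector would come from a pre-trained encoder and the emoji table from pre-trained emoji2vec vectors; the concatenated features would then feed a classifier head, one per task.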

Cited By

  • (2024) Utilizing Deep Learning for Textual Classification of Hate Speech in Online Social Networks. 2023 4th International Conference on Intelligent Technologies (CONIT), pp. 1-7. DOI: 10.1109/CONIT61985.2024.10627038. Online publication date: 21-Jun-2024
  • (2024) Customizable and Programmable Deep Learning. Pattern Recognition, pp. 101-116. DOI: 10.1007/978-3-031-78107-0_7. Online publication date: 2-Dec-2024

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 23, Issue 1
    January 2024
    385 pages
    EISSN:2375-4702
    DOI:10.1145/3613498

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 January 2024
    Online AM: 10 April 2023
    Accepted: 30 March 2023
    Revised: 06 February 2023
    Received: 09 August 2022
    Published in TALLIP Volume 23, Issue 1

    Author Tags

    1. Hostility detection
    2. NLP
    3. social media
    4. Hindi
    5. defamation
    6. fake
    7. hate
    8. offensive
    9. BERT

    Qualifiers

    • Research-article
