research-article

Artificial Intelligence inspired method for cross-lingual cyberhate detection from low resource languages

Published: 16 August 2024

Abstract

Inflammatory language posted on social media by college and university students is prevalent, prompting platforms to adopt community safety mechanisms. The escalation of hate speech calls for sophisticated artificial intelligence methods, including machine learning and deep learning algorithms, to detect offensive online content. With a few noteworthy exceptions, most studies on automatic hate speech recognition have focused on high-resource languages, mainly English. We bridge this gap by addressing hate speech detection in Punjabi (Gurmukhi), a low-resource Indo-Aryan language spoken in Indian educational institutions. This research identifies cross-lingual hate speech in the code-switched English-Punjabi language used on social media. It proposes an approach that combines the best hate speech detection techniques to address the gaps and limitations of existing state-of-the-art systems. In this method, Roman Punjabi is first transliterated, and then Bidirectional Encoder Representations from Transformers (BERT) based models are employed for hate detection. The proposed model achieved 0.86 precision and 0.83 recall, and higher educational institutions could employ it to discover the issues and domains where hate is most prevalent.
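For readers less familiar with the reported metrics, the following minimal sketch shows how precision and recall are computed from binary hate/non-hate predictions, and what F1 score the reported precision (0.86) and recall (0.83) would imply. The function names and the toy labels are illustrative assumptions, not the authors' evaluation code, and the F1 value is derived arithmetically here rather than reported in the abstract.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive (hate) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = hate, 0 = non-hate.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
p, r = precision_recall(y_true, y_pred)

# F1 implied by the precision and recall stated in the abstract.
implied_f1 = f1_score(0.86, 0.83)
print(f"toy precision={p:.2f}, recall={r:.2f}, implied F1={implied_f1:.3f}")
```

Precision measures how many posts flagged as hate are truly hateful, while recall measures how many hateful posts are caught; the harmonic mean penalizes a large gap between the two.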



      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 23, Issue 9
      September 2024
      186 pages
      EISSN:2375-4702
      DOI:10.1145/3613646

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 16 August 2024
      Online AM: 11 July 2024
      Accepted: 02 June 2024
      Revised: 19 March 2024
      Received: 18 October 2023
      Published in TALLIP Volume 23, Issue 9


      Author Tags

      1. Artificial Intelligence
      2. cross-lingual
      3. cyberhate
      4. low resource languages
      5. social media
