DOI: 10.1145/3582768.3582771
research-article

Hate Speech Detection on Indonesian Social Media: A Preliminary Study on Code-Mixed Language Issue

Published: 27 June 2023

Abstract

Social media has become an important medium for online communication, letting users publish content and express their opinions and feelings on any topic. At the same time, abusive language has become a pressing problem on social media platforms such as Facebook and Twitter. Indonesia consists of several regions, each with its own local languages; a recent report counts 718 local languages used by different regions and tribes in Indonesia. Indonesians tend to mix their local language with Bahasa Indonesia when communicating on social media platforms such as Twitter. As in other languages, code-mixing is therefore a major challenge for detecting hate speech on Indonesian social media. In this study, we conduct a preliminary experiment on hate speech detection in Indonesian social media, specifically Twitter. Our experiment used 6,115 Indonesian-Javanese code-mixed tweets and 2,945 Indonesian-Sundanese code-mixed tweets. Overall, a traditional machine learning model with lexical-based features performed best on the Javanese-Indonesian data, while an LSTM network performed best on the Sundanese-Indonesian data. We also found that translating the code-mixed data into more resource-rich languages did not improve classification performance.
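To make the phrase "traditional machine learning model with lexical-based features" concrete, the sketch below trains a small multinomial Naive Bayes classifier on word unigrams. It is only an illustration: the abstract does not name the classifier or the exact features, so the choice of Naive Bayes, the whitespace tokenizer, the label names, and the four toy sentences are all assumptions and not taken from the study's data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real tweet preprocessing."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over word unigrams with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented placeholder sentences mixing Indonesian and Javanese words;
# they are NOT drawn from the paper's corpus.
train_texts = [
    "kamu bodoh banget",
    "dasar bodoh kowe",
    "selamat pagi semua",
    "matur nuwun semua",
]
train_labels = ["abusive", "abusive", "neutral", "neutral"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("kowe bodoh")
print(prediction)
```

Because the features are plain word counts, a model of this kind treats Indonesian and local-language tokens alike, with no language identification step.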

Cited By

  • (2024) Impact of hate speech in digital media on pre-election public opinion. Jurnal Studi Komunikasi (Indonesian Journal of Communications Studies) 8:3 (607-616). DOI: 10.25139/jsk.v8i3.8247. Online publication date: 25-Nov-2024
  • (2023) Exploring the Impact of Lexicon-based Knowledge Transfer for Hate Speech Detection in Indonesia Code-Mixed Languages. Proceedings of the 2023 7th International Conference on Natural Language Processing and Information Retrieval (85-90). DOI: 10.1145/3639233.3639247. Online publication date: 15-Dec-2023

      Published In

      NLPIR '22: Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval
      December 2022
      241 pages
      ISBN: 9781450397629
      DOI: 10.1145/3582768

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      NLPIR 2022

