ABSTRACT
Hate speech has negative effects on both the targeted victims and those who witness it. Hate speech can spread not only physically or verbally, but also in writing on social media, where it can be difficult to identify. Currently, hate speech detection relies on machine learning. This study generates vector representations of words using three pre-trained word embedding models: Global Vectors (GloVe), FastText, and Bidirectional Encoder Representations from Transformers (BERT). Synthetic Minority Oversampling Technique (SMOTE) and Random Over Sampling (ROS) were used as balancing methods to correct the imbalance between classes. In addition, three deep learning architectures were used to identify sentence-level hate speech in Indonesian tweets: Bidirectional Long Short-Term Memory (BiLSTM), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN). The dataset was collected by crawling tweets via the Twitter API. After the data were preprocessed, features were extracted. Based on the experimental results, the classifier combining RNN with BERT embeddings and SMOTE achieved the highest accuracy (95.5%).
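As an illustration of the simpler of the two balancing methods named in the abstract, the following is a minimal sketch of random oversampling, not the study's implementation. The function name `random_oversample`, the toy tweets, and the label names are hypothetical and chosen only for this example. The sketch duplicates randomly chosen minority-class samples until every class matches the size of the largest class.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    # Start from the original data; only duplicates are appended.
    out_texts, out_labels = list(texts), list(labels)
    for cls, n in counts.items():
        pool = [t for t, y in zip(texts, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_texts.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_texts, out_labels


# Toy data (hypothetical, not taken from the study's dataset).
texts = ["tweet a", "tweet b", "tweet c", "tweet d", "tweet e"]
labels = ["neutral", "neutral", "neutral", "neutral", "hate"]
balanced_texts, balanced_labels = random_oversample(texts, labels)
```

SMOTE differs in that it synthesizes new minority samples by interpolating between neighbouring feature vectors, so it operates on numeric representations such as embeddings rather than on raw text. Oversampling of either kind is usually applied to the training split only, so that duplicated or synthetic samples do not leak into evaluation data.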
Index Terms
- Comparison of Deep Learning Methods in Detecting Hate Speech in Indonesian Tweets