Improved Bi-GRU Model for Imbalanced English Toxic Comments Dataset

Author: Bao Zhang

NLPIR '21: Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval

Pages 24 - 29

https://doi.org/10.1145/3508230.3508234

Published: 08 March 2022

Abstract

Deep learning is widely used for English toxic comment classification. However, most existing studies do not account for data imbalance. Targeting an imbalanced English Toxic Comments Dataset, we propose an improved bidirectional gated recurrent unit (Bi-GRU) model that combines oversampling with a cost-sensitive method. The improved model uses random oversampling to reduce the data imbalance, introduces a cost-sensitive method, and adopts a new loss function for the Bi-GRU model. Experimental results show that the improved Bi-GRU model achieves significantly better classification performance on the imbalanced English Toxic Comments Dataset.
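The two ingredients named in the abstract, random oversampling and a cost-sensitive loss, can be illustrated with a minimal sketch. The abstract does not give the paper's loss function, so the class-weighted binary cross-entropy below is a generic stand-in rather than the proposed loss. The function names, the toy data, and the weight values are all hypothetical.

```python
import math
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw with replacement from the class's own samples.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def weighted_bce(probs, labels, pos_weight=1.0, neg_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy with a separate cost per class.
    Generic cost-sensitive loss; NOT the loss proposed in the paper."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -neg_weight * math.log(1.0 - p)
    return total / len(labels)


# Hypothetical toy data: 1 = toxic (minority), 0 = non-toxic (majority).
texts = ["c0", "c1", "c2", "c3", "c4", "c5", "t0", "t1"]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

bal_texts, bal_labels = random_oversample(texts, labels)
class_counts = Counter(bal_labels)  # each class now has 6 samples

# With pos_weight > 1, a missed toxic comment contributes more to the loss.
plain = weighted_bce([0.2, 0.2], [1, 0])
costly = weighted_bce([0.2, 0.2], [1, 0], pos_weight=5.0)
```

In practice, oversampling would normally be applied to the training split only, so that duplicated comments do not leak into validation or test data.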

Published In

NLPIR '21: Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval

December 2021

175 pages

ISBN:9781450387354

DOI:10.1145/3508230

Copyright © 2021 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Major Project of Natural Science Research Foundation of Education Bureau of Anhui Province, China
Project of University Excellent Talents of Education Bureau of Anhui Province, China

Conference

NLPIR 2021

NLPIR 2021: 2021 5th International Conference on Natural Language Processing and Information Retrieval

December 17 - 20, 2021

Sanya, China
