research-article

Presenting a labelled dataset for real-time detection of abusive user posts

Authors:

Susan Mckeever,

Sarah Jane Delany

WI '17: Proceedings of the International Conference on Web Intelligence

Pages 884 - 890

https://doi.org/10.1145/3106426.3106456

Published: 23 August 2017

Abstract

Social media sites let users post their own comments online. Most support free-format posting with close to real-time publishing speeds. However, posts written by the general public carry the risk of containing inappropriate, potentially abusive content. The straightforward way to detect such content is to filter against blacklists of profane terms, but this lexicon-based filtering is prone to problems with word variations and lack of context. Although recent methods inspired by machine learning have improved detection accuracy, the lack of gold-standard labelled datasets limits the development of this approach. In this work, we present a dataset of user comments labelled through crowdsourcing. Since abusive content can be ambiguous and subjective to the individual reader, we propose an aggregated mechanism for assessing the differing opinions of different labellers. In addition, instead of the typical binary categories of abusive or not, we introduce a third class, 'undecided', to capture the real-life scenario of instances that are neither blatantly abusive nor clearly harmless. We have performed preliminary experiments on this dataset using best-practice techniques in text classification. Finally, we have evaluated the detection performance of various feature groups, namely syntactic, semantic and context-based features. Results show that these features can increase our classifier's performance by 18% in the detection of abusive content.
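The abstract does not spell out how the opinions of several labellers are combined, so the following is only a minimal sketch of one plausible aggregation scheme, not the authors' actual mechanism. It assumes each comment receives a list of votes and that a class is accepted only when its share of the votes reaches a cut-off; otherwise the comment falls into the third 'undecided' class. The label strings, the function name `aggregate_labels` and the `threshold` value of 0.7 are all hypothetical.

```python
from collections import Counter


def aggregate_labels(votes, threshold=0.7):
    """Combine crowd votes for one comment into a single class.

    votes: list of "abusive" / "not_abusive" strings, one per labeller
           (hypothetical label names, not taken from the paper).
    threshold: assumed minimum share of votes needed to accept a class.
    Returns "abusive", "not_abusive", or "undecided".
    """
    if not votes:
        # No opinions at all: nothing can be decided.
        return "undecided"

    counts = Counter(votes)
    label, top_count = counts.most_common(1)[0]

    # Accept the majority class only if agreement is strong enough.
    if label in ("abusive", "not_abusive") and top_count / len(votes) >= threshold:
        return label
    return "undecided"


if __name__ == "__main__":
    print(aggregate_labels(["abusive"] * 4 + ["not_abusive"]))
    print(aggregate_labels(["abusive", "not_abusive"] * 2))
```

Under these assumptions, a comment with four 'abusive' votes out of five is accepted as abusive, while an evenly split comment is left undecided rather than being forced into one of the two binary classes.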

Published In

WI '17: Proceedings of the International Conference on Web Intelligence

August 2017

1284 pages

ISBN:9781450349512

DOI:10.1145/3106426

Conference Chair:
Amit Sheth, Wright State University

General Chairs:
Axel Ngonga, Leipzig University, Germany
Guoyin Wang, Chongqing University of Posts and Telecommunications, China
Elizabeth Chang, The University of New South Wales, Australia
Dominik Ślęzak, Infobright Inc. & University of Warsaw, Poland
Bogdan Franczyk, Leipzig University, Germany

Program Chairs:
Rainer Alt, Leipzig University, Germany
Xiaohui Tao, University of Southern Queensland, Australia

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAI: ACM Special Interest Group on Artificial Intelligence
TCII: IEEE Computer Society Technical Committee on Intelligent Informatics
Web Intelligence Consortium

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WI '17

WI '17: International Conference on Web Intelligence 2017

August 23 - 26, 2017

Leipzig, Germany

Acceptance Rates

WI '17 Paper Acceptance Rate 118 of 178 submissions, 66%;

Overall Acceptance Rate 118 of 178 submissions, 66%
