A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter

Naseem, Usman; Razzak, Imran; Eklund, Peter W.

doi:10.1007/s11042-020-10082-6

Published: 04 November 2020

Volume 80, pages 35239–35266, (2021)

Multimedia Tools and Applications

Usman Naseem¹,
Imran Razzak² &
Peter W. Eklund²

Abstract

Pre-processing plays an essential role in disambiguating the meaning of short texts, not only in applications that classify short texts but also in clustering and anomaly detection. Pre-processing can have a considerable impact on overall system performance; however, it is less explored in the literature than feature extraction and classification. This paper analyzes twelve different pre-processing techniques on three pre-classified Twitter datasets on hate speech and observes their impact on the classification tasks they support. It also proposes a systematic approach to text pre-processing that applies the different techniques in a way that retains features without information loss. Two different word-level feature extraction models are used, and the performance of the proposed package is compared with state-of-the-art methods. To validate gains in performance, both traditional and deep learning classifiers are used. The experimental results suggest that some pre-processing techniques have a negative impact on performance; these are identified, along with the best-performing combination of pre-processing techniques.
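To make the idea of tweet pre-processing concrete, the following is a minimal sketch of a few steps that are common in this line of work: lowercasing, replacing URLs and user mentions with placeholder tokens, stripping the hash sign from hashtags, and shortening elongated character runs. The abstract does not enumerate the twelve techniques studied, so the choice of steps, their order, the function name `preprocess_tweet`, and the `<url>` and `<user>` placeholders are all assumptions made for illustration, not the authors' package.

```python
import re

# Illustrative patterns only; these are assumptions, not the paper's rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
REPEAT_RE = re.compile(r"(.)\1{2,}")   # a character repeated 3 or more times
SPACE_RE = re.compile(r"\s+")


def preprocess_tweet(text: str) -> str:
    """Apply a small, hypothetical chain of tweet-cleaning steps."""
    text = text.lower()                    # case folding
    text = URL_RE.sub("<url>", text)       # replace links with a placeholder
    text = MENTION_RE.sub("<user>", text)  # replace @handles with a placeholder
    text = HASHTAG_RE.sub(r"\1", text)     # keep the hashtag word, drop '#'
    text = REPEAT_RE.sub(r"\1\1", text)    # "sooooo" -> "soo", "!!!" -> "!!"
    return SPACE_RE.sub(" ", text).strip() # collapse whitespace


if __name__ == "__main__":
    print(preprocess_tweet("@Bob Sooooo   GOOD!!! see https://t.co/abc #NoHate"))
```

Replacing URLs and mentions with placeholders, rather than deleting them, is one way to keep the signal that a link or a user was present, in the spirit of retaining features without information loss. Whether a given step helps or hurts a classifier is an empirical question of the kind the paper investigates.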

Notes

Hate speech is defined by Cambridge Dictionary as “public speech that expresses hate or encourages violence towards a person or group based on something such as race, religion, sex, or sexual orientation”.
https://github.com/sloria/TextBlob
http://norvig.com/spell-correct.html
https://www.nltk.org/api/nltk.html
https://github.com/scikit-learn/scikit-learn
https://github.com/explosion/spaCy
https://radimrehurek.com/gensim/
https://stanfordnlp.github.io/CoreNLP/
https://textblob.readthedocs.io/en/dev/
https://github.com/cjlin1/liblinear
https://machinelearningmastery.com/prepare-movie-review-data-sentiment-analysis/
http://noisy-text.github.io/
https://github.com/cbaziotis/ekphrasis
https://pypi.org/project/pycontractions/
https://pythonprogramming.net/lemmatizing-nltk-tutorial/
https://gist.github.com/sebleier/554280
http://norvig.com/spell-correct.html
https://github.com/tweepy/tweepy

Author information

Authors and Affiliations

University of Sydney, Sydney, Australia
Usman Naseem
Deakin University, Geelong, Australia
Imran Razzak & Peter W. Eklund

Corresponding author

Correspondence to Usman Naseem.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Naseem, U., Razzak, I. & Eklund, P.W. A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter. Multimed Tools Appl 80, 35239–35266 (2021). https://doi.org/10.1007/s11042-020-10082-6

Received: 28 April 2020
Revised: 17 August 2020
Accepted: 13 October 2020
Published: 04 November 2020
Issue Date: November 2021
DOI: https://doi.org/10.1007/s11042-020-10082-6
