
BERT- and CNN-based TOBEAT approach for unwelcome tweets detection

  • Original Article
Social Network Analysis and Mining

Abstract

Social media platforms have become an unavoidable part of daily life, and Twitter is one of the major online social networking platforms. It has recently grown exponentially, with rising interest from registered users. This popularity attracts cybercriminals (spammers), who spread malware and advertisements through links shared in tweets and hijack trending topics to capture the attention of legitimate users. These spammers send unwelcome messages known as spam, also called junk e-mail, and carry out other malicious activity. Spam on Twitter has become a pervasive problem that must be solved. Several solutions have been proposed to address it. However, the main existing methods suffer from many limitations and cannot reliably detect spammers on social networks. In this paper, we propose a new approach based on the extraction of new TOpics-Based fEAtures (TOBEAT) from Twitter data. Our approach relies on BERT (bidirectional encoder representations from transformers) and a CNN (convolutional neural network). To implement our solution, we developed a new framework that combines topic-based features with contextual BERT embeddings. The resulting feature vector is then fed into a supervised classifier. Experimental results on a Twitter data collection show that the CNN is the most suitable classifier for the spam filtering task. Moreover, a comparative study shows that, on the Twitter data set, our approach outperforms previously published approaches and achieves 94.97%, 94.05%, 95.88%, 94.95% and 94.92% in accuracy, precision, recall, F1-score, and G-mean, respectively. In terms of time consumption, our approach recorded 0.5164 seconds per training step. In percentage terms, this represents a gain of 82% compared to the TOBEAT-BERT+SVM model, 76.1% compared to the TOBEAT-BERT+NB model, and 70% compared to the TOBEAT-BERT+RF model.
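
The five evaluation measures named in the abstract can all be derived from a binary confusion matrix. The sketch below is illustrative only and is not taken from the article: the function names `binary_metrics` and `relative_gain` are hypothetical, the G-mean is assumed to be the geometric mean of recall and specificity (a common definition that the abstract itself does not state), and the time gain is assumed to be the relative reduction in seconds per training step.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are counts with "spam" treated as the positive class.
    Any ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Assumption: G-mean = sqrt(recall * specificity); the abstract gives no formula.
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": g_mean,
    }


def relative_gain(baseline_seconds: float, ours_seconds: float) -> float:
    """Percentage reduction in time per training step relative to a baseline.

    Assumption: this is how a "gain of X%" in time is interpreted here.
    """
    return 100.0 * (baseline_seconds - ours_seconds) / baseline_seconds
```

With made-up counts such as `binary_metrics(tp=90, fp=10, tn=80, fn=20)`, the function returns all five scores at once, which makes it easy to see how a model can have high recall while a lower specificity pulls the G-mean down.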

Notes

  1. https://infolab.tamu.edu/data/.

Author information

Corresponding author

Correspondence to Sarra Ouni.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ouni, S., Fkih, F. & Omri, M.N. BERT- and CNN-based TOBEAT approach for unwelcome tweets detection. Soc. Netw. Anal. Min. 12, 144 (2022). https://doi.org/10.1007/s13278-022-00970-0

  • DOI: https://doi.org/10.1007/s13278-022-00970-0
