Abstract
The advent of spam on social media platforms has led to a number of problems, not only for social media users but also for researchers mining social media data. While there has been substantial research on automated methods of spam detection on Twitter, research on the lexical content of spam on the platform is limited. A dataset of 301 million generic tweets was filtered through a URL blacklisting service to obtain 7,207 tweets containing links to malicious web pages. These tweets, considered spam, were combined with a random sample of non-spam tweets to obtain an overall dataset of 14,414 tweets. A total of 12 numerical tweet features were used to train and test a Random Forest classifier, which achieved an overall classification accuracy of over 90%. In addition to the numerical features, the text of each tweet was processed to create four frequency-mapped corpora pertaining uniquely to spam and non-spam data. The corpora of words, emoji, numbers, and stop-words for spam and non-spam were plotted against each other to visualize differences in usage between the two groups. A clear distinction between the words and emoji used in spam and non-spam tweets was observed.
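As a rough illustration of the frequency-mapped corpora described above, the following minimal sketch counts words, emoji, numbers, and stop-words separately for a spam and a non-spam group, then pairs the counts so they could be plotted against each other. It is not the authors' implementation: the stop-word list, the emoji code-point heuristic, the tokenizer, the function names, and the example tweets are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "in", "is", "at", "with"}

# Runs of word characters, or any single non-space symbol (emoji, punctuation).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def is_emoji(token):
    """Rough heuristic (assumption): one code point in common emoji blocks."""
    if len(token) != 1:
        return False
    cp = ord(token)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def categorize(token):
    """Assign a token to one of the four corpora, or None to discard it."""
    if is_emoji(token):
        return "emoji"
    if token.isdigit():
        return "numbers"
    if token in STOP_WORDS:
        return "stop_words"
    if token.isalpha():
        return "words"
    return None  # punctuation and mixed tokens are ignored in this sketch


def build_corpora(tweets):
    """Return one frequency map (Counter) per corpus for a group of tweets."""
    corpora = {name: Counter() for name in ("words", "emoji", "numbers", "stop_words")}
    for text in tweets:
        for token in TOKEN_RE.findall(text.lower()):
            category = categorize(token)
            if category is not None:
                corpora[category][token] += 1
    return corpora


def frequency_pairs(spam_counts, ham_counts):
    """Map each token to (spam frequency, non-spam frequency) for plotting."""
    tokens = set(spam_counts) | set(ham_counts)
    return {t: (spam_counts[t], ham_counts[t]) for t in tokens}


# Made-up example tweets; \U0001F381 is a gift emoji, \u2615 a hot beverage.
spam_tweets = ["Win a FREE gift \U0001F381 now", "free followers 100 % free \U0001F381"]
ham_tweets = ["Coffee with the team \u2615", "the game starts at 7"]

spam_corpora = build_corpora(spam_tweets)
ham_corpora = build_corpora(ham_tweets)
word_points = frequency_pairs(spam_corpora["words"], ham_corpora["words"])
```

Each entry of `word_points` gives one point for a scatter plot, with the spam frequency on one axis and the non-spam frequency on the other. A production pipeline would likely need a fuller stop-word list and a proper emoji library, since the code-point ranges here miss multi-code-point emoji.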
Notes
Interactive and non-rasterized versions of language plots can be accessed at http://spam.datalab.science.
Acknowledgements
This research is funded by the NSERC Discovery grant; computing resources are provided by the High Performance Computing (HPC) lab and the Department of Computer Science at Lakehead University, Canada. The authors are grateful to Darryl Willick, HPC Programmer at Lakehead University, for supporting this long-term social media data mining project, and to Andrew Heppner for reviewing and editing the final manuscript.
Cite this article
Robinson, K., Mago, V. Birds of prey: identifying lexical irregularities in spam on Twitter. Wireless Netw 28, 1189–1196 (2022). https://doi.org/10.1007/s11276-018-01900-9
DOI: https://doi.org/10.1007/s11276-018-01900-9