Birds of prey: identifying lexical irregularities in spam on Twitter

Robinson, Kyle; Mago, Vijay

doi:10.1007/s11276-018-01900-9

Published: 11 December 2018

Volume 28, pages 1189–1196, (2022)

Wireless Networks

Abstract

The advent of spam on social media platforms has led to a number of problems, not only for social media users but also for researchers mining social media data. While there has been substantial research on automated methods of spam detection on Twitter, research on the lexical content of spam on the platform is limited. A dataset of 301 million generic tweets was filtered through a URL blacklisting service to obtain 7,207 tweets containing links to malicious web pages. These tweets, considered spam, were combined with a random sample of non-spam tweets to obtain an overall dataset of 14,414 tweets. A total of 12 numerical tweet features were used to train and test a Random Forest algorithm, which achieved an overall classification accuracy of over 90%. In addition to the numerical features, the text of each tweet was processed to create four frequency-mapped corpora pertaining uniquely to spam and non-spam data. The corpora of words, emoji, numbers, and stop-words for spam and non-spam were plotted against each other to visualize differences in usage between the two groups. A clear distinction was observed between the words and emoji used in spam and those used in non-spam tweets.
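As a rough illustration of the frequency-mapped corpora described in the abstract, the following sketch splits tweet text into words, emoji, numbers, and stop-words and counts each. It is not the authors' implementation: the function names, the short stop-word list, the emoji ranges, and the tokenization rules are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not the list used in the paper.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "in", "is", "at", "for", "on"}

# Assumption: a coarse approximation of emoji code points.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
URL_RE = re.compile(r"https?://\S+")
# Lowercase words (with an optional internal apostrophe) or runs of digits.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?|\d+")


def build_corpora(tweets):
    """Return four frequency maps (Counters) for a list of tweet strings."""
    corpora = {name: Counter() for name in ("words", "emoji", "numbers", "stop_words")}
    for tweet in tweets:
        text = URL_RE.sub(" ", tweet.lower())  # drop links before tokenizing
        corpora["emoji"].update(EMOJI_RE.findall(text))
        for token in TOKEN_RE.findall(text):
            if token.isdigit():
                corpora["numbers"][token] += 1
            elif token in STOP_WORDS:
                corpora["stop_words"][token] += 1
            else:
                corpora["words"][token] += 1
    return corpora


def relative_frequencies(counter):
    """Convert raw counts to shares of the corpus total."""
    total = sum(counter.values())
    return {term: count / total for term, count in counter.items()} if total else {}


def compare(spam_counter, ham_counter):
    """Pair each term's relative frequency in spam with its share in non-spam."""
    spam_rel = relative_frequencies(spam_counter)
    ham_rel = relative_frequencies(ham_counter)
    terms = set(spam_rel) | set(ham_rel)
    return {t: (spam_rel.get(t, 0.0), ham_rel.get(t, 0.0)) for t in terms}


if __name__ == "__main__":
    # Made-up sample tweets, for illustration only.
    spam = build_corpora(["WIN a FREE phone \U0001F381 http://bad.example/x"])
    ham = build_corpora(["Lovely walk in the park today \u2600"])
    for term, pair in sorted(compare(spam["words"], ham["words"]).items()):
        print(term, pair)
```

Each pair returned by `compare` gives one term's coordinates for a spam-versus-non-spam plot: terms far from the diagonal are used disproportionately by one group.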

Notes

Interactive and non-rasterized versions of language plots can be accessed at http://spam.datalab.science.

Acknowledgements

This research is funded by the NSERC Discovery grant; computing resources are provided by the High Performance Computing (HPC) lab and the Department of Computer Science at Lakehead University, Canada. The authors are grateful to Darryl Willick, HPC Programmer at Lakehead University, for supporting this long-term social media data mining project, and to Andrew Heppner for reviewing and editing the final manuscript.

Author information

Authors and Affiliations

Lakehead University, Thunder Bay, ON, P7B-5E1, Canada
Kyle Robinson & Vijay Mago

Corresponding author

Correspondence to Vijay Mago.

About this article

Cite this article

Robinson, K., Mago, V. Birds of prey: identifying lexical irregularities in spam on Twitter. Wireless Netw 28, 1189–1196 (2022). https://doi.org/10.1007/s11276-018-01900-9

Issue Date: April 2022