Birds of prey: identifying lexical irregularities in spam on Twitter

Wireless Networks

Abstract

The advent of spam on social media platforms has led to a number of problems, not only for social media users but also for researchers mining social media data. While there has been substantial research on automated methods of spam detection on Twitter, research on the lexical content of spam on the platform is limited. A dataset of 301 million generic tweets was filtered through a URL blacklisting service to obtain 7,207 tweets containing links to malicious web pages. These tweets, considered spam, were combined with a random sample of non-spam tweets to obtain an overall dataset of 14,414 tweets. Twelve numerical tweet features were used to train and test a Random Forest classifier, which achieved an overall classification accuracy of over 90%. In addition to the numerical features, the text of each tweet was processed to create four frequency-mapped corpora pertaining uniquely to spam and non-spam data. The corpora of words, emoji, numbers, and stop-words for spam and non-spam were plotted against each other to visualize differences in usage between the two groups. A clear distinction was observed between the words and emoji used in spam tweets and those used in non-spam tweets.
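
As a rough illustration of the frequency-mapped corpora described in the abstract, the following Python sketch counts words, emoji, numbers, and stop-words separately for a list of tweets, then pairs the counts from the two groups so they could be plotted against each other. It is not the authors' implementation: the stop-word list, the emoji heuristic, the tokenizer, and the function names `build_corpora` and `paired_frequencies` are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch; the list used in the study is not given in the abstract).
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "in", "is", "for", "you"}

# Tokens: URLs, runs of digits, runs of letters, or any single symbol.
TOKEN_RE = re.compile(r"https?://\S+|\d+|[^\W\d_]+|[^\w\s]")


def is_emoji(token):
    """Rough heuristic: one code point inside common emoji blocks."""
    if len(token) != 1:
        return False
    cp = ord(token)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def build_corpora(tweets):
    """Return four frequency maps (Counters) for a list of tweet texts."""
    corpora = {
        "words": Counter(),
        "emoji": Counter(),
        "numbers": Counter(),
        "stop_words": Counter(),
    }
    for tweet in tweets:
        for token in TOKEN_RE.findall(tweet.lower()):
            if token.startswith(("http://", "https://")):
                continue  # links are ignored in this lexical sketch
            if token.isdigit():
                corpora["numbers"][token] += 1
            elif is_emoji(token):
                corpora["emoji"][token] += 1
            elif token in STOP_WORDS:
                corpora["stop_words"][token] += 1
            elif token.isalpha():
                corpora["words"][token] += 1
            # any other symbol (punctuation etc.) is dropped
    return corpora


def paired_frequencies(spam_counts, ham_counts):
    """Map each token to (count in spam, count in non-spam)."""
    tokens = set(spam_counts) | set(ham_counts)
    return {tok: (spam_counts[tok], ham_counts[tok]) for tok in tokens}
```

Calling `build_corpora` once on the spam tweets and once on the non-spam tweets, then passing the matching Counters (for example, the two `"words"` maps) to `paired_frequencies`, yields one coordinate pair per token. Tokens far from the diagonal of such a plot are those used disproportionately by one group.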

Notes

  1. Interactive and non-rasterized versions of language plots can be accessed at http://spam.datalab.science.

Acknowledgements

This research is funded by the NSERC Discovery grant; computing resources are provided by the High Performance Computing (HPC) lab and the Department of Computer Science at Lakehead University, Canada. The authors are grateful to Darryl Willick, HPC Programmer at Lakehead University, for supporting this long-term social media data mining project, and to Andrew Heppner for reviewing and editing the final manuscript.

Author information

Corresponding author

Correspondence to Vijay Mago.

About this article

Cite this article

Robinson, K., Mago, V. Birds of prey: identifying lexical irregularities in spam on Twitter. Wireless Netw 28, 1189–1196 (2022). https://doi.org/10.1007/s11276-018-01900-9

  • DOI: https://doi.org/10.1007/s11276-018-01900-9
