SPADE: a social-spam analytics and detection framework

  • Original Article
Social Network Analysis and Mining

Abstract

Social media such as Facebook, MySpace, and Twitter have become increasingly important, attracting millions of users. Consequently, spammers are increasingly using such networks to propagate spam. Although existing filtering techniques such as collaborative filters and behavioral analysis filters are able to significantly reduce spam, each social network needs to build its own independent spam filter and support a spam team to keep spam prevention techniques current. To alleviate those problems, we propose a framework for spam analytics and detection which can be used across all social network sites. Specifically, the proposed framework, SPADE, has numerous benefits, including: (1) new spam detected on one social network can quickly be identified across social networks; (2) accuracy of spam detection will be improved through cross-domain classification and associative classification; (3) other techniques (such as blacklists and message shingling) can be integrated and centralized; (4) new social networks can plug into the system easily, preventing spam at an early stage. In SPADE, we present a uniform schema model to allow cross-social-network integration; in this paper, we define the user, message, and web page models. Moreover, we provide an experimental study of real datasets from social networks to demonstrate the flexibility and feasibility of our framework. We extensively evaluated two major classification approaches in SPADE: cross-domain classification and associative classification. In cross-domain classification, SPADE achieved an F-measure of over 0.92 and a detection accuracy of over 91% on the web page model using a Naïve Bayes classifier. In associative classification, SPADE achieved an F-measure of 0.89 on the message model and 0.87 on the user profile model. Both detection accuracies exceed 85%. These results demonstrate that SPADE is a competitive spam detection solution for social media.
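To make the idea of a uniform schema more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical network-neutral `Message` model and hypothetical raw record layouts for two networks; every field name, adapter, and the blacklist check is an illustrative assumption rather than part of SPADE's published schema. It shows how one centralized check could run on messages from any network once each network supplies an adapter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Message:
    """Hypothetical network-neutral message model (field names are assumptions)."""
    network: str
    sender_id: str
    text: str
    urls: List[str] = field(default_factory=list)


def from_twitter(raw: dict) -> Message:
    # The raw keys ("user", "tweet", "links") are made up for illustration.
    return Message(
        network="twitter",
        sender_id=str(raw["user"]),
        text=raw["tweet"],
        urls=list(raw.get("links", [])),
    )


def from_myspace(raw: dict) -> Message:
    # The raw keys ("profile_id", "body", "urls") are made up for illustration.
    return Message(
        network="myspace",
        sender_id=str(raw["profile_id"]),
        text=raw["body"],
        urls=list(raw.get("urls", [])),
    )


# Registry of adapters: plugging in a new network means adding one entry here.
ADAPTERS: Dict[str, Callable[[dict], Message]] = {
    "twitter": from_twitter,
    "myspace": from_myspace,
}


def normalize(network: str, raw: dict) -> Message:
    """Map a network-specific record onto the shared model."""
    return ADAPTERS[network](raw)


def is_blacklisted(message: Message, blacklist: Set[str]) -> bool:
    """A single centralized URL blacklist check, independent of the source network."""
    return any(url in blacklist for url in message.urls)
```

In this sketch, a URL flagged from one network's traffic would be caught on any other network whose messages pass through `normalize`, which is the kind of cross-network reuse the abstract describes.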

Notes

  1. Weka is an open-source collection of machine learning algorithms that has become a standard tool in the machine learning community.

  2. https://mahout.apache.org/.

Acknowledgments

This research has been partially funded by the National Science Foundation through the CNS/SAVI (1250260), IUCRC/FRP (1127904), CISE/CNS (1138666), RAPID (1138666), CISE/CRI (0855180), and NetSE (0905493) programs, and by gifts, grants, or contracts from DARPA/I2O, the Singapore Government, Fujitsu Labs, and the Georgia Tech Foundation through the John P. Imlay, Jr. Chair endowment. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the other funding agencies and companies mentioned above.

Author information

Corresponding author

Correspondence to De Wang.

About this article

Cite this article

Wang, D., Irani, D. & Pu, C. SPADE: a social-spam analytics and detection framework. Soc. Netw. Anal. Min. 4, 189 (2014). https://doi.org/10.1007/s13278-014-0189-1

  • DOI: https://doi.org/10.1007/s13278-014-0189-1
