Attribute selection for improving spam classification in online social networks: a rough set theory-based approach

  • Original Article
Social Network Analysis and Mining

Abstract

As online social network (OSN) sites become increasingly popular, they are targeted by spammers who post malicious content on them. Hence, it is important to filter out spam accounts and spam posts from OSNs. Several prior works on spam classification in OSNs utilize various features to distinguish between spam and legitimate entities. The objective of this study is to improve such spam classification by developing an attribute selection methodology that finds a smaller subset of the attributes which leads to better classification. Specifically, we apply the concepts of rough set theory to develop the attribute selection algorithm. We perform experiments on five different spam classification datasets from diverse OSNs and compare the performance of the proposed methodology with that of several baseline attribute selection methodologies. We find that, for most of the datasets, the proposed methodology selects an attribute subset that is smaller than those selected by the baseline methodologies, yet achieves better classification performance.

Notes

  1. http://www.internetlivestats.com/twitter-statistics/.

  2. A binary relation \(R \subseteq U \times U\) is called an equivalence relation if it is reflexive (\((x, x) \in R, \forall x \in U\)), symmetric (\((x, y) \in R\) implies \((y, x) \in R\)), and transitive (\((x, y) \in R\) and \((y, z) \in R\) imply \((x, z) \in R\)). The equivalence class of an element \(x \in U\) consists of all objects \(y \in U\) such that \((x, y) \in R\).

  3. While selecting a node with the highest in-degree, ties, if any, are resolved arbitrarily.

  4. Ties, if any, are resolved arbitrarily.
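The definitions in Note 2 can be illustrated with a small sketch. It checks the three properties of an equivalence relation and collects the equivalence classes for a toy universe. The account names, the attribute values, and the helper names `is_equivalence` and `equivalence_class` are hypothetical and chosen only for illustration. Relating objects that share the same attribute value is the usual rough set construction, but it is assumed here and is not taken from the article's algorithm.

```python
from itertools import product


def is_equivalence(universe, relation):
    """Return True if relation (a set of pairs over universe) is
    reflexive, symmetric and transitive."""
    reflexive = all((x, x) in relation for x in universe)
    symmetric = all((y, x) in relation for (x, y) in relation)
    transitive = all(
        (x, z) in relation
        for (x, y) in relation
        for (y2, z) in relation
        if y == y2
    )
    return reflexive and symmetric and transitive


def equivalence_class(x, universe, relation):
    """Return all y in universe such that (x, y) is in relation."""
    return frozenset(y for y in universe if (x, y) in relation)


# Hypothetical toy data: four accounts with one discretized attribute.
accounts = {"a": "high", "b": "high", "c": "low", "d": "low"}
U = set(accounts)

# Assumption: two accounts are related when their attribute values match.
R = {(x, y) for x, y in product(U, U) if accounts[x] == accounts[y]}

# The distinct equivalence classes partition the universe.
classes = {equivalence_class(x, U, R) for x in U}
```

In this toy setting the classes group together the accounts that cannot be told apart by the chosen attribute.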

Acknowledgements

We thank the anonymous reviewers for their valuable comments and suggestions, which helped to improve the paper. We also acknowledge useful discussions with Arpan Das and Anirban Majumder in the early phases of the work.

Author information

Corresponding author

Correspondence to Soumi Dutta.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (csv 31725 KB)

About this article

Cite this article

Dutta, S., Ghatak, S., Dey, R. et al. Attribute selection for improving spam classification in online social networks: a rough set theory-based approach. Soc. Netw. Anal. Min. 8, 7 (2018). https://doi.org/10.1007/s13278-017-0484-8

  • DOI: https://doi.org/10.1007/s13278-017-0484-8
