Abstract
As online social network (OSN) sites become increasingly popular, they are targeted by spammers who post malicious content on them. Hence, it is important to filter out spam accounts and spam posts from OSNs. Several prior works on spam classification in OSNs use various features to distinguish spam from legitimate entities. The objective of this study is to improve such spam classification by developing an attribute selection methodology that finds a smaller subset of the attributes which leads to better classification. Specifically, we apply the concepts of rough set theory to develop the attribute selection algorithm. We perform experiments on five spam classification datasets drawn from diverse OSNs and compare the performance of the proposed methodology with that of several baseline methodologies for attribute selection. We find that, for most of the datasets, the proposed methodology selects an attribute subset that is smaller than the subsets selected by the baseline methodologies, yet achieves better classification performance.
Notes
A binary relation \(R \subseteq U \times U\) is called an equivalence relation if it is reflexive (\((x, x) \in R, \forall x \in U\)), symmetric (\((x, y) \in R\) implies \((y, x) \in R\)), and transitive (\((x, y) \in R\) and \((y, z) \in R\) imply \((x, z) \in R\)). The equivalence class of an element \(x \in U\) consists of all objects \(y \in U\) such that \((x, y) \in R\).
While selecting a node with the highest in-degree, ties, if any, are resolved arbitrarily.
Ties, if any, are resolved arbitrarily.
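The definition of an equivalence relation given in the first note can be checked mechanically on a small finite universe. The following is a minimal sketch, not the authors' code: the function names, the toy universe, and the attribute values are all hypothetical and chosen only for illustration. It builds a relation in which two objects are related when they share the same attribute value, verifies the three properties, and lists one equivalence class.

```python
from itertools import product


def is_equivalence_relation(universe, relation):
    """Check reflexivity, symmetry and transitivity of `relation` on `universe`.

    `relation` is a set of ordered pairs (x, y) with x, y in `universe`.
    """
    reflexive = all((x, x) in relation for x in universe)
    symmetric = all((y, x) in relation for (x, y) in relation)
    transitive = all(
        (x, z) in relation
        for (x, y) in relation
        for (w, z) in relation
        if y == w
    )
    return reflexive and symmetric and transitive


def equivalence_class(universe, relation, x):
    """Return all objects y in `universe` such that (x, y) is in `relation`."""
    return {y for y in universe if (x, y) in relation}


# Hypothetical toy data: five objects described by a single attribute value.
attribute_value = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
U = set(attribute_value)

# Two objects are related when their attribute values coincide.
R = {
    (x, y)
    for x, y in product(U, repeat=2)
    if attribute_value[x] == attribute_value[y]
}

# Adding a single one-directional pair breaks symmetry.
R_broken = R | {("a", "c")}

print(is_equivalence_relation(U, R))
print(is_equivalence_relation(U, R_broken))
print(sorted(equivalence_class(U, R, "a")))
```

Relating objects that agree on an attribute value is the usual way an indiscernibility relation is formed in rough set theory; the brute-force transitivity check is quadratic in the size of the relation, so it is suited only to small examples such as this one.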
Acknowledgements
We thank the anonymous reviewers for their valuable comments and suggestions, which helped to improve the paper. We also acknowledge useful discussions with Arpan Das and Anirban Majumder in the early phases of the work.
Cite this article
Dutta, S., Ghatak, S., Dey, R. et al. Attribute selection for improving spam classification in online social networks: a rough set theory-based approach. Soc. Netw. Anal. Min. 8, 7 (2018). https://doi.org/10.1007/s13278-017-0484-8
DOI: https://doi.org/10.1007/s13278-017-0484-8