Attribute selection for improving spam classification in online social networks: a rough set theory-based approach

Dutta, Soumi; Ghatak, Sujata; Dey, Ratnadeep; Das, Asit Kumar; Ghosh, Saptarshi

doi:10.1007/s13278-017-0484-8

Original Article
Published: 18 January 2018

Volume 8, article number 7, (2018)

Social Network Analysis and Mining

Soumi Dutta¹,²,
Sujata Ghatak²,
Ratnadeep Dey¹,
Asit Kumar Das¹ &
Saptarshi Ghosh¹,³

Abstract

As online social network (OSN) sites become increasingly popular, they are targeted by spammers who post malicious content. Hence, it is important to filter spam accounts and spam posts out of OSNs. Several prior works on spam classification in OSNs use various features to distinguish spam from legitimate entities. The objective of this study is to improve such spam classification by developing an attribute selection methodology that finds a smaller subset of the attributes that leads to better classification. Specifically, we apply concepts from rough set theory to develop the attribute selection algorithm. We perform experiments on five spam classification datasets from diverse OSNs and compare the proposed methodology with several baseline attribute selection methodologies. We find that, for most of the datasets, the proposed methodology selects a smaller attribute subset than the baselines select, yet achieves better classification performance.
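
The abstract does not spell out the selection algorithm, so the sketch below is only a generic illustration of how rough set theory is commonly used for attribute selection: a greedy forward search driven by the dependency degree (the fraction of objects whose indiscernibility block carries a single class label). It is not the authors' method. The toy table, the attribute names, and the function names are all hypothetical.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices that agree on every attribute in `attrs`."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency(rows, labels, attrs):
    """Fraction of rows lying in blocks that contain a single class label."""
    if not rows:
        return 0.0
    consistent = sum(
        len(block)
        for block in partition(rows, attrs)
        if len({labels[i] for i in block}) == 1
    )
    return consistent / len(rows)

def greedy_reduct(rows, labels, all_attrs):
    """Repeatedly add the attribute that raises the dependency degree most.

    Stops when the full-attribute dependency is reached, or when no single
    attribute improves it (a known limitation of greedy search).
    """
    selected = []
    best = dependency(rows, labels, selected)
    target = dependency(rows, labels, all_attrs)
    while best < target:
        gains = {
            a: dependency(rows, labels, selected + [a])
            for a in all_attrs
            if a not in selected
        }
        attr = max(gains, key=gains.get)
        if gains[attr] <= best:
            break
        selected.append(attr)
        best = gains[attr]
    return selected, best

# Hypothetical toy data: an account is spam iff it posts URLs AND has few followers.
ROWS = [
    {"has_url": 1, "followers_low": 1, "night_post": 0},
    {"has_url": 1, "followers_low": 0, "night_post": 1},
    {"has_url": 0, "followers_low": 1, "night_post": 1},
    {"has_url": 0, "followers_low": 0, "night_post": 0},
    {"has_url": 1, "followers_low": 1, "night_post": 1},
]
LABELS = ["spam", "legit", "legit", "legit", "spam"]
ATTRS = ["has_url", "followers_low", "night_post"]

selected, degree = greedy_reduct(ROWS, LABELS, ATTRS)
print(selected, degree)
```

On this toy table the search keeps the two attributes that jointly determine the label and drops the third, which is the kind of smaller-yet-sufficient subset the abstract describes. Real datasets with continuous attributes would typically need discretization before such a partition-based measure applies.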

Notes

http://www.internetlivestats.com/twitter-statistics/.
A binary relation \(R \subseteq U \times U\) is called an equivalence relation if the relation is reflexive (\((x, x) \in R, \forall x \in U\)), symmetric (\((x, y) \in R\) implies \((y, x) \in R\)), and transitive (\((x, y) \in R\) and \((y, z) \in R\) imply \((x, z) \in R\)). The equivalence class of an element \(x \in U\) consists of all objects \(y \in U\) such that \((x, y) \in R\).
While selecting a node with the highest in-degree, ties, if any, are resolved arbitrarily.
Ties, if any, are resolved arbitrarily.
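
The definition of an equivalence relation and its equivalence classes given in the notes above can be illustrated with a small sketch. It assumes the relation "two objects have identical attribute values", which is the indiscernibility relation usually used in rough set theory; the account identifiers, attribute values, and function names are hypothetical and not taken from the article.

```python
from collections import defaultdict
from itertools import product

def equivalence_classes(universe, key):
    """Partition `universe` into classes of objects that share key(x)."""
    groups = defaultdict(list)
    for x in universe:
        groups[key(x)].append(x)
    return list(groups.values())

def is_equivalence(universe, related):
    """Brute-force check of reflexivity, symmetry and transitivity."""
    reflexive = all(related(x, x) for x in universe)
    symmetric = all(
        related(y, x)
        for x, y in product(universe, repeat=2)
        if related(x, y)
    )
    transitive = all(
        related(x, z)
        for x, y, z in product(universe, repeat=3)
        if related(x, y) and related(y, z)
    )
    return reflexive and symmetric and transitive

# Hypothetical toy data: account id -> (has_url, many_hashtags)
ACCOUNTS = {
    "a1": (1, 0),
    "a2": (1, 0),
    "a3": (0, 1),
    "a4": (0, 0),
}

def same_attributes(x, y):
    """x and y are related when their attribute tuples are identical."""
    return ACCOUNTS[x] == ACCOUNTS[y]

universe = sorted(ACCOUNTS)
classes = equivalence_classes(universe, key=lambda x: ACCOUNTS[x])
print(is_equivalence(universe, same_attributes), classes)
```

By contrast, a relation such as `x <= y` on the identifiers is reflexive and transitive but not symmetric, so `is_equivalence` rejects it and it induces no partition.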

Acknowledgements

We thank the anonymous reviewers for their valuable comments and suggestions, which helped to improve the paper. We also acknowledge useful discussions with Arpan Das and Anirban Majumder in the early phases of the work.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology, Shibpur, India
Soumi Dutta, Ratnadeep Dey, Asit Kumar Das & Saptarshi Ghosh
Department of Computer Science and Engineering, Institute of Engineering & Management, Kolkata, India
Soumi Dutta & Sujata Ghatak
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India
Saptarshi Ghosh

Corresponding author

Correspondence to Soumi Dutta.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (csv 31725 KB)

About this article

Cite this article

Dutta, S., Ghatak, S., Dey, R. et al. Attribute selection for improving spam classification in online social networks: a rough set theory-based approach. Soc. Netw. Anal. Min. 8, 7 (2018). https://doi.org/10.1007/s13278-017-0484-8

Received: 19 June 2017
Revised: 27 December 2017
Accepted: 28 December 2017
Published: 18 January 2018
DOI: https://doi.org/10.1007/s13278-017-0484-8
