Machine intelligence-based algorithms for spam filtering on document labeling

Gaurav, Devottam; Tiwari, Sanju Mishra; Goyal, Ayush; Gandhi, Niketa; Abraham, Ajith

doi:10.1007/s00500-019-04473-7

Methodologies and Application
Published: 02 November 2019

Volume 24, pages 9625–9638 (2020)
Soft Computing


Abstract

The internet provides numerous modes for secure data transmission from one end station to another, and email is one of them. Its popularity stems from its cost-effectiveness and the speed of communication it offers. At the same time, many undesirable emails, called spam, are generated in bulk for monetary benefit. Although people can promptly recognize an email as spam, doing so manually wastes time. Machine learning methods are therefore used to automate the classification task. Because few email spam datasets are available, constrained data and informally written text are the main issues that cause current algorithms to fall short of expectations during classification. This paper proposes a novel spam mail detection method based on the document labeling concept, which classifies new emails as ham or spam. Naive Bayes, Decision Tree and Random Forest (RF) algorithms are used in the classification process. Three datasets are used to evaluate the proposed method. Experimental results show that RF achieves higher accuracy than the other methods.
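
To make the ham/spam labeling task concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier, one of the three algorithm families named in the abstract, written in plain Python. It is not the authors' implementation: the toy emails, the whitespace tokenizer, the add-one smoothing and all function names are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Collect per-label document and word counts from (text, label) pairs."""
    doc_counts = Counter()  # number of training documents per label
    word_counts = {}        # label -> Counter of word frequencies
    vocab = set()           # every word seen during training
    for text, label in labeled_docs:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-posterior for the given text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Start from the log prior of the label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data; the paper's three datasets are not reproduced here.
TOY_EMAILS = [
    ("win cash prize now", "spam"),
    ("cheap loans click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]

model = train_naive_bayes(TOY_EMAILS)
prediction_spam = classify(model, "click now to win a prize")
prediction_ham = classify(model, "agenda for the project meeting")
print(prediction_spam, prediction_ham)  # spam ham
```

On real corpora a tested library implementation, proper tokenization and a held-out evaluation split would be the more sensible choice; the sketch only shows how word counts per label turn into a ham or spam decision.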

Funding

This study was not funded by any grant.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Chandigarh University, Punjab, India
Devottam Gaurav
Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain
Sanju Mishra Tiwari
Department of Electrical Engineering and Computer Science, Texas A&M University - Kingsville, Kingsville, TX, USA
Ayush Goyal
University of Mumbai, Mumbai, India
Niketa Gandhi
Machine Intelligence Research Labs (MIR Labs), Auburn, WA, 98071, USA
Ajith Abraham

Corresponding author

Correspondence to Sanju Mishra Tiwari.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Human and animal rights

No animals were involved. This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Gaurav, D., Tiwari, S.M., Goyal, A. et al. Machine intelligence-based algorithms for spam filtering on document labeling. Soft Comput 24, 9625–9638 (2020). https://doi.org/10.1007/s00500-019-04473-7

Published: 02 November 2019
Issue Date: July 2020
DOI: https://doi.org/10.1007/s00500-019-04473-7
