Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks

Applied Intelligence

Abstract

Rapid growth in the volume of unsolicited and unwanted messages has inspired the development of many anti-spam methods. Supervised anti-spam filters using machine-learning methods have been particularly effective in categorizing spam and non-spam messages. These filters automatically integrate the pre-processing of spam corpora, the selection of appropriate word lists, and the calculation of word weights, usually in a bag-of-words fashion. Developing an accurate spam filter is challenging because spammers attempt to decrease the probability of spam detection by using legitimate words. Complex models are therefore needed to solve this problem. However, existing spam filtering methods usually converge to a poor local minimum, cannot effectively handle high-dimensional data, and suffer from overfitting. To overcome these problems, we propose a novel spam filter integrating N-gram tf.idf feature selection, a modified distribution-based balancing algorithm, and a regularized deep multi-layer perceptron neural network model with rectified linear units (DBB-RDNN-ReL). As demonstrated on four benchmark spam datasets (Enron, SpamAssassin, SMS Spam Collection, and Social networking), the proposed approach captures more complex features from high-dimensional data through additional layers of neurons. Another advantage of this approach is that no additional dimensionality reduction is necessary, and spam dataset imbalance is addressed using the modified distribution-based algorithm. We compare the performance of the approach with that of state-of-the-art spam filters (Minimum Description Length, Factorial Design using SVM and NB, Incremental Learning C4.5, Random Forest, Voting, and Convolutional Neural Network) and with several machine learning algorithms commonly used to classify text. We show that the proposed model outperforms these other methods in terms of classification accuracy, with fewer false negatives and false positives.
Notably, the proposed spam filter classifies both the majority (legitimate) and minority (spam) classes well on personalized and non-personalized, balanced and imbalanced spam datasets. In addition, the proposed model achieves higher accuracy than the results reported in previous studies. However, the high computational cost of the additional hidden layers limits its application as an online spam filter and makes it difficult to overcome the problem of concept drift.
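
To make the N-gram tf.idf weighting mentioned in the abstract concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the weighting variant, so the sketch assumes raw term frequency multiplied by log(N / df), word-level unigrams and bigrams, and whitespace tokenization. The function names `ngrams` and `tfidf` and the toy messages are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(messages, n_max=2):
    """Weight each n-gram by raw term frequency times log(N / df).

    Assumed variant: the abstract does not state which tf.idf formula is used.
    """
    docs = [Counter(ngrams(m.lower().split(), n_max)) for m in messages]
    n_docs = len(docs)
    # Document frequency: number of messages that contain each n-gram.
    df = Counter(term for doc in docs for term in doc)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in doc.items()}
        for doc in docs
    ]


# Toy messages invented for illustration; not drawn from any benchmark corpus.
messages = [
    "win a free prize now",
    "free prize inside",
    "meeting moved to monday",
]
weights = tfidf(messages)
```

Under this variant, an n-gram that occurs in every message receives a weight of zero, while n-grams confined to a few messages are weighted more heavily, which is why such weights are a common input for text classifiers.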

Notes

  1. http://csmining.org/index.php/enron-spam-datasets.html

  2. http://csmining.org/index.php/spam-assassin-datasets.html

  3. https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

  4. http://ilps.science.uva.nl/framework-unsupervised-spam-detection-social-networking-sites/

  5. https://sourceforge.net/projects/weka-mdl-df/

Acknowledgements

We gratefully acknowledge the constructive comments of the anonymous referees.

Author information

Corresponding author

Correspondence to Petr Hajek.

About this article

Cite this article

Barushka, A., Hajek, P. Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks. Appl Intell 48, 3538–3556 (2018). https://doi.org/10.1007/s10489-018-1161-y

  • DOI: https://doi.org/10.1007/s10489-018-1161-y
