Skip to main content
Log in

Term frequency combined hybrid feature selection method for spam filtering

  • Theoretical Advances
  • Published:
Pattern Analysis and Applications Aims and scope Submit manuscript

Abstract

Feature selection is an important technology on improving the efficiency and accuracy of spam filtering. Among the numerous methods, document frequency-based feature selections ignore the effect of term frequency information, thus always deduce unsatisfactory results. In this paper, a hybrid method (called HBM), which combines the document frequency information and term frequency information is proposed. To maintain the category distinguishing ability of the selected features, an optimal document frequency-based feature selection (called ODFFS) is chosen; terms which are indeed discriminative will be selected by ODFFS. For the remaining features, term frequency information is considered and the terms with the highest HBM values are selected. Further, a novel method called feature subset evaluating parameter optimization (FSEPO) is proposed for parameter optimization. Experiments with support vector machine (SVM) and Naïve Bayesian (NB) classifiers are applied on four corpora: PU1, LingSpam, SpamAssian and Trec2007. Six feature selections: information gain, Chi square, improved Gini-index, multi-class odds ratio, normalized term frequency-based discriminative power measure and comprehensively measure feature selection are compared with HBM. Experimental results show that, HBM is significantly superior to other feature selection methods on four corpora when SVM and NB are applied, respectively.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

References

  1. Androutsopoulos I, Koutsias J, Chandrinos KV, Paliouras G, Spyropoulos C (2000) An evaluation of naive Bayesian anti-spam filtering. InL Proceedings of the workshop on machine learning in the new information age

  2. Azam N, Yao J (2012) Comparison of term frequency and document frequency based feature selection metrics in text categorization. Expert Syst Appl 39(5):4760–4768

    Article  Google Scholar 

  3. Bermejo P, Ossa L, Gámez JA, Puerta JM (2012) Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking. Knowl-Based Syst 25(1):35–44

    Article  Google Scholar 

  4. Boubezoul A, Paris S (2012) Application of global optimization methods to model and feature selection. Pattern Recogn 45(10):3676–3686

    Article  MATH  Google Scholar 

  5. Breiman L, Friedman JH, Olshen RA (1984) Classification and regression trees. Wadsworth International Group, Monterey

    MATH  Google Scholar 

  6. Chen CM, Lee HM, Chang YJ (2009) Two novel feature selection approaches for web page classification. Expert Syst Appl 36(1):260–272

    Article  Google Scholar 

  7. Chen JN, Huang HK, Tian SF, Qu YL (2009) Feature selection for text classification with Naïve Bayes. Expert Syst Appl 36(3):5432–5435

    Article  Google Scholar 

  8. Clark J, Koprinska I, Poon J (2003) A neural network based approach to automated e-mail classification. In: Proceedings of the IEEE/WIC international conference on web intelligence (WI 03)

  9. Cormack GV (2007) TREC 2007 spam track overview. In: Proceedings of TREC 2007: the 16th text retrieval conference

  10. Correa RF, Ludermir TB (2006) Improving self-organization of document collections by semantic mapping. Neurocomputing 70(1):62–69

    Article  Google Scholar 

  11. Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett 27(8):861–874

    Article  MathSciNet  Google Scholar 

  12. Forman G (2008) BNS feature scaling: an improved representation over TFIDF for SVM text classification. In: Proceedings of the ACM conference on information and knowledge management. ACM, New York, pp 263–279

  13. Gomez JC, Moens MF (2012) PCA document reconstruction for email classification. Comput Stat Data Anal 56(3):741–751

    Article  MathSciNet  Google Scholar 

  14. Guzella TS, Caminhas WM (2009) A review of machine learning approaches to spam filtering. Expert Syst Appl 36(7):10206–10222

    Article  Google Scholar 

  15. Lee C, Lee GG (2006) Information gain and divergence-based feature selection for machine learning-based text categorization. Inf Process Manag 42(1):155–165

    Article  Google Scholar 

  16. Liu Y, Wang G, Chen H, Dong H, Zhu X, Wang S (2011) An improved particle swarm optimization for feature selection. J Bionic Eng 8(2):191–200

    Article  Google Scholar 

  17. López FR, Jiménez-Salazar H, Pinto D (2007) A competitive term selection method for information retrieval. In: Proceedings of 8th international conference on computational linguistics and intelligent text processing, (CICLing’07), Lecture notes in computer science, vol 4394, pp 468–475

  18. McCallum A, Nigam K (2007) A comparison of event models for naive Bayes text classification. In: EACL ‘03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics, vol 1, pp 307–314

  19. Mengle SSR, Goharian N (2009) Ambiguity measure feature selection algorithm. J Am Soc Inform Sci Technol 60(5):1037–1050

    Article  Google Scholar 

  20. Mladenic D, Grobelnik M (2003) Feature selection on hierarchy of web documents. Decis Support Syst 35(1):45–87

    Article  Google Scholar 

  21. Ogura H, Amano H, Kondo M (2009) Feature selection with a measure of deviations from poisson in text categorization. Expert Syst Appl 36(3):6826–6832

    Article  Google Scholar 

  22. Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106

    Google Scholar 

  23. Ruiz R, Riquelme JC, Aguilar-Ruiz JS, García-Torres M (2012) Fast feature selection aimed at high-dimensional data via hybrid-sequential-ranked searches. Expert Syst Appl 39(12):11094–11102

    Article  Google Scholar 

  24. Salton G, Clement TY (1973) On the construction of effective vocabularies for information retrieval. In: Proceedings of the 1973 meeting on programming languages and information retrieval. ACM, New York, pp 48–60

  25. Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun ACM 18:613–620

    Article  MATH  Google Scholar 

  26. Santos I, Laorden C, Sanz B, Bringas PG (2012) Enhanced topic-based vector space model for semantics-aware spam filtering. Expert Syst Appl 39(1):437–444

    Article  Google Scholar 

  27. Shang W, Huang H, Zhu H, Lin Y, Qu Y, Wang Z (2007) A novel feature selection algorithm for text categorization. Expert Syst Appl 33(1):1–5

    Article  Google Scholar 

  28. SpamAssassin (2005) Spamassassin public corpus. http://spamassassin.apache.org/publiccorpus/. Accessed June 2008

  29. Tezel SK (2009) Improving SVM classification on imbalanced data sets in distance space. Ninth IEEE international conference on data mining

  30. Tretyakov K (2004) Machine learning techniques in spam filtering. Data mining problem-oriented seminar MTAT.03.177, pp 60–79

  31. Willett P (2006) The Porter stemming algorithm: then and now. Progr Electron Libr Inf Syst 40(3):219–223

    MathSciNet  Google Scholar 

  32. Yan J, Liu N, Zhang B, Yan S, Chen Z, Cheng Q (2005) OCFS: optimal orthogonal centroid feature selection for text categorization. In: Proceedings of the 28th annual international ACM Sinformation gainIR conference on research and development in information retrieval, ACM, New York, pp 122–129

  33. Yang J, Liu Y, Liu Z, Zhu X, Zhang X (2011) A new feature selection algorithm based on binomial hypothesis testing for spam filtering. Knowl-Based Syst 24(6):904–914

    Article  Google Scholar 

  34. Yang J, Liu Y, Zhu X, Liu Z, Zhang X (2012) A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization. Inf Process Manag 48(4):741–754

    Article  Google Scholar 

  35. Yang Y, Pedersen J (1997) A comparative study on feature set selection in text categorization, In: Fisher DH (ed) Proceedings of the 14th international conference on machine learning. Morgan Kaufmann, San Francisco, pp 412–420

  36. Youn S, McLeod D (2007) A comparative study for email classification. Advances and innovations in systems, computing sciences and software engineering, pp 387–391

  37. Yu B, Xu Z (2008) A comparative study for content-based dynamic spam classification using four machine learning algorithms. Knowl-Based Syst 21(4):355–362

    Article  Google Scholar 

  38. Yu SN, Lee MY (2012) Conditional mutual information-based feature selection for congestive heart failure recognition using heart rate variability. Comput Methods Programs Biomed 108(1):299–309

    Article  Google Scholar 

  39. Zhang Y, Li S, Wang T, Zhang Z (2012) Divergence-based feature selection for separate classes. Neurocomputing 101(4):32–42

    Google Scholar 

  40. Zhu Y, Tan Y (2011) A local-concentration-based feature extraction approach for spam filtering. IEEE Trans Inf Forensics Secur 6(2):486–497

    Article  Google Scholar 

Download references

Acknowledgments

This research is supported by National Natural Science Foundation of China under Grant No. 60971089 and National Electronic Development Foundation of China under Grant No. 2009537.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Xiaodong Zhu.

Additional information

The authors claim that none of the material in the paper has been published or is under consideration for publication elsewhere.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Liu, Y., Wang, Y., Feng, L. et al. Term frequency combined hybrid feature selection method for spam filtering. Pattern Anal Applic 19, 369–383 (2016). https://doi.org/10.1007/s10044-014-0408-4

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10044-014-0408-4

Keywords

Navigation