Abstract
Feature selection is an important technique for improving the efficiency and accuracy of spam filtering. Among the many available methods, document frequency-based feature selection ignores term frequency information and therefore tends to yield unsatisfactory results. This paper proposes a hybrid method (HBM) that combines document frequency information with term frequency information. To maintain the category-distinguishing ability of the selected features, an optimal document frequency-based feature selection (ODFFS) is chosen, so that terms that are truly discriminative are selected by ODFFS. For the remaining features, term frequency information is taken into account, and the terms with the highest HBM values are selected. In addition, a novel method called feature subset evaluating parameter optimization (FSEPO) is proposed for parameter optimization. Experiments with support vector machine (SVM) and Naïve Bayesian (NB) classifiers are conducted on four corpora: PU1, LingSpam, SpamAssassin and TREC 2007. HBM is compared with six feature selection methods: information gain, chi-square, improved Gini index, multi-class odds ratio, normalized term frequency-based discriminative power measure, and comprehensively measure feature selection. The experimental results show that HBM is significantly superior to the other feature selection methods on all four corpora with both the SVM and NB classifiers.
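The two-stage idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the ODFFS or HBM formulas, so both scoring functions below are placeholder assumptions chosen only to show the difference between document frequency and term frequency. The function names, the toy corpus, and the parameters `n_df` and `n_tf` are likewise hypothetical.

```python
from collections import Counter

def df_and_tf(docs):
    """Return (document frequency, term frequency) Counters for tokenized docs."""
    df, tf = Counter(), Counter()
    for tokens in docs:
        tf.update(tokens)        # every occurrence counts
        df.update(set(tokens))   # each document counts once per term
    return df, tf

def two_stage_select(spam, ham, n_df, n_tf):
    """Stage 1 picks n_df terms by a DF-based score; stage 2 picks n_tf of the
    remaining terms by a TF-based score.

    Both scores are illustrative placeholders, NOT the paper's ODFFS or HBM.
    """
    spam_df, spam_tf = df_and_tf(spam)
    ham_df, ham_tf = df_and_tf(ham)
    vocab = sorted(set(spam_df) | set(ham_df))

    def df_score(t):
        # Placeholder: difference in the fraction of documents containing t.
        return abs(spam_df[t] / len(spam) - ham_df[t] / len(ham))

    def tf_score(t):
        # Placeholder: difference in average occurrences of t per document.
        return abs(spam_tf[t] / len(spam) - ham_tf[t] / len(ham))

    # Ties are broken alphabetically so the result is deterministic.
    stage1 = sorted(vocab, key=lambda t: (-df_score(t), t))[:n_df]
    rest = [t for t in vocab if t not in stage1]
    stage2 = sorted(rest, key=lambda t: (-tf_score(t), t))[:n_tf]
    return stage1, stage2

# Toy corpus: "click" appears in one document of each class, but five times
# in the spam document and only once in the legitimate one.
spam = [["prize", "click", "click", "click", "click", "click"], ["prize", "offer"]]
ham = [["meeting", "click"], ["meeting", "agenda"]]

by_df, by_tf = two_stage_select(spam, ham, n_df=2, n_tf=1)
print(by_df, by_tf)
```

In this toy corpus the document frequency of "click" is identical in both classes, so a purely document frequency-based score cannot separate it, while the term frequency-based second stage picks it up.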
Acknowledgments
This research is supported by the National Natural Science Foundation of China under Grant No. 60971089 and the National Electronic Development Foundation of China under Grant No. 2009537.
Additional information
The authors claim that none of the material in the paper has been published or is under consideration for publication elsewhere.
About this article
Cite this article
Liu, Y., Wang, Y., Feng, L. et al. Term frequency combined hybrid feature selection method for spam filtering. Pattern Anal Applic 19, 369–383 (2016). https://doi.org/10.1007/s10044-014-0408-4
DOI: https://doi.org/10.1007/s10044-014-0408-4