Term frequency combined hybrid feature selection method for spam filtering

Liu, Yuanning; Wang, Youwei; Feng, Lizhou; Zhu, Xiaodong

doi:10.1007/s10044-014-0408-4

Theoretical Advances
Published: 22 August 2014

Volume 19, pages 369–383, (2016)
Pattern Analysis and Applications

Abstract

Feature selection is an important technique for improving the efficiency and accuracy of spam filtering. Among the numerous methods, document frequency-based feature selection methods ignore term frequency information and therefore tend to produce unsatisfactory results. In this paper, a hybrid method (called HBM) that combines document frequency information and term frequency information is proposed. To maintain the category-distinguishing ability of the selected features, an optimal document frequency-based feature selection (called ODFFS) is chosen; terms that are truly discriminative are selected by ODFFS. For the remaining features, term frequency information is considered and the terms with the highest HBM values are selected. Further, a novel method called feature subset evaluating parameter optimization (FSEPO) is proposed for parameter optimization. Experiments with support vector machine (SVM) and Naïve Bayesian (NB) classifiers are conducted on four corpora: PU1, LingSpam, SpamAssassin and TREC 2007. Six feature selection methods are compared with HBM: information gain, Chi-square, improved Gini index, multi-class odds ratio, normalized term frequency-based discriminative power measure and comprehensively measure feature selection. Experimental results show that HBM is significantly superior to the other feature selection methods on the four corpora when SVM and NB are applied, respectively.
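The two-stage idea in the abstract (pick some terms with a document frequency-based score, then fill the rest of the feature set using term frequency information) can be illustrated with a minimal sketch. The abstract does not give the ODFFS, HBM or FSEPO formulas, so everything below is an assumption made for illustration: `df_score` and `tf_score` are simple stand-in scores, and `hybrid_select`, `k` and `n_df` are hypothetical names, with `n_df` fixed by hand rather than optimized.

```python
from collections import Counter


def df_score(term, docs, labels):
    """Stand-in document-frequency score (NOT the paper's ODFFS): gap between
    the share of spam and of legitimate documents that contain the term."""
    spam = [d for d, y in zip(docs, labels) if y == 1]
    ham = [d for d, y in zip(docs, labels) if y == 0]
    p_spam = sum(term in d for d in spam) / len(spam)
    p_ham = sum(term in d for d in ham) / len(ham)
    return abs(p_spam - p_ham)


def tf_score(term, spam_counts, ham_counts):
    """Stand-in term-frequency score (NOT the paper's HBM value): gap between
    the term's relative frequency among spam tokens and legitimate tokens."""
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    return abs(spam_counts[term] / spam_total - ham_counts[term] / ham_total)


def hybrid_select(docs, labels, k, n_df):
    """Select k terms: n_df by document frequency, the rest by term frequency.

    docs is a list of token lists, labels uses 1 for spam and 0 for
    legitimate mail; both classes must be present. n_df is a hypothetical
    hand-set parameter, not something the abstract specifies.
    """
    vocab = sorted({t for d in docs for t in d})
    spam_counts = Counter(t for d, y in zip(docs, labels) if y == 1 for t in d)
    ham_counts = Counter(t for d, y in zip(docs, labels) if y == 0 for t in d)

    # Stage 1: keep the n_df terms with the highest document-frequency score
    # (ties broken alphabetically so the result is deterministic).
    by_df = sorted(vocab, key=lambda t: (-df_score(t, docs, labels), t))
    selected = by_df[:n_df]

    # Stage 2: rank the remaining terms by the term-frequency score and
    # fill the feature set up to k terms.
    remaining = [t for t in vocab if t not in selected]
    by_tf = sorted(remaining,
                   key=lambda t: (-tf_score(t, spam_counts, ham_counts), t))
    return selected + by_tf[: k - n_df]


# Toy corpus, invented for this example only.
docs = [
    ["free", "money", "free", "free"],        # spam
    ["free", "offer"],                        # spam
    ["meeting", "agenda"],                    # legitimate
    ["meeting", "free", "notes", "notes"],    # legitimate
]
labels = [1, 1, 0, 0]
selected = hybrid_select(docs, labels, k=3, n_df=1)
print(selected)  # ['meeting', 'free', 'notes']
```

In this toy corpus, "free" appears in three of the four messages, so a score based only on document presence ranks it no higher than "money" or "offer"; its repeated use inside spam is what lifts it in the second stage. That is the kind of information the abstract says document frequency-based methods ignore.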

Acknowledgments

This research is supported by the National Natural Science Foundation of China under Grant No. 60971089 and the National Electronic Development Foundation of China under Grant No. 2009537.

Author information

Authors and Affiliations

Jilin University, No. 2699, Qianjin Street, Changchun, 130012, Jilin, China
Yuanning Liu, Youwei Wang, Lizhou Feng & Xiaodong Zhu

Corresponding author

Correspondence to Xiaodong Zhu.

Additional information

The authors claim that none of the material in the paper has been published or is under consideration for publication elsewhere.

About this article

Cite this article

Liu, Y., Wang, Y., Feng, L. et al. Term frequency combined hybrid feature selection method for spam filtering. Pattern Anal Applic 19, 369–383 (2016). https://doi.org/10.1007/s10044-014-0408-4

Received: 19 December 2012
Accepted: 22 July 2014
Published: 22 August 2014
Issue Date: May 2016
DOI: https://doi.org/10.1007/s10044-014-0408-4
