A new feature selection algorithm based on binomial hypothesis testing for spam filtering
Introduction
Because the growing volume of unsolicited e-mail, commonly referred to as spam, leads to a tremendous waste of resources [37], many researchers have devoted themselves to distinguishing spam from normal e-mail (legitimate mail, or ham). Among the variety of approaches proposed so far, spam filtering based on the contents of the mail remains one of the most significant.
In spam filtering, appropriate pre-processing steps are required before a mail is fed into a classifier, including tokenization, lemmatization, stop-word removal and representation [15]. Among these, the representation of the mail is particularly important [32]. At present, a mail is usually represented as a vector of weighted terms (words or n-grams) [9], [34]. Converting a mail into such a vector involves two phases. First, a vector space model [15], [40], namely the bag-of-words [39], is built, which covers all unique terms occurring in the training corpus. Second, each mail is mapped to a feature vector based on both the bag-of-words and the contents of the mail [34]. A peculiarity of the vector space model is that the number of features can easily reach the order of tens of thousands even for a moderate-sized training set [22]. This high dimensionality is a major hurdle in applying many sophisticated learning algorithms to text categorization, so dimensionality reduction has become a major research area.
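To make the two-phase representation concrete, here is a minimal sketch; the function names and the binary term weighting are our own illustrative choices, not the paper's:

```python
def build_vocabulary(train_mails):
    """Phase 1: collect all unique tokens occurring in the training corpus."""
    vocab = sorted({token for mail in train_mails for token in mail.split()})
    return {token: idx for idx, token in enumerate(vocab)}

def to_feature_vector(mail, vocab):
    """Phase 2: map a mail onto the vocabulary (binary term weights here)."""
    vector = [0] * len(vocab)
    for token in mail.split():
        if token in vocab:
            vector[vocab[token]] = 1
    return vector

# Example: even two short mails already yield an 8-dimensional space.
train = ["cheap pills buy now", "meeting agenda for monday"]
vocab = build_vocabulary(train)
print(to_feature_vector("buy cheap pills", vocab))
```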
The goal of dimensionality reduction is to shrink the vector space without sacrificing categorization performance, and it is tackled by very different techniques [32]. Feature selection is the most commonly used method in the field of text classification. Blum and Langley [3] grouped feature selection methods into three classes: embedded, wrapper, and filter. The characteristic of the embedded approach is that the feature selection process is embedded directly in the basic induction algorithm. The wrapper approach selects a feature subset by using an evaluation function as a wrapper around the learning algorithm, and the selected features are then used with that same algorithm [19], [26]. The filter approach selects the feature subset with an evaluation function that is independent of the learning method [26]. The filter approach is the most popular and computationally fastest approach to feature selection [12], and Bi-Test, the method proposed in this study, is also a filter approach. There are numerous well-known feature selection algorithms, such as document frequency (DF), information gain (IG), the χ2-statistic [38], cross entropy, odds ratio (OR) [25], mutual information [38], bi-normal separation (BNS) [11], best terms [12], the most relevant with category [6], [24], the improved Gini index [33], the class discriminating measure (CDM) [5], a measure using the Poisson distribution [27], the ambiguity measure (AM) [24], and so on. Most of these methods compute a score based on the probability or frequency of every feature in the bag-of-words, rank the features by that score, and select the top k, as sketched below; the drawback of these measures is precisely that a score must be calculated for every feature and all features must then be ranked.
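The following sketch illustrates the score-and-rank pattern shared by these measures, using document frequency as the scoring function; the helper names are hypothetical, and the pattern, not any particular measure, is the point:

```python
def document_frequency(feature, mails):
    """Score: the number of training mails in which the feature occurs."""
    return sum(1 for mail in mails if feature in mail.split())

def filter_select(features, mails, k):
    """The generic filter pattern: score every feature, rank them all,
    keep the top k -- the two steps Bi-Test is designed to avoid."""
    ranked = sorted(features, key=lambda f: document_frequency(f, mails),
                    reverse=True)   # rank all features by score
    return ranked[:k]               # keep only the top-k features
```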
To tackle this problem, a new feature selection method based on the binomial distribution [29], named Bi-Test, is proposed. It uses binomial hypothesis testing to select features according to the numbers of spams and hams in which a feature occurs, thereby avoiding both the score calculation and the ranking. To evaluate Bi-Test, we used two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVM), on six benchmark spam corpora (pu1, pu2, pu3, pua, lingspam and CSDMC2010), and compared it with four feature selection algorithms (information gain, χ2-statistic, improved Gini index and the Poisson-based measure). The experiments show that Bi-Test performs significantly better than the χ2-statistic and the Poisson-based measure, and achieves performance comparable to information gain and the improved Gini index in terms of the F1 measure when the Naïve Bayes classifier is used; it also achieves performance comparable to the other methods when the SVM classifier is used.
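The underlying idea can be sketched with SciPy's exact binomial test; note that the significance level and the keep/discard rule below are illustrative assumptions on our part, while the paper's precise procedure is defined in Section 3:

```python
from scipy.stats import binomtest  # SciPy >= 1.7

def bi_test_keep(n_spam, n_ham, alpha=0.05):
    """Keep a feature if the binomial test rejects H0: p = 0.5, i.e. the
    feature does NOT occur evenly across classes. n_spam / n_ham are the
    numbers of spam / ham mails containing the feature; alpha is an
    illustrative choice, not the paper's setting."""
    result = binomtest(n_spam, n_spam + n_ham, p=0.5, alternative='two-sided')
    return result.pvalue < alpha  # significant imbalance -> informative feature

# A feature seen in 40 spams but only 2 hams is kept; a balanced one is not.
print(bi_test_keep(40, 2))   # True
print(bi_test_keep(21, 19))  # False
```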
The rest of this paper is organized as follows: Section 2 presents the state of the art in feature selection methods. Section 3 describes and analyzes the basic principle and implementation of the Bi-Test method. The experimental details and results are given in Section 4, and the statistical analysis and discussion are presented in Section 5. Our conclusions and directions for future work are provided in the last section.
Related work
Numerous feature selection methods have been widely applied to text categorization in recent years. Yang and Pedersen [38] concluded that the χ2-statistic and information gain are the most effective measures for automatic text classification. More recently, it has been claimed that the effectiveness of the improved Gini index is comparable to that of the χ2-statistic and information gain [33]. Ogura et al. [27] concluded that their measure based on deviations from the Poisson distribution and the Gini index are substantially superior to information gain and the χ2-statistic.
Motivation
The goal of feature selection in text categorization is to reduce the dimension of the vector space model without compromising the performance of the classifier. Many methods, such as information gain, the χ2-statistic, the improved Gini index, BNS, AM, and so on, measure the category information carried by a feature. BNS [11] assumes that if a feature is frequent in the positive class, the feature contains more positive-class information. Moreover, if a feature belongs to only one category, the feature is highly indicative of that category; this intuition can be formalized as follows.
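In our notation (not the paper's): let a feature t occur in n training mails, n_s of which are spam. Treating each occurrence as a Bernoulli trial with unknown but constant success probability p,

n_s \sim B(n, p), \qquad P(n_s = i) = \binom{n}{i} p^i (1-p)^{n-i},

and a two-sided binomial test of the null hypothesis H_0: p = 1/2 rejects H_0 exactly when the feature is distributed too unevenly between spam and ham to be explained by chance, i.e. when the estimated p is close to 0 or 1.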
Experimental setting
We used six corpora (pu1, pu2, pu3, pua, lingspam and CSDMC2010) and 10-fold cross-validation [32] in this study. The pu1 corpus contains 1090 mails, including 480 spams and 610 hams. The pu2 corpus contains 140 spams and 570 hams. The pu3 corpus contains 4130 mails, including 1280 spams and 2310 hams. The pua corpus contains 570 spams and 570 hams. We discarded the messages found in the "unused" directories of pu1, pu2, pu3 and pua. Tokens are separated by whitespace. A sketch of the evaluation protocol follows.
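For reference, a minimal sketch of the protocol (10-fold stratified cross-validation with NB and SVM, scored by F1) using scikit-learn; the data here are random placeholders standing in for one corpus, not the paper's actual pipeline:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Placeholder bag-of-words vectors with pu1's class sizes (480 spam, 610 ham);
# the real experiments use the features kept by the selection method.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1090, 500))  # 1090 mails, 500 binary features
y = np.array([1] * 480 + [0] * 610)       # 1 = spam, 0 = ham

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in [("NB", MultinomialNB()), ("SVM", LinearSVC())]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(name, scores.mean())  # mean F1 over the 10 folds
```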
Statistical analysis
In order to perform a comprehensive comparison with the other algorithms, we adopt the statistical tests for comparing multiple algorithms over multiple data sets [8], [13], [14], which are essential in typical machine learning studies. Statistics offers powerful specialized procedures for testing the significance of differences between multiple means [10]. In this paper, the Iman and Davenport test [17], which is derived from the Friedman test, is adopted. The Friedman test is a non-parametric equivalent of the repeated-measures ANOVA.
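For completeness, the two statistics take their standard forms (our rendering; N is the number of data sets, k the number of algorithms, and R_j the average rank of algorithm j):

\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right], \qquad F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2},

where F_F follows the F-distribution with k-1 and (k-1)(N-1) degrees of freedom under the null hypothesis that all algorithms perform equally.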
Conclusions
We proposed a novel feature selection method based on binomial-distribution hypothesis testing, named Bi-Test. Any feature extracted from a mail belongs to either a spam or a ham, so each occurrence is a binomial experiment, and every feature in the feature vector space follows a binomial distribution whose probability is unknown but constant. This probability can be assessed by binomial hypothesis testing: if the probability of a feature is close to 0 or 1, it indicates that the feature occurs almost exclusively in one class and therefore carries strong category information.
Acknowledgments
This research is supported by the National Natural Science Foundation of China under Grant No. 60971089 and the National Electronic Development Foundation of China under Grant No. 2009537.
References
- Selection of relevant features and examples in machine learning, Artificial Intelligence (1997)
- Feature selection for text classification with Naive Bayes, Expert Systems with Applications (2009)
- A preprocess algorithm of filtering irrelevant information based on the minimum class difference, Knowledge-Based Systems (2006)
- An introduction to ROC analysis, Pattern Recognition Letters (2006)
- A review of machine learning approaches to spam filtering, Expert Systems with Applications (2009)
- Information gain and divergence-based feature selection for machine learning-based text categorization, Information Processing and Management (2006)
- Feature selection on hierarchy of web documents, Decision Support Systems (2003)
- Feature selection with a measure of deviations from Poisson in text categorization, Expert Systems with Applications (2009)
- New results in modelling derived from Bayesian filtering, Knowledge-Based Systems (2010)
- A novel feature selection algorithm for text categorization, Expert Systems with Applications (2007)
- Recommendation based on rational inferences in collaborative filtering, Knowledge-Based Systems
- Class dependent feature scaling method using Naive Bayes classifier for text datamining, Pattern Recognition Letters
- Combining neural networks and semantic feature space for email classification, Knowledge-Based Systems
- Practical Nonparametric Statistics
- Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research
- Support vector machines for spam categorization, IEEE Transactions on Neural Networks
- An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research