A new feature selection algorithm based on binomial hypothesis testing for spam filtering
Introduction
Because the growing volume of unsolicited e-mail, commonly referred to as spam, leads to a tremendous waste of resources [37], many researchers have devoted themselves to distinguishing spam from normal e-mail (legitimate mail, or ham). Among the variety of approaches proposed so far, spam filtering based on the contents of the mail remains one of the most significant.
In spam filtering, appropriate pre-processing steps are required before a mail is fed into a classifier, including tokenization, lemmatization, stop-word removal and representation [15]. Among these, the representation of the mail is particularly important [32]. At present, a mail is usually represented as a vector of weighted terms (words or n-grams) [9], [34]. Converting a mail into such a vector involves two phases. First, a vector space model [15], [40], namely the bag-of-words [39], is built, which covers all unique terms occurring in the training corpus. Second, each mail is mapped to a feature vector based on both the bag-of-words and the contents of the mail [34]. A peculiarity of the vector space model is that the number of features can easily reach the order of tens of thousands even for a moderate-sized training set [22]. This high dimensionality is a major hurdle in applying many sophisticated learning algorithms to text categorization, so dimensionality reduction has become a major research area.
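To make the two-phase representation concrete, here is a minimal sketch; the function names and the binary term weighting are our own illustrative choices, not the paper's:

```python
def build_vocabulary(train_mails):
    """Phase 1: collect all unique tokens occurring in the training corpus."""
    vocab = sorted({token for mail in train_mails for token in mail.split()})
    return {token: idx for idx, token in enumerate(vocab)}

def to_feature_vector(mail, vocab):
    """Phase 2: map a mail onto the vocabulary (binary term weights here)."""
    vector = [0] * len(vocab)
    for token in mail.split():
        if token in vocab:
            vector[vocab[token]] = 1
    return vector

# Example: even two short mails already yield an 8-dimensional space.
train = ["cheap pills buy now", "meeting agenda for monday"]
vocab = build_vocabulary(train)
print(to_feature_vector("buy cheap pills", vocab))
```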
The goal of dimensionality reduction is to shrink the vector space without sacrificing categorization performance, and it is tackled by very different techniques [32]. Feature selection is the most commonly used method in the field of text classification. Blum and Langley [3] grouped feature selection methods into three classes: embedded, wrapper, and filter. The characteristic of the embedded approach is that the feature selection process is embedded directly in the basic induction algorithm. The wrapper approach selects a feature subset by using an evaluation function as a wrapper around the learning algorithm, and the selected features are then used with that same algorithm [19], [26]. The filter approach selects the feature subset with an evaluation function that is independent of the learning method [26]. The filter approach is the most popular and computationally fastest approach to feature selection [12], and Bi-Test, the method proposed in this study, is also a filter approach. There are numerous well-known feature selection algorithms, such as document frequency (DF), information gain (IG), the χ2-statistic [38], cross entropy, odds ratio (OR) [25], mutual information [38], bi-normal separation (BNS) [11], best terms [12], the most relevant with category [6], [24], the improved Gini index [33], the class discriminating measure (CDM) [5], a measure using the Poisson distribution [27], the ambiguity measure (AM) [24], and so on. Most of these methods compute a score based on the probability or frequency of every feature in the bag-of-words, rank the features by that score, and select the top k, as sketched below; the drawback of these measures is precisely that a score must be calculated for every feature and all features must then be ranked.
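The following sketch illustrates the score-and-rank pattern shared by these measures, using document frequency as the scoring function; the helper names are hypothetical, and the pattern, not any particular measure, is the point:

```python
def document_frequency(feature, mails):
    """Score: the number of training mails in which the feature occurs."""
    return sum(1 for mail in mails if feature in mail.split())

def filter_select(features, mails, k):
    """The generic filter pattern: score every feature, rank them all,
    keep the top k -- the two steps Bi-Test is designed to avoid."""
    ranked = sorted(features, key=lambda f: document_frequency(f, mails),
                    reverse=True)   # rank all features by score
    return ranked[:k]               # keep only the top-k features
```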
To tackle this problem, a new feature selection method based on the binomial distribution [29], named Bi-Test, is proposed. It uses binomial hypothesis testing to select features according to the numbers of spams and hams in which a feature occurs, thereby avoiding both the score calculation and the ranking. To evaluate Bi-Test, we used two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVM), on six benchmark spam corpora (pu1, pu2, pu3, pua, lingspam and CSDMC2010), and compared it with four feature selection algorithms (information gain, χ2-statistic, improved Gini index and the Poisson-based measure). The experiments show that Bi-Test performs significantly better than the χ2-statistic and the Poisson-based measure, and achieves performance comparable to information gain and the improved Gini index in terms of the F1 measure when the Naïve Bayes classifier is used; it also achieves performance comparable to the other methods when the SVM classifier is used.
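The underlying idea can be sketched with SciPy's exact binomial test; note that the significance level and the keep/discard rule below are illustrative assumptions on our part, while the paper's precise procedure is defined in Section 3:

```python
from scipy.stats import binomtest  # SciPy >= 1.7

def bi_test_keep(n_spam, n_ham, alpha=0.05):
    """Keep a feature if the binomial test rejects H0: p = 0.5, i.e. the
    feature does NOT occur evenly across classes. n_spam / n_ham are the
    numbers of spam / ham mails containing the feature; alpha is an
    illustrative choice, not the paper's setting."""
    result = binomtest(n_spam, n_spam + n_ham, p=0.5, alternative='two-sided')
    return result.pvalue < alpha  # significant imbalance -> informative feature

# A feature seen in 40 spams but only 2 hams is kept; a balanced one is not.
print(bi_test_keep(40, 2))   # True
print(bi_test_keep(21, 19))  # False
```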
The rest of this paper is organized as follows: Section 2 presents the state of the art in feature selection methods. Section 3 describes and analyzes the basic principle and implementation of the Bi-Test method. The experimental details and results are given in Section 4, and the statistical analysis and discussion are presented in Section 5. Our conclusions and directions for future work are provided in the last section.
Related work
Numerous feature selection methods have been widely applied to text categorization in recent years. Yang and Pedersen [38] concluded that the χ2-statistic and information gain are the most effective measures for automatic text classification. More recently, it has been claimed that the effectiveness of the improved Gini index is comparable to that of the χ2-statistic and information gain [33]. Ogura et al. [27] concluded that their measure based on deviations from the Poisson distribution and the Gini index are substantially superior to information gain and the χ2-statistic.
Motivation
The goal of feature selection in text categorization is to reduce the dimension of the vector space model without compromising the performance of the classifier. Many methods, such as information gain, the χ2-statistic, the improved Gini index, BNS, AM, and so on, measure the category information carried by a feature. BNS [11] assumes that if a feature is frequent in the positive class, the feature contains more positive-class information. Moreover, if a feature belongs to only one category, the feature is highly indicative of that category; this intuition can be formalized as follows.
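In our notation (not the paper's): let a feature t occur in n training mails, n_s of which are spam. Treating each occurrence as a Bernoulli trial with unknown but constant success probability p,

n_s \sim B(n, p), \qquad P(n_s = i) = \binom{n}{i} p^i (1-p)^{n-i},

and a two-sided binomial test of the null hypothesis H_0: p = 1/2 rejects H_0 exactly when the feature is distributed too unevenly between spam and ham to be explained by chance, i.e. when the estimated p is close to 0 or 1.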
Experimental setting
We used six corpora (pu1, pu2, pu3, pua, lingspam and CSDMC2010) and 10-fold cross-validation [32] in this study. The pu1 corpus contains 1090 mails, including 480 spams and 610 hams. The pu2 corpus contains 140 spams and 570 hams. The pu3 corpus contains 4130 mails, including 1280 spams and 2310 hams. The pua corpus contains 570 spams and 570 hams. We discarded the messages found in the "unused" directories of pu1, pu2, pu3 and pua. Tokens are separated by whitespace. A sketch of the evaluation protocol follows.
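For reference, a minimal sketch of the protocol (10-fold stratified cross-validation with NB and SVM, scored by F1) using scikit-learn; the data here are random placeholders standing in for one corpus, not the paper's actual pipeline:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Placeholder bag-of-words vectors with pu1's class sizes (480 spam, 610 ham);
# the real experiments use the features kept by the selection method.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1090, 500))  # 1090 mails, 500 binary features
y = np.array([1] * 480 + [0] * 610)       # 1 = spam, 0 = ham

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in [("NB", MultinomialNB()), ("SVM", LinearSVC())]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(name, scores.mean())  # mean F1 over the 10 folds
```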
Statistical analysis
In order to perform a comprehensive comparison with the other algorithms, we adopt the statistical tests for comparing multiple algorithms over multiple data sets [8], [13], [14], which are essential in typical machine learning studies. Statistics offers powerful specialized procedures for testing the significance of differences between multiple means [10]. In this paper, the Iman and Davenport test [17], which is derived from the Friedman test, is adopted. The Friedman test is a non-parametric equivalent of the repeated-measures ANOVA.
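For completeness, the two statistics take their standard forms (our rendering; N is the number of data sets, k the number of algorithms, and R_j the average rank of algorithm j):

\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right], \qquad F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2},

where F_F follows the F-distribution with k-1 and (k-1)(N-1) degrees of freedom under the null hypothesis that all algorithms perform equally.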
Conclusions
We proposed a novel feature selection method based on binomial-distribution hypothesis testing, named Bi-Test. Any feature extracted from a mail belongs to either a spam or a ham, so each occurrence is a binomial experiment, and every feature in the feature vector space follows a binomial distribution whose probability is unknown but constant. This probability can be assessed by binomial hypothesis testing: if the probability of a feature is close to 0 or 1, it indicates that the feature occurs almost exclusively in one class and therefore carries strong category information.
Acknowledgments
This research is supported by the National Natural Science Foundation of China under Grant No. 60971089 and the National Electronic Development Foundation of China under Grant No. 2009537.
References
- Selection of relevant features and examples in machine learning, Artificial Intelligence (1997)
- Feature selection for text classification with Naive Bayes, Expert Systems with Applications (2009)
- A preprocess algorithm of filtering irrelevant information based on the minimum class difference, Knowledge-Based Systems (2006)
- An introduction to ROC analysis, Pattern Recognition Letters (2006)
- A review of machine learning approaches to spam filtering, Expert Systems with Applications (2009)
- Information gain and divergence-based feature selection for machine learning-based text categorization, Information Processing and Management (2006)
- Feature selection on hierarchy of web documents, Decision Support Systems (2003)
- Feature selection with a measure of deviations from Poisson in text categorization, Expert Systems with Applications (2009)
- New results in modelling derived from Bayesian filtering, Knowledge-Based Systems (2010)
- A novel feature selection algorithm for text categorization, Expert Systems with Applications (2007)
- Recommendation based on rational inferences in collaborative filtering, Knowledge-Based Systems
- Class dependent feature scaling method using Naive Bayes classifier for text datamining, Pattern Recognition Letters
- Combining neural networks and semantic feature space for email classification, Knowledge-Based Systems
- Practical Nonparametric Statistics
- Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research
- Support vector machines for spam categorization, IEEE Transactions on Neural Networks
- An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research