ABSTRACT
In spam filtering, emails are classified on the basis of a collection of words extracted from the training set. The accuracy and performance of the classifier depend heavily on the features chosen and on the length of the feature space. Feature selection methods are used in such scenarios to identify the best features for classification. In an attempt to develop a strong spam filtering model, we rank the features using the Chi-Square feature ranking method and also investigate the effect of feature length on classification accuracy. The results are promising, and the proposed feature ranking method is more effective than other methods reported in the literature.
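As an illustration of the ranking step described above, the following is a minimal sketch of Chi-Square feature ranking over a two-class corpus. It is not the authors' implementation: the whitespace tokenizer, the binary term-presence counts, the function names `chi_square_scores` and `top_k_features`, and the four toy messages are all assumptions made for the example. Each term is scored from its 2x2 term/class contingency table, and the parameter `k` stands in for the feature length whose effect the abstract mentions.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term by the chi-square statistic of its 2x2 term/class table.

    Assumes two classes labelled "spam" and "ham" and binary term presence.
    """
    n = len(docs)
    n_spam = sum(1 for y in labels if y == "spam")
    n_ham = n - n_spam

    # Document frequencies per class (a term counts once per document).
    spam_df, ham_df = Counter(), Counter()
    for text, y in zip(docs, labels):
        terms = set(text.lower().split())  # assumed tokenizer: whitespace split
        (spam_df if y == "spam" else ham_df).update(terms)

    scores = {}
    for term in set(spam_df) | set(ham_df):
        a = spam_df[term]   # spam documents containing the term
        b = ham_df[term]    # ham documents containing the term
        c = n_spam - a      # spam documents without the term
        d = n_ham - b       # ham documents without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # A zero denominator means the term cannot discriminate at all.
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def top_k_features(docs, labels, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    scores = chi_square_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus, invented for illustration only.
docs = [
    "win free prize now",
    "free offer click now",
    "meeting agenda for monday",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

scores = chi_square_scores(docs, labels)
ranked = top_k_features(docs, labels, 3)
```

Varying `k` and retraining a classifier on the selected terms is one way to examine how feature length affects accuracy.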
Index Terms
- A Novel Data Mining Approach for Detecting Spam Emails using Robust Chi-Square Features