A Novel Data Mining Approach for Detecting Spam Emails using Robust Chi-Square Features

Authors:
Mugdha Sharma
Rukmini Devi Institute of Advanced Studies, 2A & 2B, Phase - 1, Madhuban Chowk, Rohini, Delhi - 110085, India, +91-9711968412

Jasmeen Kaur
Rukmini Devi Institute of Advanced Studies, 2A & 2B, Phase - 1, Madhuban Chowk, Rohini, Delhi - 110085, India, +91-8447304677

WCI '15: Proceedings of the Third International Symposium on Women in Computing and Informatics, August 2015, Pages 49–53
https://doi.org/10.1145/2791405.2791507

Published: 10 August 2015

ABSTRACT

In spam filtering, emails are classified on the basis of a collection of words extracted from the training set. The accuracy and performance of the classifier depend heavily on the features chosen and on the size of the feature space. Feature selection methods are used in this setting to identify the best features for classification. In an attempt to develop a strong spam filtering model, we rank the features using the Chi-Square feature ranking method and investigate the effect of feature length on classification accuracy. The results are promising, and the proposed feature ranking method is more effective than other methods reported in the literature.
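
The abstract does not spell out how the ranking is computed, so the following is only a minimal sketch of generic Chi-Square feature ranking, not the authors' implementation. It assumes binary term presence per email, naive whitespace tokenization, and two classes labelled "spam" and "ham"; the function names `chi_square_scores` and `top_k_features` and the toy emails are hypothetical. For each term it builds the 2x2 table of term present/absent against spam/ham and computes N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D)).

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Return {term: chi-square score} from each term's 2x2 contingency table
    (term present/absent vs. spam/ham). Assumes labels are "spam" or "ham"."""
    n = len(docs)
    n_spam = sum(1 for y in labels if y == "spam")
    n_ham = n - n_spam

    # Document frequencies per class (each term counted once per email).
    spam_df, ham_df = Counter(), Counter()
    for text, y in zip(docs, labels):
        terms = set(text.lower().split())  # assumption: whitespace tokens
        (spam_df if y == "spam" else ham_df).update(terms)

    scores = {}
    for term in set(spam_df) | set(ham_df):
        a = spam_df[term]   # spam emails containing the term
        b = ham_df[term]    # ham emails containing the term
        c = n_spam - a      # spam emails without the term
        d = n_ham - b       # ham emails without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k_features(docs, labels, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    scores = chi_square_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data, invented purely for illustration.
docs = [
    "win free money now",
    "free prize win",
    "meeting agenda for monday",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]
print(top_k_features(docs, labels, 3))
```

Varying `k` and retraining a classifier on the selected terms is one way to study how feature length affects accuracy, which is the second question the abstract raises.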

Index Terms

A Novel Data Mining Approach for Detecting Spam Emails using Robust Chi-Square Features
1. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
      1. Feature selection
2. Information systems
  1. Information systems applications
    1. Data mining
  2. World Wide Web
    1. Web applications
      1. Internet communications tools
        1. Email

Published in
WCI '15: Proceedings of the Third International Symposium on Women in Computing and Informatics
August 2015
763 pages
ISBN: 9781450333610
DOI: 10.1145/2791405
Editor: Indu Nair (SCMS, Kochi, India)
General Chairs: Sushmita Mitra (Indian Statistical Institute, Kolkata, India), Ljiljana Trajković (Simon Fraser University, Canada)
Program Chairs: Punam Bedi (University of Delhi, India), Suzanne McIntosh (New York University and Cloudera Inc., USA), M. S. Rajasree (IIITM-K, Trivandrum, India)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Chi-Square evaluation
Dimensionality reduction
Feature selection
Spam filtering
Text categorization
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
WCI '15 Paper Acceptance Rate: 98 of 452 submissions, 22%
Overall Acceptance Rate: 98 of 452 submissions, 22%