DOI: 10.1145/2791405.2791507
research-article

A Novel Data Mining Approach for Detecting Spam Emails using Robust Chi-Square Features

Published: 10 August 2015

ABSTRACT

In spam filtering, emails are classified on the basis of a collection of words extracted from the training set. The accuracy and performance of the classifier depend heavily on the features chosen and on the length of the feature space. Feature selection methods are used in such scenarios to identify the best features for classification. In an attempt to develop a strong spam filtering model, we rank the features using the Chi-Square feature ranking method and also investigate the effect of feature length on classification accuracy. The results are promising, and the proposed feature ranking method is more effective than other methods reported in the literature.
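
As an illustration of the ranking step described above, here is a minimal sketch of Chi-Square term scoring for a two-class (spam/ham) corpus. It uses the standard 2x2 contingency form of the statistic over binary term presence. It is not taken from the paper: the function names `chi_square_scores` and `top_k_features`, the tokenized input format, and the `"spam"` label are assumptions made for the example, and the paper's "robust" variant may differ in its details.

```python
from collections import Counter


def chi_square_scores(docs, labels, positive="spam"):
    """Score each term by the Chi-Square statistic of its presence vs. the class.

    docs   : list of token lists, one per email (hypothetical input format)
    labels : list of class labels, same length as docs
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = n - n_pos

    # Document frequency of each term within each class (presence, not counts).
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos_df if y == positive else neg_df).update(set(tokens))

    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]   # positive docs containing the term
        b = neg_df[term]   # negative docs containing the term
        c = n_pos - a      # positive docs without the term
        d = n_neg - b      # negative docs without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def top_k_features(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

Varying `k` in `top_k_features` is one simple way to study how feature length affects classification accuracy: train the same classifier on the top 50, 100, 200, ... terms and compare the results.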


        • Published in

          ACM Other conferences
          WCI '15: Proceedings of the Third International Symposium on Women in Computing and Informatics
          August 2015
          763 pages
          ISBN:9781450333610
          DOI:10.1145/2791405

          Copyright © 2015 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Qualifiers

          • research-article
          • Research
          • Refereed limited

          Acceptance Rates

          WCI '15 Paper Acceptance Rate: 98 of 452 submissions, 22%. Overall Acceptance Rate: 98 of 452 submissions, 22%.