Abstract
In this paper we analyse the strengths and weaknesses of the mainly used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and content-based filtering techniques are carried out and discussed. Information Gain, χ 2-text, Mutual Information and Document Frequency feature selection methods have been analysed in conjunction with Naïve Bayes, boosting trees, Support Vector Machines and ECUE models in different scenarios. From the experiments carried out the underlying ideas behind feature selection methods are identified and applied for improving the feature selection process of SpamHunting, a novel anti-spam filtering software able to accurate classify suspicious e-mails.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Spam statistics, http://www.theregister.co.uk/security/spam/
Oard, D.W.: The state of the art in text filtering. User Modeling and User-Adapted Interaction 7, 141–178 (1997)
Wittel, G.L., Wu, S.F.: On Attacking Statistical Spam Filters. In: Proc. of the First Conference on E-mail and Anti-Spam CEAS (2004)
Androutsopoulos, I., Paliouras, G., Michelakis, E.: Learning to Filter Unsolicited Commercial E-Mail. Technical Report 2004/2, NCSR Demokritos (2004)
Méndez, J.R., Iglesias, E.L., Fdez-Riverola, F., Díaz, F., Corchado, J.M.: Analyzing the Impact of Corpus Preprocessing on Anti-Spam Filtering Software. Research on Computing Science (to appear, 2005)
Corchado, J.M., Corchado, E.S., Aiken, J., Fyfe, C., Fdez-Riverola, F., Glez-Bedia, M.: Maximum Likelihood Hebbian Learning Based Retrieval Method for CBR Systems. In: Ashley, K.D., Bridge, D.G. (eds.) ICCBR 2003. LNCS, vol. 2689, pp. 107–121. Springer, Heidelberg (2003)
Corchado, J.M., Aiken, J., Corchado, E., Lefevre, N., Smyth, T.: Quantifying the ocean’s CO2 budget with a coHeL-IBR system. In: Funk, P., González Calero, P.A. (eds.) ECCBR 2004. LNCS, vol. 3155, pp. 533–546. Springer, Heidelberg (2004)
Fdez-Riverola, F., Lorenzo, E.L., Díaz, F., Méndez, J.R., Corchado, J.M.: SpamHunting: An Instance-Based Reasoning System for Spam Labelling and Filtering. Decision Support Systems (to appear, 2006)
Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian approach to filtering junk e-mail. In: Learning for Text Categorization – Papers from the AAAI Workshop, Technical Report WS-98-05, pp. 55–62 (1998)
Carreras, X., Màrquez, L.: Boosting trees for anti-spam e-mail filtering. In: Proc. of the 4th International Conference on Recent Advances in Natural Language Processing, pp. 58–64 (2001)
Vapnik, V.: The Nature of Statistical Learning Theory, 2nd edn. Statistics for Engineering and Information Science (1999)
Delany, S.J., Cunningham, P., Coyle, L.: An Assessment of Case-base Reasoning for Spam Filtering. In: Proc. of Fifteenth Irish Conference on Artificial Intelligence and Cognitive Science: AICS 2004, pp. 9–18 (2004)
Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proc. of the Fourteenth International Conference on Machine Learning: ICML 1997, pp. 412–420 (1997)
Mitchell, T.: Machine Learning. Mc Graw Hill, New York (1996)
Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms and representations for text categorization. In: Proc. of the 7th International Conference on Information and Knowledge Management, pp. 229–237 (1998)
Church, K.W., Hanks, P.: Word association norms, mutual information and lexicography. In: Proc. of the ACL, vol. 27, pp. 76–83 (1989)
Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1), 1–47 (2002)
Drucker, H.D., Wu, D., Vapnik, V.: Support Vector Machines for spam categorization. IEEE Transactions on Neural Networks 10(5), 1048–1054 (1999)
Platt, J.: Fast training of Support Vector Machines using Sequential Minimal Optimization. In: Sholkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods – Support Vector Learning, pp. 185–208 (1999)
Schapire, R.E., Singer, Y.: BoosTexter: a boosting-based system for text categorization. Machine Learning 39(2/3), 135–168 (2000)
Tsymbal, A.: The problem of concept drift: definitions and related work, available at: http://www.cs.tcd.ie
Graham, P.: Better Bayesian filtering. In: Proc. of the MIT Spam Conference (2003)
Kolcz, A., Alspector, J.: SVM-based filtering of e-mail spam with content specific misclassification costs. In: Proc. of the ICDM Workshop on Text Mining (2001)
Hovold, J.: Naïve Bayes Spam Filtering Using Word-Position-Based Attributes. In: Proc. of the Second Conference on Email and Anti-Spam CEAS 2005, http://www.ceas.cc/papers-2005/144.pdf
Gama, J., Castillo, G.: Adaptive Bayes. In: Garijo, F.J., Riquelme, J.-C., Toro, M. (eds.) IBERAMIA 2002. LNCS, vol. 2527, pp. 765–774. Springer, Heidelberg (2002)
Scholz, M., Klinkenberg, R.: An Ensemble Classifier for Drifting Concepts. In: Proc. of the Second International Workshop on Knowledge Discovery from Data Streams, pp. 53–64 (2005)
Syed, N.A., Liu, H., Sung, K.K.: Handling Concept Drifts in Incremental Learning with Support Vector Machines. In: Proc. of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 317–321 (1999)
Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C.D., Stamatopoulos, P.: Learning to Filter Spam E-Mail: A Comparison of a Naïve Bayesian and a Memory-Based Approach. In: Zighed, A.D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 1–13. Springer, Heidelberg (2000)
Daelemans, W., Jakub, Z., Sloot, K., Bosh, A.: TiMBL. Tilburg Memory Based Learning, version 5.1, Reference Guide. ILK, Computational Linguistics, Tilburg University, http://ilk.uvt.nl/software.html#timbl
Lenz, M., Auriol, E., Manago, M.: Diagnosis and Decision Support. In: Lenz, M., Bartsch-Spörl, B., Burkhard, H.-D., Wess, S. (eds.) Case-Based Reasoning Technology. LNCS, vol. 1400, pp. 51–90. Springer, Heidelberg (1998)
Frakes, B., Baeza-Yates, R.: Information Retrieval: Data Structures & Algorithms. Prentice-Hall, Englewood Cliffs (2000)
NIST: National Institute of Science and Technology. Reuters corpora (2004), http://trec.nist.gov/data/reuters/reuters.html
Salton, G., McGill, M.: Introduction to modern information retrieval. McGraw-Hill, New York (1983)
Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C., Stamatopoulos, P.: A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists. Information Retrieval 6(1), 49–73 (2003)
Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proc. of the 14th International Joint Conference on Artificial Intelligence: IJCAI 1995, pp. 1137–1143 (1995)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Méndez, J.R., Fdez-Riverola, F., Díaz, F., Iglesias, E.L., Corchado, J.M. (2006). A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain. In: Perner, P. (eds) Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining. ICDM 2006. Lecture Notes in Computer Science(), vol 4065. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11790853_9
Download citation
DOI: https://doi.org/10.1007/11790853_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-36036-0
Online ISBN: 978-3-540-36037-7
eBook Packages: Computer ScienceComputer Science (R0)