Abstract
Email filters based on learned models should be developed from appropriate training and test sets. In the literature, these data sets are commonly produced by k-fold cross-validation, which mixes old and new messages. We show that this results in overly optimistic estimates of the email filter’s accuracy in classifying future messages, because the training set has a higher probability of containing messages that are similar to those in the test set. We propose a method that preserves the chronology of the email messages in the data sets.
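The contrast between the two sampling strategies can be illustrated with a minimal sketch. It is not the authors' procedure, which the abstract does not specify. The `received` field, the 80/20 split ratio, the toy corpus, and both function names are assumptions made for illustration only.

```python
import random
from datetime import datetime, timedelta


def chronological_split(messages, train_fraction=0.8):
    """Train on the earliest messages, test on the latest ones."""
    ordered = sorted(messages, key=lambda m: m["received"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def shuffled_split(messages, train_fraction=0.8, seed=0):
    """Ignore time: old and new messages are mixed before splitting."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy corpus: one message per day, labels are arbitrary.
start = datetime(2004, 1, 1)
messages = [
    {
        "id": i,
        "received": start + timedelta(days=i),
        "label": "spam" if i % 3 == 0 else "ham",
    }
    for i in range(10)
]

train, test = chronological_split(messages)
mixed_train, mixed_test = shuffled_split(messages)

# In the chronological split, every training message predates every test message.
latest_train = max(m["received"] for m in train)
earliest_test = min(m["received"] for m in test)
print(latest_train < earliest_test)
```

With the chronological split, the filter is always evaluated on messages received after everything it was trained on, which is closer to how a deployed filter meets future mail. The shuffled split gives no such guarantee, so near-duplicates of test messages can end up in the training set.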
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fu, CL., Silver, D. (2004). Time-Sensitive Sampling for Spam Filtering. In: Tawfik, A.Y., Goodwin, S.D. (eds) Advances in Artificial Intelligence. Canadian AI 2004. Lecture Notes in Computer Science(), vol 3060. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24840-8_54
DOI: https://doi.org/10.1007/978-3-540-24840-8_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22004-6
Online ISBN: 978-3-540-24840-8
eBook Packages: Springer Book Archive