Abstract
In this paper, we study the problem of filtering unsolicited bulk email, also known as spam. We apply a k-NN algorithm with a similarity measure called resemblance and compare it with a naive Bayes classifier and with a k-NN algorithm that uses TF-IDF weighting. Experimental evaluation shows that our method produces the lowest-cost results under different cost models of classification. Compared with TF-IDF weighting, our method is more practical in a dynamic environment. Our method also successfully catches a notorious class of spam called picospams. We believe that it will be a useful component of a hybrid classifier.
This research was fully supported by a grant from the Research Grants Council of the Hong Kong SAR, China [CityU 1198/03E].
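The k-NN classification with resemblance described in the abstract can be illustrated with a minimal sketch. It assumes that resemblance is the Jaccard ratio of two documents' sets of w-word shingles, and that a message takes the majority label of its k most resembling training emails. The function names, the whitespace tokenization, the default values of k and w, and the toy messages are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word windows) of text."""
    words = text.lower().split()
    if len(words) < w:
        # Short texts yield a single shingle holding all of their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Jaccard ratio |A & B| / |A | B| of two shingle sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_classify(message, training, k=3, w=3):
    """Label message by majority vote among its k most resembling emails.

    training is a list of (text, label) pairs (hypothetical format).
    """
    target = shingles(message, w)
    scored = sorted(
        ((resemblance(target, shingles(text, w)), label)
         for text, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy training data, invented for illustration only.
training = [
    ("buy cheap pills online now", "spam"),
    ("cheap pills online free shipping", "spam"),
    ("meeting agenda for monday morning", "legit"),
    ("please review the attached project report", "legit"),
]
label = knn_classify("buy cheap pills online today", training, k=3, w=2)
```

Because resemblance is computed pairwise between a new message and the stored emails, adding a training email in this sketch only means adding one more pair to the list; nothing has to be re-weighted across the whole collection.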
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chang, M., Poon, C.K. (2005). Catching the Picospams. In: Hacid, MS., Murray, N.V., Raś, Z.W., Tsumoto, S. (eds) Foundations of Intelligent Systems. ISMIS 2005. Lecture Notes in Computer Science, vol 3488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11425274_66
DOI: https://doi.org/10.1007/11425274_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25878-0
Online ISBN: 978-3-540-31949-8
eBook Packages: Computer Science, Computer Science (R0)