ABSTRACT
We discuss the problem of clustering elements according to the sources that have generated them. For elements that are characterized by independent binary attributes, a closed-form Bayesian solution exists. For the case of dependent attributes, we derive a solution that is based on a transformation of the instances into a space of independent feature functions. To obtain this transformation, we derive an optimization problem whose solution is a mapping into a space of independent binary feature vectors; the features can reflect arbitrary dependencies in the input space. This problem setting is motivated by the application of spam filtering for email service providers. Spam traps deliver a real-time stream of messages known to be spam. If elements of the same campaign can be recognized reliably, entire spam and phishing campaigns can be contained. We present a case study that evaluates Bayesian clustering for this application.
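To make the closed-form case concrete, the following is a minimal sketch of a marginal likelihood for independent binary attributes. It assumes a standard Beta-Bernoulli model with one Beta prior per attribute; the abstract does not state the exact prior or parameterization, so the function names (`log_marginal`, `log_predictive`) and the hyperparameters `alpha` and `beta` are illustrative assumptions and not the paper's notation.

```python
from math import lgamma, exp


def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal(cluster, alpha=1.0, beta=1.0):
    """Log marginal likelihood that all binary vectors in `cluster`
    were generated by a single source.

    Assumption: attributes are independent Bernoulli variables, each
    with its own Beta(alpha, beta) prior, so the integral over the
    Bernoulli parameters has a closed form per attribute.
    """
    if not cluster:
        return 0.0
    n = len(cluster)
    total = 0.0
    for j in range(len(cluster[0])):
        ones = sum(x[j] for x in cluster)   # vectors with attribute j set
        zeros = n - ones
        total += log_beta(alpha + ones, beta + zeros) - log_beta(alpha, beta)
    return total


def log_predictive(x, cluster, alpha=1.0, beta=1.0):
    """Log probability that a new vector `x` comes from the same source
    as `cluster`, as a ratio of two marginal likelihoods."""
    return (log_marginal(cluster + [x], alpha, beta)
            - log_marginal(cluster, alpha, beta))


# Hypothetical toy data: two messages from one campaign, then two candidates.
campaign = [[1, 0, 1], [1, 0, 1]]
similar = [1, 0, 1]
different = [0, 1, 0]
p_similar = exp(log_predictive(similar, campaign))
p_different = exp(log_predictive(different, campaign))
```

Under this assumed model, a message that agrees with the existing campaign on every attribute receives a higher predictive probability than one that disagrees on every attribute. A streaming clusterer could compare such scores across candidate campaigns to decide where to assign an incoming message. The dependent-attribute case described in the abstract would first map messages into independent binary features before a computation of this kind applies.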
Index Terms
- Bayesian clustering for email campaign detection