Abstract
Image spam is a new trend in email spam. Image spam employs a variety of image processing techniques to add random noise to its images. In this paper, we propose a semi-supervised approach, the regularized discriminant EM algorithm (RDEM), to detect image spam emails. It leverages a small amount of labeled data and a large amount of unlabeled data to identify spam and train a classification model simultaneously. Compared with fully supervised learning algorithms, semi-supervised learning is better suited to adversarial classification problems, because spammers actively protect their work by constantly making changes to circumvent spam detection. These constant changes make it too costly for fully supervised learning to frequently collect sufficient labeled data for training. Experimental results demonstrate that our approach achieves a 91.66% detection rate with a false positive rate of less than 2.96%, while significantly reducing the labeling cost.
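The idea of combining a few labeled examples with many unlabeled ones can be illustrated with a plain semi-supervised EM loop. The sketch below is not the RDEM algorithm from the paper: it omits the discriminant projection and the regularization, and it assumes a single scalar feature per image with one Gaussian per class. All function names, the synthetic data, and the 0.5 decision threshold are hypothetical choices made for illustration only.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_spam(x, params):
    """P(spam | x) under the current two-class Gaussian model."""
    p_ham = params[0][0] * gaussian_pdf(x, params[0][1], params[0][2])
    p_spam = params[1][0] * gaussian_pdf(x, params[1][1], params[1][2])
    total = p_ham + p_spam
    return 0.5 if total == 0.0 else p_spam / total  # guard against underflow


def m_step(xs, resp):
    """Re-estimate (prior, mean, variance) per class from soft assignments."""
    total = sum(r[0] + r[1] for r in resp)
    params = []
    for k in (0, 1):
        weight = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, xs)) / weight
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, xs)) / weight
        params.append((weight / total, mean, max(var, 1e-6)))  # variance floor
    return params


def semi_supervised_em(labeled, unlabeled, n_iter=30):
    """labeled: list of (x, y), y in {0: ham, 1: spam}; unlabeled: list of x.
    Assumes at least one labeled example of each class."""
    lab_x = [x for x, _ in labeled]
    lab_r = [[1.0 - y, float(y)] for _, y in labeled]  # labels stay fixed
    params = m_step(lab_x, lab_r)  # initialise from labeled data only
    xs = lab_x + list(unlabeled)
    for _ in range(n_iter):
        # E-step: soft labels for the unlabeled points only.
        unl_r = []
        for x in unlabeled:
            p = posterior_spam(x, params)
            unl_r.append([1.0 - p, p])
        # M-step: refit on labeled (hard) plus unlabeled (soft) assignments.
        params = m_step(xs, lab_r + unl_r)
    return params


def demo(seed=0):
    """Synthetic data (assumption): ham ~ N(0, 1), spam ~ N(5, 1)."""
    rng = random.Random(seed)

    def draw(mean, n):
        return [rng.gauss(mean, 1.0) for _ in range(n)]

    labeled = [(x, 0) for x in draw(0.0, 5)] + [(x, 1) for x in draw(5.0, 5)]
    unlabeled = draw(0.0, 200) + draw(5.0, 200)
    params = semi_supervised_em(labeled, unlabeled)
    test_ham, test_spam = draw(0.0, 200), draw(5.0, 200)
    detection = sum(posterior_spam(x, params) > 0.5 for x in test_spam) / len(test_spam)
    false_pos = sum(posterior_spam(x, params) > 0.5 for x in test_ham) / len(test_ham)
    return detection, false_pos


print("detection rate, false positive rate:", demo())
```

Here only ten points carry labels, and the 400 unlabeled points refine the class models through the E-step. The detection rate and false positive rate returned by `demo` are computed in the same sense as the figures quoted in the abstract, but on synthetic data, so they say nothing about the paper's results.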
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gao, Y., Yang, M., Choudhary, A. (2009). Semi Supervised Image Spam Hunter: A Regularized Discriminant EM Approach. In: Huang, R., Yang, Q., Pei, J., Gama, J., Meng, X., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2009. Lecture Notes in Computer Science, vol 5678. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03348-3_17
DOI: https://doi.org/10.1007/978-3-642-03348-3_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03347-6
Online ISBN: 978-3-642-03348-3
eBook Packages: Computer Science, Computer Science (R0)