Semi Supervised Image Spam Hunter: A Regularized Discriminant EM Approach

Gao, Yan; Yang, Ming; Choudhary, Alok

doi:10.1007/978-3-642-03348-3_17


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5678)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications


Abstract

Image spam is a new trend in the family of email spam. Image spam employs a variety of image processing techniques to add random noise. In this paper, we propose a semi-supervised approach, the regularized discriminant EM algorithm (RDEM), to detect image spam emails. It leverages a small amount of labeled data and a large amount of unlabeled data to identify spam and train a classification model simultaneously. Compared with fully supervised learning algorithms, semi-supervised learning is better suited to adversarial classification problems, because spammers actively protect their work by constantly making changes to circumvent spam detection, which makes it too costly for fully supervised learning to collect sufficient labeled data frequently enough for training. Experimental results demonstrate that our approach achieves a detection rate of 91.66% with a false positive rate below 2.96%, while significantly reducing the labeling cost.
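To make the idea of combining a few labeled examples with many unlabeled ones concrete, the following is a minimal sketch of a generic semi-supervised EM loop. It is not the RDEM algorithm from the paper: it omits the discriminant projection and the regularization, and it assumes a single scalar feature per image with one Gaussian per class. All function names, the `var_floor` parameter, and the synthetic data are hypothetical and chosen only for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def m_step(xs, resp, var_floor):
    """Re-estimate class priors, means and variances from responsibilities."""
    priors, means, variances = [], [], []
    total_weight = sum(r[0] + r[1] for r in resp)
    for k in (0, 1):
        w = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, xs)) / w
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, xs)) / w
        priors.append(w / total_weight)
        means.append(mean)
        variances.append(max(var, var_floor))  # keep variances away from zero
    return priors, means, variances


def semi_supervised_em(labeled, unlabeled, n_iter=50, var_floor=1e-2):
    """labeled: list of (x, y) with y in {0, 1}; unlabeled: list of x.

    Assumes at least one labeled example per class.
    """
    xs = [x for x, _ in labeled] + list(unlabeled)
    # Labeled points keep fixed one-hot responsibilities; unlabeled points
    # start with zero weight so the first M-step uses labeled data only.
    resp = [[1.0 - y, float(y)] for _, y in labeled] + [[0.0, 0.0] for _ in unlabeled]
    params = m_step(xs, resp, var_floor)
    for _ in range(n_iter):
        priors, means, variances = params
        # E-step: soft class assignments for the unlabeled points only.
        for i in range(len(labeled), len(xs)):
            p = [priors[k] * gaussian_pdf(xs[i], means[k], variances[k]) for k in (0, 1)]
            total = p[0] + p[1]
            resp[i] = [p[0] / total, p[1] / total] if total > 0.0 else [0.5, 0.5]
        # M-step: update the model from labeled and unlabeled points together.
        params = m_step(xs, resp, var_floor)
    return params


def predict(x, params):
    """Return 1 (spam) or 0 (non-spam) for a scalar feature value x."""
    priors, means, variances = params
    scores = [priors[k] * gaussian_pdf(x, means[k], variances[k]) for k in (0, 1)]
    return 1 if scores[1] > scores[0] else 0


# Synthetic demo: four labeled points and 400 unlabeled ones.
rng = random.Random(0)
labeled = [(-2.1, 0), (-1.9, 0), (2.0, 1), (2.2, 1)]
unlabeled = [rng.gauss(-2.0, 0.5) for _ in range(200)] + [rng.gauss(2.0, 0.5) for _ in range(200)]
params = semi_supervised_em(labeled, unlabeled)
print(predict(-1.5, params), predict(1.5, params))
```

The labeled points anchor which mixture component corresponds to which class, while the unlabeled points refine the estimated means and variances. This is the sense in which training the classifier and labeling the unlabeled data happen simultaneously.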



Author information

Authors and Affiliations

Dept. of EECS, Northwestern University, Evanston, IL, USA
Yan Gao & Alok Choudhary
NEC Laboratories America, Cupertino, CA, USA
Ming Yang


Editor information

Editors and Affiliations

Knowledge Science & Engineering Institute, School of Education Technology, Beijing Normal University, Xinjiekouwai Ave. 19, 100875, Beijing, China
Ronghuai Huang
The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
Qiang Yang
School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, BC, Canada
Jian Pei
Faculty of Economics, University of Porto, Rua Dr. Roberto Frias, 4200-465, Porto, Portugal
João Gama
School of Information, Zhongguancun, Renmin University, 100872, Beijing, China
Xiaofeng Meng
School of Information Technology and Electrical Engineering, The University of Queensland, 4072, St. Lucia, Queensland, Australia
Xue Li


About this paper

Cite this paper

Gao, Y., Yang, M., Choudhary, A. (2009). Semi Supervised Image Spam Hunter: A Regularized Discriminant EM Approach. In: Huang, R., Yang, Q., Pei, J., Gama, J., Meng, X., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2009. Lecture Notes in Computer Science, vol 5678. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03348-3_17


DOI: https://doi.org/10.1007/978-3-642-03348-3_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03347-6
Online ISBN: 978-3-642-03348-3
eBook Packages: Computer Science, Computer Science (R0)
