Abstract
Image-based spam email can easily circumvent widely used text-based spam filters, and more and more spammers are adopting the technique. The ability to assess the nature of an email from its image content is therefore urgently needed. We propose using OCR (optical character recognition) to extract the embedded text from the images and then assessing the email from the extracted text with the same text-based engine. This approach avoids maintaining a separate image-based detection engine and takes advantage of the strong, reasonably mature text-based engine. Its success relies on the accuracy of the OCR. However, no matter how good an OCR engine is, misrecognition is unavoidable. We therefore also propose a Markov model that can tolerate misspellings. The solution proposed in this paper can be integrated smoothly into existing spam email filters.
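The abstract does not spell out the Markov model, so the following is only a minimal sketch of how a misspelling-tolerant text score might look, assuming a character-trigram model with add-one smoothing. The class name, function name, alphabet size and toy training strings are all hypothetical, and the OCR step is assumed to have already produced the (possibly misrecognised) text.

```python
import math
from collections import Counter


class TrigramModel:
    """Character-level Markov model of order 2 with add-one smoothing (illustrative)."""

    def __init__(self, texts, alphabet_size=128):
        # alphabet_size is an assumed smoothing constant, not a value from the paper.
        self.alphabet_size = alphabet_size
        self.trigrams = Counter()
        self.bigrams = Counter()
        for text in texts:
            padded = "  " + text.lower()  # two spaces give the first character a context
            for i in range(len(padded) - 2):
                self.trigrams[padded[i:i + 3]] += 1
                self.bigrams[padded[i:i + 2]] += 1

    def log_likelihood(self, text):
        """Sum of log P(c_i | c_{i-2}, c_{i-1}) over the characters of text."""
        padded = "  " + text.lower()
        total = 0.0
        for i in range(len(padded) - 2):
            tri = self.trigrams[padded[i:i + 3]]
            bi = self.bigrams[padded[i:i + 2]]
            total += math.log((tri + 1) / (bi + self.alphabet_size))
        return total


def spam_score(text, spam_model, ham_model):
    """Log-likelihood ratio: positive values favour the spam model."""
    return spam_model.log_likelihood(text) - ham_model.log_likelihood(text)


# Toy training data, invented purely for illustration.
spam_model = TrigramModel([
    "buy cheap viagra now",
    "cheap pills online buy now",
    "win money now cheap offer",
])
ham_model = TrigramModel([
    "meeting agenda for monday",
    "please review the attached report",
    "lunch at noon tomorrow",
])

# Strings imitating OCR output with misrecognised characters.
ocr_spam = "buy chaep v1agra n0w"
ocr_ham = "meeting agenda f0r m0nday"
print(spam_score(ocr_spam, spam_model, ham_model))
print(spam_score(ocr_ham, spam_model, ham_model))
```

Because the score accumulates over short character contexts rather than whole words, a single misrecognised character only disturbs the few trigrams that contain it, and the remaining trigrams of the word still contribute evidence.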
This research work is supported by the divisional grants from the Division of Business, Law and Information Sciences, University of Canberra, Australia, and the university grants from University of Canberra, Australia.
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ma, W., Tran, D., Sharma, D. (2007). Detecting Image Based Spam Email. In: Szczuka, M.S., et al. Advances in Hybrid Information Technology. ICHIT 2006. Lecture Notes in Computer Science, vol 4413. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77368-9_17
DOI: https://doi.org/10.1007/978-3-540-77368-9_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77367-2
Online ISBN: 978-3-540-77368-9
eBook Packages: Computer Science, Computer Science (R0)