DOI: 10.1145/2030376.2030386
research-article

Spam detection using web page content: a new battleground

Published: 01 September 2011

Abstract

Traditional content-based e-mail spam filtering takes into account the content of e-mail messages and applies machine learning techniques to infer patterns that discriminate spam from ham. In particular, the use of content-based spam filtering unleashed an unending arms race between spammers and filter developers, given the spammers' ability to continuously change spam message content in ways that might circumvent the current filters. In this paper, we propose to expand the horizons of content-based filters by taking into consideration the content of the Web pages linked by e-mail messages.
We describe a methodology for extracting pages linked by URLs in spam messages and we characterize the relationship between those pages and the messages. We then use a machine learning technique (a lazy associative classifier) to extract classification rules from the Web pages that are relevant to spam detection. We demonstrate that information from linked pages can complement current spam classification techniques, represented here by SpamAssassin. Our study shows that the pages linked by spam messages are a very promising battleground.
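The first step of the methodology, finding the URLs in a message so that the linked pages can be fetched, might be sketched as follows. This is only an illustration under stated assumptions and not the authors' implementation: the function name `extract_urls`, the regular expression, and the choice to scan only text parts are all hypothetical, and the sketch uses nothing beyond the Python standard library.

```python
import re
from email import message_from_string
from typing import List

# Assumption: a simple pattern for http(s) links; real spam often obfuscates URLs.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def extract_urls(raw_message: str) -> List[str]:
    """Return the distinct URLs found in the text parts of a raw e-mail message."""
    msg = message_from_string(raw_message)
    urls: List[str] = []
    for part in msg.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to UTF-8.
            text = payload.decode("utf-8", errors="replace")
        for url in URL_PATTERN.findall(text):
            url = url.rstrip(".,;:!?")  # drop trailing sentence punctuation
            if url not in urls:
                urls.append(url)
    return urls
```

Each returned URL would then be handed to a crawler, and the text of the fetched page would become the input to the classifier; both of those steps are outside this sketch.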

Cited By

  • (2022) Meme Detection of Journalists from Social Media by Using Data Mining Techniques. International Journal of Innovations in Science and Technology. 10.33411/IJIST/20220404024:4 (1055-1069). Online publication date: 7-Nov-2022
  • (2022) Multilayer Perceptron Optimization Approaches for Detecting Spam on Social Media Based on Recursive Feature Elimination. Applications of Artificial Intelligence and Machine Learning. 10.1007/978-981-19-4831-2_41 (501-510). Online publication date: 14-Sep-2022
  • (2014) Evaluation of Content Based Spam Filtering Using Data Mining Approach Applied on Text and Image Corpus. Proceedings of the Third International Conference on Soft Computing for Problem Solving. 10.1007/978-81-322-1771-8_50 (561-577). Online publication date: 4-Mar-2014

Published In

CEAS '11: Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
September 2011
230 pages
ISBN:9781450307888
DOI:10.1145/2030376
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CEAS '11
