Spam detection using web page content: a new battleground

Authors: Marco Túlio Ribeiro, Pedro H. Calais Guerra, Leonardo Vilela, Adriano Veloso, Dorgival Guedes, Wagner Meira, Jr., Marcelo H. P. C. Chaves, Klaus Steding-Jessen, Cristine Hoepers

CEAS '11: Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference

Pages 83 - 91

https://doi.org/10.1145/2030376.2030386

Published: 01 September 2011

Abstract

Traditional content-based e-mail spam filtering takes into account the content of e-mail messages and applies machine learning techniques to infer patterns that discriminate spam from ham. In particular, the use of content-based spam filtering unleashed an unending arms race between spammers and filter developers, given the spammers' ability to continuously change spam message content in ways that might circumvent the current filters. In this paper, we propose to expand the horizons of content-based filters by taking into consideration the content of the Web pages linked by e-mail messages.

We describe a methodology for extracting pages linked by URLs in spam messages, and we characterize the relationship between those pages and the messages. We then use a machine learning technique (a lazy associative classifier) to extract classification rules from the Web pages that are relevant to spam detection. We demonstrate that information from linked pages can complement current spam classification techniques, as exemplified by SpamAssassin. Our study shows that the pages linked by spam messages are a very promising battleground.
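The two steps named in the abstract, pulling URLs out of a message and classifying the linked page with rules generated on demand, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`extract_urls`, `lazy_classify`), the URL pattern, the rule-size limit, the support threshold, and the scoring by averaged rule confidence are all assumptions made for illustration, and fetching and tokenizing the linked pages is omitted. The "lazy" aspect is that rules are mined only from the terms present in the page being classified, rather than from the whole training set in advance.

```python
import re
from collections import Counter, defaultdict
from email import message_from_string
from itertools import combinations

# Hypothetical URL pattern; real messages may need more careful parsing.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def extract_urls(raw_message):
    """Return the distinct URLs found in the text parts of an e-mail, in order."""
    msg = message_from_string(raw_message)
    urls = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name in the header
            text = payload.decode("utf-8", errors="replace")
        for url in URL_RE.findall(text):
            url = url.rstrip(".,;")  # drop trailing sentence punctuation
            if url not in urls:
                urls.append(url)
    return urls


def lazy_classify(train, page_terms, max_rule_size=2, min_support=1):
    """Classify one page by mining rules only from its own terms.

    train: list of (set_of_terms, label) pairs.
    Returns (predicted_label, scores); scores maps each label to the
    confidence of the matching rules averaged over all rules kept.
    """
    terms = sorted(set(page_terms))
    term_set = set(terms)
    # Project the training data onto the terms of the page being classified.
    projected = [(feats & term_set, label) for feats, label in train]
    totals = defaultdict(float)
    n_rules = 0
    for size in range(1, max_rule_size + 1):
        for itemset in combinations(terms, size):
            items = set(itemset)
            matching = [label for feats, label in projected if items <= feats]
            if len(matching) < min_support:
                continue  # too few training pages contain this itemset
            n_rules += 1
            for label, count in Counter(matching).items():
                # Confidence of the rule "itemset -> label".
                totals[label] += count / len(matching)
    if n_rules == 0:
        return None, {}
    scores = {label: total / n_rules for label, total in totals.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    raw = (
        "From: a@example.com\nSubject: hi\n\n"
        "Buy now at http://example.com/shop. See https://example.org/x?id=1\n"
    )
    print(extract_urls(raw))

    train = [
        ({"viagra", "pills", "cheap"}, "spam"),
        ({"cheap", "pills", "order"}, "spam"),
        ({"meeting", "agenda"}, "ham"),
        ({"cheap", "flights", "agenda"}, "ham"),
    ]
    print(lazy_classify(train, {"cheap", "pills"}))
```

In practice the terms passed to `lazy_classify` would come from the text of the fetched page, and the thresholds would need tuning on labeled data.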

Cited By

Khan S, Ashraf A, Shoaib M, Iftikhar M, Siddiq I, Khan M, Faisal A (2022). Meme Detection of Journalists from Social Media by Using Data Mining Techniques. International Journal of Innovations in Science and Technology 4:4 (1055-1069). Online publication date: 7-Nov-2022.
https://doi.org/10.33411/IJIST/2022040402

Garg P, Singh S (2022). Multilayer Perceptron Optimization Approaches for Detecting Spam on Social Media Based on Recursive Feature Elimination. Applications of Artificial Intelligence and Machine Learning (501-510). Online publication date: 14-Sep-2022.
https://doi.org/10.1007/978-981-19-4831-2_41

Sharma A, Kaur P, Anand S (2014). Evaluation of Content Based Spam Filtering Using Data Mining Approach Applied on Text and Image Corpus. Proceedings of the Third International Conference on Soft Computing for Problem Solving (561-577). Online publication date: 4-Mar-2014.
https://doi.org/10.1007/978-81-322-1771-8_50

Published In

CEAS '11: Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference

September 2011

230 pages

ISBN:9781450307888

DOI:10.1145/2030376

General Chair:
Vidyasagar Potdar
Curtin University, Australia

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Conselho Nacional de Desenvolvimento Científico e Tecnológico

Conference

CEAS '11

CEAS '11: The 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference

September 1 - 2, 2011

Perth, Australia
