research-article

Beyond blacklists: learning to detect malicious web sites from suspicious URLs

Authors:

Lawrence K. Saul,

Geoffrey M. Voelker

KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 1245 - 1254

https://doi.org/10.1145/1557019.1557153

Published: 28 June 2009

Abstract

Malicious Web sites are a cornerstone of Internet criminal activities. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. In this paper, we describe an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs. These methods are able to learn highly predictive models by extracting and automatically analyzing tens of thousands of features potentially indicative of suspicious URLs. The resulting classifiers obtain 95-99% accuracy, detecting large numbers of malicious Web sites from their URLs, with only modest false positives.
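
The idea of classifying a URL from its lexical properties can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names (`host=...`, `path=...`, `host_dots`, `url_len`), the toy URLs, and the choice of plain logistic regression trained by stochastic gradient descent are all assumptions made for illustration. Host-based properties are left out because they would require network lookups.

```python
import math
import re
from urllib.parse import urlparse

# Toy, hand-written URLs (hypothetical, not data from the paper). 1 = malicious.
TOY_DATA = [
    ("http://www.example.com/index.html", 0),
    ("http://docs.example.org/guide/intro", 0),
    ("http://news.example.net/2009/06/story", 0),
    ("http://example.com.login.account-update.test/signin/verify.php", 1),
    ("http://secure.banking.example.invalid/account/login/confirm.php", 1),
    ("http://192.0.2.7/webscr/login/update.php", 1),
]


def lexical_features(url):
    """Map a URL to a sparse dict: bag-of-words tokens plus two simple counts."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path = parsed.path or ""
    feats = {"bias": 1.0}
    # Hostname and path tokens are kept as separate features, so the same
    # word in the hostname and in the path is counted differently.
    for tok in filter(None, host.split(".")):
        feats["host=" + tok.lower()] = 1.0
    for tok in filter(None, re.split(r"[/?.=\-_&]", path)):
        feats["path=" + tok.lower()] = 1.0
    feats["host_dots"] = float(host.count("."))
    feats["url_len"] = len(url) / 100.0  # crude scaling, chosen arbitrarily
    return feats


def predict_proba(weights, feats):
    """Logistic model: estimated probability that the URL is malicious."""
    z = sum(weights.get(name, 0.0) * value for name, value in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=300, lr=0.1, l2=1e-3):
    """Fit sparse weights with L2-regularized stochastic gradient descent."""
    weights = {}
    for _ in range(epochs):
        for url, label in examples:
            feats = lexical_features(url)
            err = predict_proba(weights, feats) - label
            for name, value in feats.items():
                w = weights.get(name, 0.0)
                weights[name] = w - lr * (err * value + l2 * w)
    return weights


if __name__ == "__main__":
    model = train(TOY_DATA)
    for url, label in TOY_DATA:
        score = predict_proba(model, lexical_features(url))
        print(f"label={label} score={score:.2f} {url}")
```

With only six hand-picked URLs the model does little more than memorize tokens. The sketch shows the shape of the feature representation, not the accuracy reported in the abstract.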

Index Terms

Beyond blacklists: learning to detect malicious web sites from suspicious URLs
1. Security and privacy
2. Social and professional topics
  1. Computing / technology policy
    1. Computer crime

Published In

KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

June 2009

1426 pages

ISBN: 9781605584959

DOI: 10.1145/1557019

General Chairs: John Elder (Elder Research, Inc., USA), Françoise Soulié Fogelman (KXEN, France)

Program Chairs: Peter Flach (University of Bristol, UK), Mohammed Zaki (RPI, USA)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

KDD09

Sponsor:

KDD09: The 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

June 28 - July 1, 2009

Paris, France

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 524
Total Downloads: 3,272
Downloads (Last 12 months): 104
Downloads (Last 6 weeks): 14

Reflects downloads up to 03 Mar 2025

Citations

Cited By

Luo E, Young L, Ho G, Afifi M, Schweighauser M, Katz-Bassett E, Cidon A. (2025). Characterizing the Networks Sending Enterprise Phishing Emails. Passive and Active Measurement, 437-466. Online publication date: 7-Mar-2025.
https://doi.org/10.1007/978-3-031-85960-1_18
Mohanty S, Nanda S, Rout R, Kumar A, Acharya A, Panda N, Agrawal V. (2024). Detection of Cyber Threats From Suspicious URLs Using Multi-Classification Approach. Sustainable Science and Intelligent Technologies for Societal Development, 107-129. Online publication date: 5-Jan-2024.
https://doi.org/10.4018/979-8-3693-1186-8.ch007
Chughtai M, Bibi I, Karim S, Shah S, Laghari A, Khan A. (2024). Deep learning trends and future perspectives of web security and vulnerabilities. Journal of High Speed Networks, 30:1, 115-146. Online publication date: 1-Jan-2024.
https://dl.acm.org/doi/10.3233/JHS-230037
Phung N, Mimura M. (2024). Malicious JavaScript Detection in Realistic Environments with SVM and MLP Models. Journal of Information Processing, 32, 748-756. Online publication date: 2024.
https://doi.org/10.2197/ipsjjip.32.748
Fu B, Wang Y, Feng T. (2024). CT-GCN+: a high-performance cryptocurrency transaction graph convolutional model for phishing node classification. Cybersecurity, 7:1. Online publication date: 1-Feb-2024.
https://doi.org/10.1186/s42400-023-00194-5
Jeng T, Chen C, Tsai Y, Chen Y. (2024). Comparison of Interaction Profiling Bipartite Graph Mining and Graph Neural Network for Malware-Control Domain Detection. Proceedings of the 2024 International Conference on Information Technology, Data Science, and Optimization, 12-19. Online publication date: 22-May-2024.
https://dl.acm.org/doi/10.1145/3658549.3658552
Kumar S Y, Mishra S, Singh R. (2024). Unleashing the Potential of Machine Learning and NLP Contextual Word Embedding for URL-Based Malicious Traffic Classification. 2024 IEEE 99th Vehicular Technology Conference (VTC2024-Spring), 1-5. Online publication date: 24-Jun-2024.
https://doi.org/10.1109/VTC2024-Spring62846.2024.10683482
Jiang P, Xiao J, Li D, Yu H, Bai Y, Guo Y, Chen X. (2024). Detecting Malicious Websites From the Perspective of System Provenance Analysis. IEEE Transactions on Dependable and Secure Computing, 21:3, 1406-1423. Online publication date: May-2024.
https://doi.org/10.1109/TDSC.2023.3277613
Lavanya, Kumaran R, Ashiq M, Kumar.K M, Vishal V. (2024). Phishing Site Detection using Machine Learning. 2024 International Conference on System, Computation, Automation and Networking (ICSCAN), 1-5. Online publication date: 27-Dec-2024.
https://doi.org/10.1109/ICSCAN62807.2024.10894084
Hamza A, Hammam F, Abouzeid M, Ahmed M, Dhou S, Aloul F. (2024). Malicious URL and Intrusion Detection using Machine Learning. 2024 International Conference on Information Networking (ICOIN), 795-800. Online publication date: 17-Jan-2024.
https://doi.org/10.1109/ICOIN59985.2024.10572207
