DOI: 10.1145/1557019.1557153
research-article

Beyond blacklists: learning to detect malicious web sites from suspicious URLs

Published: 28 June 2009

Abstract

Malicious Web sites are a cornerstone of Internet criminal activities. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. In this paper, we describe an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs. These methods are able to learn highly predictive models by extracting and automatically analyzing tens of thousands of features potentially indicative of suspicious URLs. The resulting classifiers obtain 95-99% accuracy, detecting large numbers of malicious Web sites from their URLs, with only modest false positives.
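
As a rough illustration of what extracting lexical properties from a URL might look like, the sketch below turns a URL into a sparse dictionary of features. The abstract does not list the actual features, so the delimiter set, the feature names (`url_length`, `host_tok=...`, `path_tok=...`) and the function `lexical_features` are all assumptions made for this example, not the authors' implementation. Host-based properties are not covered here.

```python
import re
from urllib.parse import urlsplit

# Hypothetical delimiter set for tokenizing; the abstract does not specify one.
_DELIMS = re.compile(r"[./?=&\-_]+")


def lexical_features(url):
    """Map a URL string to a sparse dict of illustrative lexical features."""
    # urlsplit only recognizes the host when a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    path = parts.path + ("?" + parts.query if parts.query else "")

    feats = {
        "url_length": float(len(url)),
        "host_length": float(len(host)),
        "host_dots": float(host.count(".")),
    }
    # Binary bag-of-words features, kept separate for hostname and path tokens.
    for tok in filter(None, _DELIMS.split(host)):
        feats["host_tok=" + tok] = 1.0
    for tok in filter(None, _DELIMS.split(path.lower())):
        feats["path_tok=" + tok] = 1.0
    return feats


if __name__ == "__main__":
    example = "http://www.example.com/login/update.php?id=1"
    for name, value in sorted(lexical_features(example).items()):
        print(name, value)
```

Because every distinct token becomes its own feature, a large URL corpus can plausibly produce feature counts in the tens of thousands, which is the scale the abstract mentions. Sparse dictionaries like these could then be passed to a linear classifier; the training setup itself is not described on this page.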

      Published In

      KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
      June 2009
      1426 pages
      ISBN:9781605584959
      DOI:10.1145/1557019

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. L1-regularization
      2. malicious web sites
      3. supervised learning

      Qualifiers

      • Research-article

      Conference

      KDD09

      Acceptance Rates

      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
