Identifying suspicious URLs: an application of large-scale online learning

Authors:

Lawrence K. Saul, Geoffrey M. Voelker

ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning

Pages 681 - 688

https://doi.org/10.1145/1553374.1553462

Published: 14 June 2009

Abstract

This paper explores online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. We show that this application is particularly appropriate for online algorithms as the size of the training data is larger than can be efficiently processed in batch and because the distribution of features that typify malicious URLs is changing continuously. Using a real-time system we developed for gathering URL features, combined with a real-time source of labeled URLs from a large Web mail provider, we demonstrate that recently-developed online algorithms can be as accurate as batch techniques, achieving classification accuracies up to 99% over a balanced data set.
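
To make the online setting concrete, here is a minimal sketch of one pass over a stream of labeled URLs, predicting each label before updating on it. The abstract does not say which online algorithms were evaluated, so the passive-aggressive (PA-I) update used here is an assumption chosen for illustration, not the authors' method. The `lexical_features` tokenizer, the class and function names, and the example URLs are likewise hypothetical, and host-based features are omitted entirely.

```python
import re


def lexical_features(url):
    """Hypothetical bag-of-words featurization: binary tokens from host and path."""
    # Drop the scheme (e.g. "http://") if present, then split host from path.
    rest = re.sub(r"^[a-z][a-z0-9+.\-]*://", "", url.lower())
    host, _, path = rest.partition("/")
    feats = {"bias": 1.0}
    for tok in re.split(r"[.\-]", host):
        if tok:
            feats["host:" + tok] = 1.0
    for tok in re.split(r"[/?.=\-_&]", path):
        if tok:
            feats["path:" + tok] = 1.0
    return feats


class PassiveAggressive:
    """Sparse linear classifier trained with the PA-I update (an assumed choice)."""

    def __init__(self, C=1.0):
        self.C = C      # cap on the step size (aggressiveness)
        self.w = {}     # sparse weight vector: feature name -> weight

    def score(self, feats):
        return sum(self.w.get(f, 0.0) * v for f, v in feats.items())

    def predict(self, feats):
        # +1 = malicious, -1 = benign (labeling convention assumed here)
        return 1 if self.score(feats) >= 0 else -1

    def update(self, feats, y):
        # Hinge loss; the weights change only when the margin is below 1.
        loss = max(0.0, 1.0 - y * self.score(feats))
        if loss > 0.0:
            norm_sq = sum(v * v for v in feats.values())  # > 0 because of the bias
            tau = min(self.C, loss / norm_sq)
            for f, v in feats.items():
                self.w[f] = self.w.get(f, 0.0) + tau * y * v
        return loss


def run_stream(stream, model):
    """Predict each URL before training on it; return the cumulative mistake count."""
    mistakes = 0
    for url, y in stream:
        feats = lexical_features(url)
        if model.predict(feats) != y:
            mistakes += 1
        model.update(feats, y)
    return mistakes
```

Each example is processed once and then discarded, so memory grows with the number of distinct features rather than with the number of URLs seen, and the weights keep adapting as the feature distribution drifts. Those are the two properties the abstract cites as reasons to prefer online over batch training.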

Index Terms

Identifying suspicious URLs: an application of large-scale online learning
1. Computing methodologies
  1. Machine learning
  2. Modeling and simulation
    1. Model development and analysis
      1. Model verification and validation
      2. Modeling methodologies
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction
  2. Information storage systems

Published In

ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning

June 2009

1331 pages

ISBN:9781605585161

DOI:10.1145/1553374

General Chair: Andrea Danyluk (Williams College)
Program Chairs: Léon Bottou (NEC Laboratories America), Michael Littman (Rutgers University)

Copyright © 2009 by the author(s)/owner(s).

Sponsors

NSF
Microsoft Research
MITACS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

ICML '09

Sponsor:

Microsoft Research

ICML '09: The 26th Annual International Conference on Machine Learning held in conjunction with the 2007 International Conference on Inductive Logic Programming

June 14 - 18, 2009

Montreal, Quebec, Canada

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%

Bibliometrics

Total Citations: 326
Total Downloads: 1,948
Downloads (Last 12 months): 69
Downloads (Last 6 weeks): 5

Reflects downloads up to 03 Mar 2025

Cited By

Sun, L., & Barbu, A. (2025). Stochastic feature selection with annealing and its applications to streaming data. Journal of Nonparametric Statistics, pp. 1-18. Online publication date: 30-Jan-2025. https://doi.org/10.1080/10485252.2025.2456767

Akar, F. (2024). Data correlation matrix-based spam URL detection using machine learning algorithms. Journal of Scientific Reports-A, pp. 56-69. Online publication date: 31-Mar-2024. https://doi.org/10.59313/jsr-a.1422913

Dasgupta, A., & Mehr, S. (2024). Enhanced MNB Method for SPAM E-mail/SMS Text Detection Using TF-IDF Vectorizer. American Journal of Mathematical and Computer Modelling, 9:1, pp. 1-8. Online publication date: 28-Apr-2024. https://doi.org/10.11648/j.ajmcm.20240901.11

Guo, H., Zhang, Y., & Wang, W. (2024). Dynamical Targeted Ensemble Learning for Streaming Data With Concept Drift. IEEE Transactions on Knowledge and Data Engineering, 36:12, pp. 8023-8036. Online publication date: Dec-2024. https://doi.org/10.1109/TKDE.2024.3460404

Polyzos, K., Lu, Q., & Giannakis, G. (2024). Active labeling for online ensemble learning. 2024 IEEE 13rd Sensor Array and Multichannel Signal Processing Workshop (SAM), pp. 1-5. Online publication date: 8-Jul-2024. https://doi.org/10.1109/SAM60225.2024.10636477

Mehr, S., Basak, S., & Dasgupta, A. (2024). Enhanced KNN Method for Malicious URL Detection using GCL (Google Index, Counting the number of characters and Length of URL) Extraction Technique. 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp. 496-501. Online publication date: 16-Dec-2024. https://doi.org/10.1109/MCSoC64144.2024.00087

Nezarat, M., Khedersolh, E., & Shahhoseini, H. (2024). Classifying Benign and Malicious Websites through URL Feature Extraction Using Transfer Learning. 2024 11th International Symposium on Telecommunications (IST), pp. 340-344. Online publication date: 9-Oct-2024. https://doi.org/10.1109/IST64061.2024.10843636

Hartpence, B., Johnson, D., Stackpole, B., & Kwasinski, A. (2024). RIT Network and Security Dataset Collections. 2024 International Symposium on Networks, Computers and Communications (ISNCC), pp. 1-6. Online publication date: 22-Oct-2024. https://doi.org/10.1109/ISNCC62547.2024.10759040

Hamza, A., Hammam, F., Abouzeid, M., Ahmed, M., Dhou, S., & Aloul, F. (2024). Malicious URL and Intrusion Detection using Machine Learning. 2024 International Conference on Information Networking (ICOIN), pp. 795-800. Online publication date: 17-Jan-2024. https://doi.org/10.1109/ICOIN59985.2024.10572207

Alajlouni, M., Tashtoush, Y., Dossari, S., & Darwish, O. (2024). Optimizing Phishing URL Detection: A Comparative Analysis of Feature Counts and Training Times. 2024 International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS), pp. 1-7. Online publication date: 24-Sep-2024. https://doi.org/10.1109/ICCNS62192.2024.10776210
