DOI: 10.1145/1553374.1553462

Identifying suspicious URLs: an application of large-scale online learning

Published: 14 June 2009

Abstract

This paper explores online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. We show that this application is particularly appropriate for online algorithms because the training data is larger than can be efficiently processed in batch and because the distribution of features that typify malicious URLs changes continuously. Using a real-time system we developed for gathering URL features, combined with a real-time source of labeled URLs from a large Web mail provider, we demonstrate that recently developed online algorithms can be as accurate as batch techniques, achieving classification accuracies up to 99% over a balanced data set.
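To make the idea of online learning over URL features concrete, the following is a minimal sketch, not the system described in the abstract. It assumes a simple bag-of-tokens encoding of the URL's hostname and path (lexical features only; host-based features would require external lookups and are omitted), and it uses a basic Passive-Aggressive update as one example of an online linear classifier. The abstract does not say which algorithms or feature encodings were used, so the function names, the tokenisation rule, and the toy URLs below are all hypothetical.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url):
    """Map a URL to a sparse binary bag-of-tokens dict (hypothetical scheme)."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    feats = {"bias": 1.0}
    # Hostname tokens, split on dots and hyphens.
    for tok in re.split(r"[.\-]", parts.hostname or ""):
        if tok:
            feats["host:" + tok] = 1.0
    # Path and query tokens, split on common URL delimiters.
    for tok in re.split(r"[/?=&.\-_]", parts.path + "?" + parts.query):
        if tok:
            feats["path:" + tok] = 1.0
    return feats


class PassiveAggressive:
    """Online linear classifier with the basic Passive-Aggressive update."""

    def __init__(self):
        self.w = {}  # sparse weight vector: feature name -> weight

    def score(self, feats):
        return sum(self.w.get(k, 0.0) * v for k, v in feats.items())

    def predict(self, feats):
        return 1 if self.score(feats) >= 0 else -1

    def update(self, feats, y):
        """Take one step on example (feats, y), with y in {-1, +1}."""
        loss = max(0.0, 1.0 - y * self.score(feats))  # hinge loss
        if loss > 0.0:
            # Smallest step that brings the margin on this example to 1.
            tau = loss / sum(v * v for v in feats.values())
            for k, v in feats.items():
                self.w[k] = self.w.get(k, 0.0) + tau * y * v
        return loss


def run_stream(model, stream):
    """Predict first, then learn from the revealed label; return mistake count."""
    mistakes = 0
    for url, label in stream:
        feats = lexical_features(url)
        if model.predict(feats) != label:
            mistakes += 1
        model.update(feats, label)
    return mistakes


# Toy, made-up stream: +1 = malicious, -1 = benign. Not real data.
toy_stream = [
    ("http://example.com/docs/index.html", -1),
    ("http://secure-login.example.net/account/update.php", +1),
    ("http://example.org/news/today", -1),
    ("http://account-update.example.net/login.php?id=1", +1),
]
print("mistakes on toy stream:", run_stream(PassiveAggressive(), toy_stream))
```

Each example is processed once and then discarded, so memory grows with the number of distinct features rather than with the number of URLs seen. That property is what makes this style of learner a natural fit when the data is too large for batch training and the feature distribution keeps shifting.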


Published In

ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning
June 2009
1331 pages
ISBN: 9781605585161
DOI: 10.1145/1553374

Sponsors

  • NSF
  • Microsoft Research
  • MITACS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

ICML '09
Sponsor:
  • Microsoft Research

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%
