Discovering Malicious URLs Using Machine Learning Techniques

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 177)

Abstract

Security specialists have developed and deployed many countermeasures against security threats, a necessity because the number of new threats keeps growing. In this chapter, we introduce an approach for identifying hidden security threats, using Uniform Resource Locators (URLs) as an example dataset, with a method that automatically detects malicious URLs by leveraging machine learning techniques. We demonstrate the effectiveness of the method through performance evaluations.

This chapter is based on Sun et al. (2016), which appeared in the IEICE Transactions on Information and Systems, Copyright (C) 2016 IEICE.

Notes

  1. https://safebrowsing.google.com/
  2. http://nutch.apache.org
  3. https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
  4. https://scrapy.org/
  5. https://www.malwaredomainlist.com/mdl.php
  6. https://www.phishtank.com
  7. https://www.virustotal.com
  8. http://www.alexa.com
  9. https://www.dnsdb.info
  10. https://www.riskiq.com/products/passivetotal/
  11. https://www.circl.lu/services/passive-dns/
  12. https://www.google.com
  13. https://www.bing.com
  14. https://yandex.com
  15. https://github.com/NikolaiT/GoogleScraper
  16. http://scikit-learn.org
  17. http://www.hping.org/hping3.html
  18. https://zmap.io/
  19. https://www.unbound.net
  20. https://nutch.apache.org
  21. https://www.mysql.com

References

  • The high-interaction web client honeypot capture-hpc. https://github.com/honeynet/capture-hpc

  • The high-interaction web client honeypot pwnypot. https://github.com/shjalayeri/pwnypot

  • The low-interaction web client honeypot thug. https://github.com/buffer/thug

  • The low-interaction web client honeypot yalih. https://github.com/Masood-M/yalih

  • Akiyama M, Iwamura M, Kawakoya Y, Aoki K, Itoh M (2010) Design and implementation of high interaction client honeypot for drive-by-download attacks. IEICE Trans 93-B(5):1131–1139

  • Akiyama M, Yagi T, Itoh M (2011) Searching structural neighborhood of malicious URLs to improve blacklisting. In: 11th annual international symposium on applications and the internet, SAINT 2011, Munich, Germany, 18–21 July 2011, Proceedings, pp 1–10. http://doi.ieeecomputersociety.org/10.1109/SAINT.2011.11

  • Antonakakis M, Perdisci R, Dagon D, Lee W, Feamster N (2010) Building a dynamic reputation system for DNS. In: 19th USENIX security symposium, Washington, DC, USA, 11–13 August 2010, Proceedings, pp 273–290

  • Aoki K, Yagi T, Iwamura M, Itoh M (2011) Controlling malware HTTP communications in dynamic analysis system using search engine. In: Proceedings of the IEEE CSS, pp 1–6

  • Barabosch T, Wichmann A, Leder F, Gerhards-Padilla E (2012) Automatic extraction of domain name generation algorithms from current malware. In: Proceedings of the NATO symposium IST-111 on information assurance and cyber defence

  • Canali D, Cova M, Vigna G, Kruegel C (2011) Prophiler: a fast filter for the large-scale detection of malicious web pages. In: Proceedings of the WWW, pp 197–206

  • Chang C, Lin C (2011) LIBSVM: a library for support vector machines. ACM TIST 2(3):27:1–27:27

  • Chiba D, Tobe K, Mori T, Goto S (2012) Detecting malicious websites by learning IP address features. In: 12th IEEE/IPSJ international symposium on applications and the internet, SAINT 2012, Izmir, Turkey, 16–20 July 2012, pp 29–39. http://dx.doi.org/10.1109/SAINT.2012.14

  • Choi H, Zhu BB, Lee H (2011) Detecting malicious web links and identifying their attack types. In: Proceedings of the USENIX WebApps

  • Curtsinger C, Livshits B, Zorn BG, Seifert C (2011) ZOZZLE: fast and precise in-browser JavaScript malware detection. In: 20th USENIX security symposium, San Francisco, CA, USA, 8–12 August 2011, Proceedings

  • Eshete B, Villafiorita A, Weldemariam K (2012) Binspect: holistic analysis and detection of malicious web pages. In: Proceedings of the SecureComm, pp 149–166

  • Ghahramani Z, Heller KA (2005) Bayesian sets. In: Proceedings of the NIPS

  • Internetlivestats (2019) Google search statistics-internet live stats. http://www.internetlivestats.com/google-search-statistics/

  • Invernizzi L, Comparetti PM (2012) Evilseed: a guided approach to finding malicious web pages. In: Proceedings of the IEEE symposium on security and privacy, pp 428–442

  • Kaspersky Lab (2013) Kaspersky security bulletin 2013. https://report.kaspersky.com

  • Ma J, Saul LK, Savage S, Voelker GM (2009) Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In: Proceedings of the KDD, pp 1245–1254

  • Mowbray M, Hagen J (2014) Finding domain-generation algorithms by looking at length distribution. In: 25th IEEE international symposium on software reliability engineering workshops, ISSRE Workshops, Naples, Italy, 3–6 November 2014, pp 395–400

  • Schiavoni S, Maggi F, Cavallaro L, Zanero S (2014) Phoenix: DGA-based botnet tracking and intelligence. In: 11th International conference on detection of intrusions and malware, and vulnerability assessment, DIMVA 2014, Egham, UK, 10–11 July 2014, Proceedings, pp 192–211

  • Spooren J, Preuveneers D, Desmet L, Janssen P, Joosen W (2019) Detection of algorithmically generated domain names used by botnets: a dual arms race. In: Proceedings of the 34th ACM/SIGAPP symposium on applied computing, SAC 2019, Limassol, Cyprus, 8–12 April 2019, pp 1916–1923

  • Sun B, Akiyama M, Yagi T, Hatada M, Mori T (2016) Automating URL blacklist generation with similarity search approach. IEICE Trans 99-D(4):873–882

  • Xu W, Sanders K, Zhang Y (2014) We know it before you do: predicting malicious domains. In: Proceedings of the 24th virus bulletin conference (VB2014)

  • Xu L, Zhan Z, Xu S, Ye K (2013) Cross-layer detection of malicious websites. In: Proceedings of the CODASPY, pp 141–152

  • Yadav S, Reddy AKK, Reddy ALN, Ranjan S (2010) Detecting algorithmically generated malicious domain names. In: Proceedings of the 10th ACM SIGCOMM Internet measurement conference, IMC 2010, Melbourne, Australia, 1–3 November 2010, pp 48–61

Author information

Corresponding author

Correspondence to Bo Sun.

3.9 Appendix

The HTML content of some malicious URLs is shown below; these URLs form the query patterns for Bayesian Sets. Sensitive information such as hostnames is hidden for privacy reasons. Figures 3.4 and 3.5 show partial HTML content for the two URLs in query pattern 1. Obfuscated JavaScript code clearly occurs in both cases, which is why we combine these two URLs into one pattern. Figure 3.6 shows the HTML content of a URL detected with this pattern; its content is considerably similar to that of the queries above.
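As a rough illustration of how a query pattern can rank candidate URLs, the sketch below computes the Bayesian Sets log-score for binary features, following the closed form given by Ghahramani and Heller (2005). The toy feature matrix, the prior strength c = 2, and the function name are assumptions made for this example; they are not the features or settings used in this chapter.

```python
import math

def bayesian_sets_scores(items, query_idx, c=2.0):
    """Log Bayesian Sets score of every item, given the indices of the query items.

    items     : list of equal-length binary feature vectors (0/1 values)
    query_idx : indices of the items that form the query pattern
    c         : prior strength (assumed value; alpha = c*mean, beta = c*(1-mean))
    """
    n_items, n_feats, n_query = len(items), len(items[0]), len(query_idx)
    scores = [0.0] * n_items
    for j in range(n_feats):
        # Empirical mean of feature j, clamped so the Beta prior stays proper.
        mean_j = sum(row[j] for row in items) / n_items
        mean_j = min(max(mean_j, 1e-6), 1.0 - 1e-6)
        alpha, beta = c * mean_j, c * (1.0 - mean_j)
        # Posterior hyperparameters after observing the query set.
        hits = sum(items[i][j] for i in query_idx)
        alpha_t, beta_t = alpha + hits, beta + n_query - hits
        # Per-feature constant and weight of the linear log-score.
        const = (math.log(alpha + beta) - math.log(alpha + beta + n_query)
                 + math.log(beta_t) - math.log(beta))
        weight = (math.log(alpha_t) - math.log(alpha)
                  - math.log(beta_t) + math.log(beta))
        for i in range(n_items):
            scores[i] += const + weight * items[i][j]
    return scores

# Hypothetical toy data: rows 0 and 1 are the query URLs, row 2 shares
# their features, rows 3 and 4 do not.
toy_items = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
toy_scores = bayesian_sets_scores(toy_items, query_idx=[0, 1])
```

Because the log-score is linear in the item's features, ranking all candidates against one query pattern costs a single pass over the feature matrix, which is what makes this kind of similarity search practical on large URL collections.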

On the other hand, Figs. 3.7 and 3.8 give the HTML content of the two URLs queried in pattern 2. Here, intrinsic embed and object tags can be found in both cases, which implies that they are likely to be landing pages for drive-by-download attacks. For one of the detection results obtained from this query pattern, the HTML presented in Fig. 3.9 shows characteristics similar to those in Figs. 3.7 and 3.8.
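The characteristics discussed in this appendix (embed and object tags, obfuscated JavaScript) can be turned into binary features with a few lines of code. The following is a minimal sketch using Python's standard html.parser module; the three features and the list of suspicious JavaScript tokens are hypothetical choices for illustration and are much cruder than a real obfuscation detector.

```python
from html.parser import HTMLParser

class TagFeatureParser(HTMLParser):
    """Collects the set of tag names seen and the text inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.tags = set()
        self.in_script = False
        self.script_text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        self.tags.add(tag)
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_text.append(data)

# Hypothetical token list: calls that often appear in obfuscated scripts.
SUSPICIOUS_TOKENS = ("eval(", "unescape(", "fromCharCode")

def html_to_features(html):
    """Return [has_embed, has_object, has_suspicious_script] as 0/1 values."""
    parser = TagFeatureParser()
    parser.feed(html)
    parser.close()
    script = "".join(parser.script_text)
    return [
        int("embed" in parser.tags),
        int("object" in parser.tags),
        int(any(token in script for token in SUSPICIOUS_TOKENS)),
    ]
```

Vectors of this kind, one per URL, are the sort of binary input that a set-expansion method such as Bayesian Sets operates on.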

Fig. 3.4 HTML content of query URL 1 (pattern 1)

Fig. 3.5 HTML content of query URL 2 (pattern 1)

Fig. 3.6 HTML content of detected URL (pattern 1)

Fig. 3.7 HTML content of query URL 1 (pattern 2)

Fig. 3.8 HTML content of query URL 2 (pattern 2)

Fig. 3.9 HTML content of detected URL (pattern 2)

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sun, B., Takahashi, T., Zhu, L., Mori, T. (2020). Discovering Malicious URLs Using Machine Learning Techniques. In: Sikos, L., Choo, KK. (eds) Data Science in Cybersecurity and Cyberthreat Intelligence. Intelligent Systems Reference Library, vol 177. Springer, Cham. https://doi.org/10.1007/978-3-030-38788-4_3
