Discovering Malicious URLs Using Machine Learning Techniques

Sun, Bo; Takahashi, Takeshi; Zhu, Lei; Mori, Tatsuya

doi:10.1007/978-3-030-38788-4_3

Chapter
First Online: 06 February 2020

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 177)

Abstract

Security specialists have developed and deployed many countermeasures against security threats, a necessity because the number of new threats keeps growing. In this chapter, we introduce an approach for identifying hidden security threats, using Uniform Resource Locators (URLs) as an example dataset, with a method that automatically detects malicious URLs by leveraging machine learning techniques. We demonstrate the effectiveness of the method through performance evaluations.



Author information

Authors and Affiliations

National Institute of Information and Communications Technology, Tokyo, Japan
Bo Sun, Takeshi Takahashi & Lei Zhu
Waseda University, Tokyo, Japan
Tatsuya Mori

Corresponding author

Correspondence to Bo Sun.

Editor information

Editors and Affiliations

School of Science, Edith Cowan University, Joondalup, WA, Australia
Leslie F. Sikos
Department of Information Systems and Security, University of Texas at San Antonio, San Antonio, TX, USA
Kim-Kwang Raymond Choo

3.9 Appendix

The HTML content of some malicious URLs is shown below; these URLs form the patterns used for Bayesian Sets querying. Sensitive information such as hostnames is hidden for privacy reasons. Figures 3.4 and 3.5 show partial HTML content for the two URLs in query pattern 1. Obfuscated JavaScript code occurs in both cases, which is why we combine these two URLs into one pattern. Figure 3.6 shows the HTML content of a detected URL; this content is considerably similar to that of the queries above.

On the other hand, Figs. 3.7 and 3.8 give the HTML content of the two URLs queried in pattern 2. Here, intrinsic embed and object tags can be found in both cases, which implies that these URLs are likely to be landing pages for drive-by-download attacks. For one of the detection results obtained from this query pattern, the HTML presented in Fig. 3.9 shows characteristics similar to those in Figs. 3.7 and 3.8.
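
As a rough illustration of how query patterns like these might be used, the sketch below ranks pages with the Bayesian Sets score for binary features (Ghahramani and Heller, 2005). The four features, the regular expressions, the helper names extract_features and bayesian_sets_scores, and the prior strength kappa = 2 are all assumptions made for illustration; the feature set and parameters actually used in the chapter are not shown in this appendix.

```python
import math
import re
from html.parser import HTMLParser

# Hypothetical binary features, chosen only to mirror the two query patterns.
FEATURES = ("has_embed", "has_object", "has_eval", "has_unescape")


class _TagCollector(HTMLParser):
    """Collects the tag names and inline script text of an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = set()
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_text.append(data)


def extract_features(html):
    """Map an HTML string to a 0/1 vector ordered like FEATURES."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    script = " ".join(collector.script_text)
    return [
        int("embed" in collector.tags),
        int("object" in collector.tags),
        int(re.search(r"\beval\s*\(", script) is not None),
        int(re.search(r"\bunescape\s*\(", script) is not None),
    ]


def bayesian_sets_scores(items, query_idx, kappa=2.0):
    """Log Bayesian Sets score of every item given the query items.

    items: list of equal-length 0/1 vectors; query_idx: indices of the query.
    The Beta prior per feature is alpha = kappa * mean, beta = kappa * (1 - mean).
    """
    n_items = len(items)
    n_query = len(query_idx)
    scores = [0.0] * n_items
    for j in range(len(items[0])):
        mean = sum(row[j] for row in items) / n_items
        mean = min(max(mean, 1e-6), 1.0 - 1e-6)  # keep the prior well defined
        alpha, beta = kappa * mean, kappa * (1.0 - mean)
        hits = sum(items[i][j] for i in query_idx)
        alpha_post = alpha + hits
        beta_post = beta + n_query - hits
        # Constant term and per-feature weight of the log score.
        c_j = (math.log(alpha + beta) - math.log(alpha + beta + n_query)
               + math.log(beta_post) - math.log(beta))
        q_j = (math.log(alpha_post) - math.log(alpha)
               - math.log(beta_post) + math.log(beta))
        for i in range(n_items):
            scores[i] += c_j + q_j * items[i][j]
    return scores


if __name__ == "__main__":
    pages = [
        '<script>eval(unescape("%61%62"))</script>',           # query
        '<script>eval(unescape("%63%64"))</script>',           # query
        '<script>var s = unescape("%65"); eval(s);</script>',  # candidate
        "<p>hello</p>",                                        # benign
    ]
    vectors = [extract_features(p) for p in pages]
    for page, score in zip(pages, bayesian_sets_scores(vectors, [0, 1])):
        print(round(score, 3), page[:40])
```

Pages that share the query's features receive higher scores, so sorting candidates by score yields a ranked list for further inspection. Real obfuscated pages would need far richer features than these four flags.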

Cite this chapter

Sun, B., Takahashi, T., Zhu, L., Mori, T. (2020). Discovering Malicious URLs Using Machine Learning Techniques. In: Sikos, L., Choo, KK. (eds) Data Science in Cybersecurity and Cyberthreat Intelligence. Intelligent Systems Reference Library, vol 177. Springer, Cham. https://doi.org/10.1007/978-3-030-38788-4_3

DOI: https://doi.org/10.1007/978-3-030-38788-4_3
Published: 06 February 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38787-7
Online ISBN: 978-3-030-38788-4
eBook Packages: Intelligent Technologies and Robotics (R0)
