Abstract
Security specialists continue to develop and deploy countermeasures against security threats, a necessity given that the number of new threats keeps growing. In this chapter, we introduce an approach for identifying hidden security threats, using Uniform Resource Locators (URLs) as an example dataset, with a method that automatically detects malicious URLs by leveraging machine learning techniques. We demonstrate the effectiveness of the method through performance evaluations.
This chapter is based on Sun et al. (2016), which appeared in the IEICE Transactions on Information and Systems, Copyright (C) 2016 IEICE.
References
The high-interaction web client honeypot capture-hpc. https://github.com/honeynet/capture-hpc
The high-interaction web client honeypot pwnypot. https://github.com/shjalayeri/pwnypot
The low-interaction web client honeypot thug. https://github.com/buffer/thug
The low-interaction web client honeypot yalih. https://github.com/Masood-M/yalih
Akiyama M, Iwamura M, Kawakoya Y, Aoki K, Itoh M (2010) Design and implementation of high interaction client honeypot for drive-by-download attacks. IEICE Trans 93-B(5):1131–1139
Akiyama M, Yagi T, Itoh M (2011) Searching structural neighborhood of malicious URLs to improve blacklisting. In: 11th annual international symposium on applications and the internet, SAINT 2011, Munich, Germany, 18–21 July 2011, Proceedings, pp 1–10. http://doi.ieeecomputersociety.org/10.1109/SAINT.2011.11
Antonakakis M, Perdisci R, Dagon D, Lee W, Feamster N (2010) Building a dynamic reputation system for DNS. In: 19th USENIX security symposium, Washington, DC, USA, 11–13 August 2010, Proceedings, pp 273–290
Aoki K, Yagi T, Iwamura M, Itoh M (2011) Controlling malware HTTP communications in dynamic analysis system using search engine. In: Proceedings of the IEEE CSS, pp 1–6
Barabosch T, Wichmann A, Leder F, Gerhards-Padilla E (2012) Automatic extraction of domain name generation algorithms from current malware. In: Proceedings of the NATO symposium IST-111 on information assurance and cyber defence
Canali D, Cova M, Vigna G, Kruegel C (2011) Prophiler: a fast filter for the large-scale detection of malicious web pages. In: Proceedings of the WWW, pp 197–206
Chang C, Lin C (2011) LIBSVM: a library for support vector machines. ACM TIST 2(3):27:1–27:27
Chiba D, Tobe K, Mori T, Goto S (2012) Detecting malicious websites by learning IP address features. In: 12th IEEE/IPSJ international symposium on applications and the internet, SAINT 2012, Izmir, Turkey, 16–20 July 2012, pp 29–39. http://dx.doi.org/10.1109/SAINT.2012.14
Choi H, Zhu BB, Lee H (2011) Detecting malicious web links and identifying their attack types. In: Proceedings of the USENIX WebApps
Curtsinger C, Livshits B, Zorn BG, Seifert C (2011) ZOZZLE: fast and precise in-browser JavaScript malware detection. In: 20th USENIX security symposium, San Francisco, CA, USA, 8–12 August 2011, Proceedings
Eshete B, Villafiorita A, Weldemariam K (2012) Binspect: holistic analysis and detection of malicious web pages. In: Proceedings of the SecureComm, pp 149–166
Ghahramani Z, Heller KA (2005) Bayesian sets. In: Proceedings of the NIPS
Internetlivestats (2019) Google search statistics-internet live stats. http://www.internetlivestats.com/google-search-statistics/
Invernizzi L, Comparetti PM (2012) Evilseed: a guided approach to finding malicious web pages. In: Proceedings of the IEEE symposium on security and privacy, pp 428–442
Kaspersky Lab (2013) Kaspersky security bulletin 2013. https://report.kaspersky.com
Ma J, Saul LK, Savage S, Voelker GM (2009) Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In: Proceedings of the KDD, pp 1245–1254
Mowbray M, Hagen J (2014) Finding domain-generation algorithms by looking at length distribution. In: 25th IEEE international symposium on software reliability engineering workshops, ISSRE Workshops, Naples, Italy, 3–6 November 2014, pp 395–400
Schiavoni S, Maggi F, Cavallaro L, Zanero S (2014) Phoenix: DGA-based botnet tracking and intelligence. In: 11th International conference on detection of intrusions and malware, and vulnerability assessment, DIMVA 2014, Egham, UK, 10–11 July 2014, Proceedings, pp 192–211
Spooren J, Preuveneers D, Desmet L, Janssen P, Joosen W (2019) Detection of algorithmically generated domain names used by botnets: a dual arms race. In: Proceedings of the 34th ACM/SIGAPP symposium on applied computing, SAC 2019, Limassol, Cyprus, 8–12 April 2019, pp 1916–1923
Sun B, Akiyama M, Yagi T, Hatada M, Mori T (2016) Automating URL blacklist generation with similarity search approach. IEICE Trans 99-D(4):873–882
Xu W, Sanders K, Zhang Y (2014) We know it before you do: predicting malicious domains. In: Proceedings of the 24th virus bulletin conference (VB2014)
Xu L, Zhan Z, Xu S, Ye K (2013) Cross-layer detection of malicious websites. In: Proceedings of the CODASPY, pp 141–152
Yadav S, Reddy AKK, Reddy ALN, Ranjan S (2010) Detecting algorithmically generated malicious domain names. In: Proceedings of the 10th ACM SIGCOMM Internet measurement conference, IMC 2010, Melbourne, Australia, 1–3 November 2010, pp 48–61
3.9 Appendix
The HTML content of some malicious URLs is shown below; these URLs form the query patterns for Bayesian Sets. Sensitive information such as hostnames is hidden for privacy reasons. Figures 3.4 and 3.5 show partial HTML content for two URLs in query pattern 1. Obfuscated JavaScript code occurs in both cases, which is why we combine these two URLs into one pattern. Figure 3.6 shows the HTML content of a detected URL; this content is considerably similar to that of the queries above.
On the other hand, Figs. 3.7 and 3.8 give the HTML content of two URLs queried in pattern 2. Here, intrinsic embed and object tags can be found in both cases, which implies that they are likely to be landing pages for drive-by-download attacks. For one of the detection results obtained from this query pattern, the HTML presented in Fig. 3.9 shows characteristics similar to those in Figs. 3.7 and 3.8.
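To make the query-pattern idea concrete, the following is a minimal sketch of Bayesian Sets scoring over binary features, following the general formulation of Ghahramani and Heller (2005). It is not the chapter's implementation: the feature extractor `html_features`, its three substring checks, the toy HTML snippets, and the hyperparameter scale `c = 2` are all assumptions chosen for illustration. Two pages containing obfuscation-style JavaScript form the query (analogous to query pattern 1), and the remaining pages are ranked by their log score.

```python
import math


def html_features(html):
    """Crude, hypothetical binary features of an HTML page (illustration only)."""
    text = html.lower()
    return [
        int("eval(" in text or "unescape(" in text),  # rough proxy for obfuscated JavaScript
        int("<embed" in text),                         # embed tag present
        int("<object" in text),                        # object tag present
    ]


def bayesian_sets_scores(items, query_idx, c=2.0, eps=1e-6):
    """Return the Bayesian Sets log score of every item given the query items.

    items: list of equal-length 0/1 feature vectors.
    query_idx: indices of the items that form the query set.
    c: scale of the Beta prior, with alpha_j = c * mean_j and beta_j = c * (1 - mean_j).
    """
    n_items = len(items)
    n_query = len(query_idx)
    const = 0.0
    weights = []
    for j in range(len(items[0])):
        # Prior set from the empirical feature mean, clipped to avoid log(0).
        mean_j = sum(row[j] for row in items) / n_items
        mean_j = min(max(mean_j, eps), 1.0 - eps)
        alpha = c * mean_j
        beta = c * (1.0 - mean_j)
        # Posterior parameters after observing the query set.
        on = sum(items[i][j] for i in query_idx)
        alpha_t = alpha + on
        beta_t = beta + n_query - on
        const += (math.log(alpha + beta) - math.log(alpha + beta + n_query)
                  + math.log(beta_t) - math.log(beta))
        weights.append(math.log(alpha_t) - math.log(alpha)
                       - math.log(beta_t) + math.log(beta))
    # The log score is linear in the binary features.
    return [const + sum(w * x for w, x in zip(weights, row)) for row in items]


# Toy pages (made up for this sketch, not taken from the chapter's figures).
pages = [
    "<script>eval(unescape('%61'))</script>",             # query page
    "<script>var s=unescape('%62');eval(s)</script>",     # query page
    "<script>eval(x)</script>",                           # candidate with similar script
    "<embed src='a.swf'><object data='a.pdf'></object>",  # candidate with embed/object tags
    "<object data='b.swf'></object><embed src='b.swf'>",  # candidate with embed/object tags
    "<p>hello</p>",                                       # plain page
    "<h1>news</h1>",                                      # plain page
]
features = [html_features(p) for p in pages]
query = [0, 1]
scores = bayesian_sets_scores(features, query)

# Rank the non-query pages from most to least similar to the query set.
ranked = sorted((i for i in range(len(pages)) if i not in query),
                key=lambda i: scores[i], reverse=True)
print(ranked)
```

Because the log score is a constant plus a weighted sum of the binary features, ranking a large candidate set reduces to a single sparse matrix-vector product, which is what makes this kind of similarity search practical at scale. Swapping the query indices for the two embed/object pages would give the analogue of query pattern 2.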
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Sun, B., Takahashi, T., Zhu, L., Mori, T. (2020). Discovering Malicious URLs Using Machine Learning Techniques. In: Sikos, L., Choo, KK. (eds) Data Science in Cybersecurity and Cyberthreat Intelligence. Intelligent Systems Reference Library, vol 177. Springer, Cham. https://doi.org/10.1007/978-3-030-38788-4_3
DOI: https://doi.org/10.1007/978-3-030-38788-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38787-7
Online ISBN: 978-3-030-38788-4
eBook Packages: Intelligent Technologies and Robotics (R0)