Abstract
Many cyberattacks start with phishing, which lures victims to malicious web pages where malware code is hidden. Once a victim's machine is infected, the attacker can intrude into the enterprise network, evading firewalls. It is therefore of fundamental importance to detect phishing URLs and prevent employees from visiting them. Many machine learning methods have been proposed so far. In this work, we collect many lexical features identified through a literature survey and combine them with blacklisted domains to improve detection performance. We also collect many recent phishing URLs because most open datasets are outdated. Our method achieves an F1 score of 0.84.
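As a rough illustration of the idea of combining lexical features with blacklisted domains, the sketch below computes a few string-level features from a URL and appends a binary flag indicating whether its host falls under a blacklisted domain. The feature set, the function names (`lexical_features`, `blacklist_feature`, `featurize`), and the `BLACKLISTED_DOMAINS` entries are all assumptions made for illustration; the abstract does not specify which features or blacklist the chapter uses, so this should not be read as the authors' implementation.

```python
from urllib.parse import urlparse

# Hypothetical blacklist for illustration only; a real system would load
# known phishing domains from an external feed.
BLACKLISTED_DOMAINS = {"bad-login.example", "phish.example"}


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from the URL string alone."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": int("@" in url),
        # Number of non-empty path segments, e.g. /a/b -> 2
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
    }


def blacklist_feature(url: str, blacklist=BLACKLISTED_DOMAINS) -> int:
    """Return 1 if the host or any of its parent domains is blacklisted."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # All dot-separated suffixes of the host: a.b.c -> {a.b.c, b.c, c}
    suffixes = {".".join(labels[i:]) for i in range(len(labels))}
    return int(bool(suffixes & set(blacklist)))


def featurize(url: str) -> dict:
    """Combine lexical features with the blacklist flag into one feature dict."""
    feats = lexical_features(url)
    feats["domain_blacklisted"] = blacklist_feature(url)
    return feats


if __name__ == "__main__":
    print(featurize("http://secure.bad-login.example/account/verify?id=123"))
```

A feature dictionary of this kind could then be passed to any standard classifier; which classifier the chapter uses is not stated in the abstract.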
Notes
1. Jiwon Hong, Taeri Kim, and Jing Liu are listed in alphabetical order and contributed equally. Noseong Park and Sang-Wook Kim are the co-corresponding authors.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Hong, J., Kim, T., Liu, J., Park, N., Kim, SW. (2020). Phishing URL Detection with Lexical Features and Blacklisted Domains. In: Jajodia, S., Cybenko, G., Subrahmanian, V., Swarup, V., Wang, C., Wellman, M. (eds) Adaptive Autonomous Secure Cyber Systems. Springer, Cham. https://doi.org/10.1007/978-3-030-33432-1_12
DOI: https://doi.org/10.1007/978-3-030-33432-1_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33431-4
Online ISBN: 978-3-030-33432-1
eBook Packages: Computer Science, Computer Science (R0)