Phishing URL Detection with Lexical Features and Blacklisted Domains

Hong, Jiwon; Kim, Taeri; Liu, Jing; Park, Noseong; Kim, Sang-Wook

doi:10.1007/978-3-030-33432-1_12


Abstract

Many cyberattacks start with phishing, which lures victims to malicious web pages where malware is hidden. Once a victim's machine is infected, the attacker can intrude into the enterprise network, evading firewalls. It is therefore of fundamental importance to detect phishing URLs and prevent employees from visiting them. Many machine learning methods have been proposed so far. In this work, we collect many lexical features from a literature survey and combine them with blacklisted domains to improve detection performance. We also collect many recent phishing URLs because most open datasets are outdated. Our method achieves an F1 score of 0.84.
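
To make the idea of combining lexical features with blacklisted domains concrete, here is a minimal Python sketch. It is an illustration only, not the chapter's implementation: the abstract does not list the actual features, so the feature names, the `lexical_features` helper, and the entries in `BLACKLISTED_DOMAINS` are all assumptions chosen for the example.

```python
from urllib.parse import urlparse

# Hypothetical blacklist for illustration; a real system would load
# blacklisted domains from an external, regularly updated source.
BLACKLISTED_DOMAINS = {"bad-login.example", "secure-update.example"}


def lexical_features(url: str) -> dict:
    """Return a few illustrative lexical features plus a blacklist flag."""
    # Add a scheme when it is missing so urlparse can find the host.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = (parsed.hostname or "").lower()
    labels = host.split(".") if host else []
    # The host itself and each of its parent domains are blacklist candidates.
    candidates = {".".join(labels[i:]) for i in range(len(labels))}
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": "@" in url,
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
        "domain_blacklisted": bool(candidates & BLACKLISTED_DOMAINS),
    }


if __name__ == "__main__":
    for example in (
        "http://paypal.bad-login.example/account/verify?id=123",
        "https://www.example.org/",
    ):
        print(example, lexical_features(example))
```

A feature dictionary of this kind could be turned into a numeric vector and passed to any standard classifier. The blacklist flag is kept as one feature among many rather than as a hard rule, so a classifier trained on such vectors can still flag URLs whose domains are not yet blacklisted.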

Notes

1. Jiwon Hong, Taeri Kim, and Jing Liu are listed in alphabetical order and contributed equally. Noseong Park and Sang-Wook Kim are the co-corresponding authors.

Author information

Authors and Affiliations

Hanyang University, Seoul, South Korea
Jiwon Hong, Taeri Kim & Sang-Wook Kim
George Mason University, Fairfax, VA, USA
Jing Liu & Noseong Park

Corresponding author

Correspondence to Noseong Park.

Editor information

Editors and Affiliations

Center for Secure Information Systems, George Mason University, Fairfax, VA, USA
Sushil Jajodia
Thayer School of Engineering, Dartmouth College, Hanover, NH, USA
George Cybenko
Department of Computer Science, Dartmouth College, Hanover, NH, USA
V.S. Subrahmanian
MS T310, MITRE Corporation, McLean, VA, USA
Vipin Swarup
Computing and Information Science Division, Army Research Office, Durham, NC, USA
Cliff Wang
Computer Science & Engineering, University of Michigan, Ann Arbor, MI, USA
Michael Wellman

About this chapter

Cite this chapter

Hong, J., Kim, T., Liu, J., Park, N., Kim, SW. (2020). Phishing URL Detection with Lexical Features and Blacklisted Domains. In: Jajodia, S., Cybenko, G., Subrahmanian, V., Swarup, V., Wang, C., Wellman, M. (eds) Adaptive Autonomous Secure Cyber Systems. Springer, Cham. https://doi.org/10.1007/978-3-030-33432-1_12

DOI: https://doi.org/10.1007/978-3-030-33432-1_12
Published: 05 February 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33431-4
Online ISBN: 978-3-030-33432-1
eBook Packages: Computer Science, Computer Science (R0)
