Phishing URL Detection with Lexical Features and Blacklisted Domains

Adaptive Autonomous Secure Cyber Systems

Abstract

Many cyberattacks start with phishing, which lures victims to malicious web pages where malware is hidden. Once a victim's machine is infected, the attacker can intrude into the enterprise network while evading firewalls. It is therefore of fundamental importance to detect phishing URLs and prevent employees from visiting them. Many machine learning methods have been proposed for this task. In this work, we collect a large set of lexical features identified through a literature survey and combine them with blacklisted domains to improve detection performance. We also collect many recent phishing URLs because most open datasets are outdated. Our method achieves an F1 score of 0.84.
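As a rough illustration of the idea in the abstract, the sketch below derives a few lexical features from a URL string and adds one flag for whether its domain appears on a blacklist. It is a minimal sketch, not the chapter's method: the particular features, the function names `lexical_features` and `is_blacklisted`, and the contents of `BLACKLISTED_DOMAINS` are all assumptions made for this example.

```python
from urllib.parse import urlparse

# Hypothetical blacklist for illustration only; a real system would load
# known-bad domains from an external source.
BLACKLISTED_DOMAINS = {"bad-login.example", "evil.test"}


def is_blacklisted(host, blacklist=BLACKLISTED_DOMAINS):
    """Return True if the host or any of its parent domains is blacklisted."""
    labels = host.split(".")
    return any(".".join(labels[i:]) in blacklist for i in range(len(labels)))


def lexical_features(url):
    """Compute a small, assumed set of lexical features from the URL string."""
    # Add a scheme when missing so that urlparse can find the host.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = (parsed.hostname or "").lower()
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
        # Blacklist signal combined with the lexical features above.
        "domain_blacklisted": is_blacklisted(host),
    }


features = lexical_features("http://secure.bad-login.example/account/verify?id=123")
```

A feature dictionary like this could then be converted to a numeric vector and passed to any standard classifier; which classifier and which features the chapter actually uses is not stated in the abstract.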

Notes

  1. Jiwon Hong, Taeri Kim, and Jing Liu are listed in alphabetical order and contributed equally. Noseong Park and Sang-Wook Kim are the co-corresponding authors.

Author information

Corresponding author

Correspondence to Noseong Park.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Hong, J., Kim, T., Liu, J., Park, N., Kim, SW. (2020). Phishing URL Detection with Lexical Features and Blacklisted Domains. In: Jajodia, S., Cybenko, G., Subrahmanian, V., Swarup, V., Wang, C., Wellman, M. (eds) Adaptive Autonomous Secure Cyber Systems. Springer, Cham. https://doi.org/10.1007/978-3-030-33432-1_12

  • DOI: https://doi.org/10.1007/978-3-030-33432-1_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-33431-4

  • Online ISBN: 978-3-030-33432-1

  • eBook Packages: Computer Science (R0)
