Highly accurate phishing URL detection based on machine learning

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Phishing is a persistent and major threat on the internet, and it is growing steadily. It is a type of cyber-attack in which a phisher mimics a legitimate web page to harvest a victim's sensitive information, such as usernames, emails, passwords, and bank or credit card details. To prevent such attacks, several families of phishing detection techniques have been proposed, including AI-based, third-party, heuristic, and content-based techniques. However, these approaches suffer from a number of limitations that need to be addressed in order to detect phishing URLs. Firstly, the feature sets used in the past are extensive, and extracting them takes a considerable amount of time. Secondly, several approaches select important features using statistical methods, while others propose their own features. Although both methods have been implemented successfully in various approaches, they produce incorrect results when they are not supported by domain knowledge. Thirdly, most of the literature has used small, pre-classified datasets, so the reported efficiency and precision do not carry over to large, real-world datasets. Fourthly, previously proposed approaches lack advanced evaluation measures. Hence, this paper proposes an effective machine learning framework that predicts phishing URLs without visiting the webpage or using any third-party services. The proposed technique is URL-based and uses the full URL, protocol scheme, hostname, path area of the URL, an entropy feature, suspicious words, and brand name matching with the TF-IDF technique to classify phishing URLs. The experiments are carried out on six different datasets using eight different machine learning classifiers, among which Random Forest achieved significantly higher accuracy than the other classifiers on all the datasets. The proposed framework, with only 30 features, achieved accuracies of 96.25% and 94.65% on the Kaggle datasets. The comparative results show that the proposed model achieved accuracies of 92.2%, 91.63%, 94.80%, and 96.85% on benchmark datasets, which is higher than the existing approaches.
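To make the URL-only feature idea concrete, the following is a minimal sketch of how a few lexical features of the kind listed above (protocol scheme, hostname, path, entropy, suspicious words) might be computed without fetching the page. It is an illustration under stated assumptions, not the authors' 30-feature set: the function names, the feature names, and the `SUSPICIOUS_WORDS` list are hypothetical, and the TF-IDF brand matching and the Random Forest classifier are left out.

```python
import math
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical word list for illustration only; the abstract does not
# state which suspicious words the paper actually uses.
SUSPICIOUS_WORDS = ("login", "verify", "secure", "account", "update")


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url: str) -> dict:
    """Compute a handful of lexical features from the URL string alone."""
    parts = urlsplit(url)
    hostname = parts.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "uses_https": int(parts.scheme == "https"),
        "hostname_length": len(hostname),
        "hostname_dots": hostname.count("."),
        "path_length": len(parts.path),
        "path_depth": parts.path.count("/"),
        "url_entropy": shannon_entropy(url),
        # Number of distinct listed words that occur anywhere in the URL.
        "suspicious_word_count": sum(word in lowered for word in SUSPICIOUS_WORDS),
    }


if __name__ == "__main__":
    for name, value in extract_features(
        "http://secure-login.example.com/verify/account"
    ).items():
        print(f"{name}: {value}")
```

A dictionary like this could be turned into one row of a feature matrix and passed to any classifier. Everything is computed from the URL string, which matches the abstract's stated goal of avoiding page visits and third-party lookups.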

Author information

Corresponding author

Correspondence to Alvis Fong.

Cite this article

Jalil, S., Usman, M. & Fong, A. Highly accurate phishing URL detection based on machine learning. J Ambient Intell Human Comput 14, 9233–9251 (2023). https://doi.org/10.1007/s12652-022-04426-3

  • DOI: https://doi.org/10.1007/s12652-022-04426-3
