
Application of word embedding and machine learning in detecting phishing websites

Telecommunication Systems

Abstract

Phishing is an attack that aims to obtain personal information, such as passwords and credit card details, from online users by deceiving them through fake websites, emails, or other seemingly legitimate internet services. Many techniques exist to detect phishing sites, including third-party-based techniques, source-code-based methods, and URL-based methods, yet users are still tricked into revealing sensitive information. In this paper, we propose a new technique that detects phishing sites with word embeddings, using plain text and domain-specific text extracted from the source code. We applied various word embeddings to evaluate our model using ensemble and multimodal approaches. In the experimental evaluation, the multimodal approach with domain-specific text achieved an accuracy of 99.34%, with a TPR of 99.59%, an FPR of 0.93%, and an MCC of 98.68%.
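The four figures quoted in the abstract are standard binary-classification metrics: accuracy, true positive rate (TPR), false positive rate (FPR), and the Matthews correlation coefficient (MCC). The following is a minimal sketch of how they are derived from confusion-matrix counts, assuming phishing is the positive class. The function name `binary_metrics` and the counts in the usage lines are hypothetical and chosen only for illustration; they are not the paper's data or code.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, TPR, FPR and MCC from confusion-matrix counts.

    Assumes phishing is the positive class:
      tp = phishing sites flagged as phishing
      fn = phishing sites missed
      tn = legitimate sites passed as legitimate
      fp = legitimate sites wrongly flagged
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # recall on phishing sites
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # false alarms on legitimate sites
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC lies in [-1, 1]; it is conventionally set to 0 when undefined.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "tpr": tpr, "fpr": fpr, "mcc": mcc}


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    # Multiply by 100 to express each metric as a percentage.
    print(f"{name}: {value * 100:.2f}%")
```

MCC uses all four cells of the confusion matrix, so it stays informative when the two classes are imbalanced. That is a common reason to report it alongside accuracy.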




Acknowledgements

The authors would like to thank the Ministry of Electronics and Information Technology (MeitY), Government of India, for supporting part of this research.

Author information

Corresponding author

Correspondence to Routhu Srinivasa Rao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Rao, R.S., Umarekar, A. & Pais, A.R. Application of word embedding and machine learning in detecting phishing websites. Telecommun Syst 79, 33–45 (2022). https://doi.org/10.1007/s11235-021-00850-6


  • DOI: https://doi.org/10.1007/s11235-021-00850-6
