Application of word embedding and machine learning in detecting phishing websites

Rao, Routhu Srinivasa; Umarekar, Amey; Pais, Alwyn Roshan

doi:10.1007/s11235-021-00850-6

Published: 19 November 2021

Volume 79, pages 33–45, (2022)

Telecommunication Systems

Abstract

Phishing is an attack that aims to obtain personal information, such as passwords and credit card details, from online users by deceiving them through fake websites, emails, or any legitimate internet service. Many techniques exist to detect phishing sites, including third-party-based techniques, source-code-based methods, and URL-based methods, yet users are still tricked into revealing sensitive information. In this paper, we propose a new technique that detects phishing sites with word embeddings, using plain text and domain-specific text extracted from the source code. We applied various word embeddings to evaluate our model using ensemble and multimodal approaches. In the experimental evaluation, the multimodal approach with domain-specific text achieved an accuracy of 99.34%, with a TPR of 99.59%, an FPR of 0.93%, and an MCC of 98.68%.
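
The four figures reported above follow from the standard confusion-matrix definitions. The sketch below is a minimal illustration of those definitions, assuming phishing is the positive class; the function name `confusion_metrics` and the example counts are hypothetical and are not taken from the paper or its dataset.

```python
import math


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, TPR, FPR and MCC from confusion-matrix counts.

    Assumption: phishing is the positive class, so `tp` counts phishing
    pages flagged as phishing and `fp` counts legitimate pages flagged
    as phishing.
    """

    def safe_div(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a class is empty.
        return num / den if den else 0.0

    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    tpr = safe_div(tp, tp + fn)  # true positive rate (recall)
    fpr = safe_div(fp, fp + tn)  # false positive rate
    # Matthews correlation coefficient, in the range [-1, 1].
    mcc = safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"accuracy": accuracy, "tpr": tpr, "fpr": fpr, "mcc": mcc}


# Hypothetical counts, for illustration only.
example = confusion_metrics(tp=90, fp=5, tn=95, fn=10)
```

MCC is defined on a scale from -1 to 1, so the 98.68% in the abstract is presumably the coefficient multiplied by 100.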

Acknowledgements

The authors would like to thank the Ministry of Electronics and Information Technology (MeitY), Government of India, for supporting part of this research.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, GMR Institute of Technology, Rajam, Andhra Pradesh, 532127, India
Routhu Srinivasa Rao
Information Security Research Lab, Department of Computer Science and Engineering, National Institute of Technology, Surathkal, Karnataka, 575025, India
Amey Umarekar & Alwyn Roshan Pais

Corresponding author

Correspondence to Routhu Srinivasa Rao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Rao, R.S., Umarekar, A. & Pais, A.R. Application of word embedding and machine learning in detecting phishing websites. Telecommun Syst 79, 33–45 (2022). https://doi.org/10.1007/s11235-021-00850-6

Accepted: 19 October 2021
Published: 19 November 2021
Issue Date: January 2022
DOI: https://doi.org/10.1007/s11235-021-00850-6
