Abstract
Phishing is an attack that aims to obtain personal information, such as passwords and credit card details, from online users by deceiving them through fake websites, emails, or other seemingly legitimate internet services. Many techniques exist to detect phishing sites, including third-party based techniques, source code based methods, and URL based methods, yet users are still tricked into revealing their sensitive information. In this paper, we propose a new technique that detects phishing sites with word embeddings, using plain text and domain specific text extracted from the source code. We evaluated our model with various word embeddings using ensemble and multimodal approaches. In the experimental evaluation, the multimodal approach with domain specific text achieved an accuracy of 99.34%, with a TPR of 99.59%, an FPR of 0.93%, and an MCC of 98.68%.
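The four figures reported above (accuracy, TPR, FPR, MCC) are standard binary classification metrics derived from a confusion matrix. The following is a minimal sketch of how they are conventionally computed, treating phishing as the positive class. The function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration; they are not the paper's data or code. MCC is returned on its usual scale of -1 to 1, so a value reported as 98.68% would correspond to 0.9868.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, TPR, FPR and MCC from confusion-matrix counts.

    tp: phishing sites correctly flagged as phishing
    fp: legitimate sites wrongly flagged as phishing
    tn: legitimate sites correctly passed
    fn: phishing sites wrongly passed as legitimate
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # True positive rate (recall): share of phishing sites that were detected.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    # False positive rate: share of legitimate sites that were misclassified.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "tpr": tpr, "fpr": fpr, "mcc": mcc}


# Hypothetical counts for illustration only (not taken from the paper).
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the phishing and legitimate classes are imbalanced.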
Acknowledgements
The authors would like to thank the Ministry of Electronics and Information Technology (MeitY), Government of India, for supporting part of this research.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Rao, R.S., Umarekar, A. & Pais, A.R. Application of word embedding and machine learning in detecting phishing websites. Telecommun Syst 79, 33–45 (2022). https://doi.org/10.1007/s11235-021-00850-6
DOI: https://doi.org/10.1007/s11235-021-00850-6