Abstract
Phishing is a fraudulent practice that aims to convince individuals to reveal sensitive information, such as account credentials or credit card details, by luring them into clicking links to malicious websites. Timely identification of these websites is essential to reduce the impact of phishing, and machine learning models are often devised for this purpose. In this paper, we address the problem of website phishing detection by proposing an explainable machine learning model based on bag of words features extracted from the content of the webpages. To select the most important features to be used in the model, we propose to employ the Lorenz Zonoid, the multidimensional generalization of the Gini coefficient. The resulting model achieves good accuracy and explains which words are most likely associated with phishing websites. In addition, the number of features retained is significantly reduced, which makes the model parsimonious and easier to interpret.
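The two ingredients named in the abstract, bag of words features and the Lorenz Zonoid, can be illustrated with a minimal sketch. It is not the authors' procedure: the function names (`bag_of_words`, `lorenz_zonoid`, `explained_share`), the toy pages and labels, and the use of a one-feature least-squares fit as the model are all assumptions made for illustration. The sketch relies only on the fact that in one dimension the Lorenz Zonoid coincides with the Gini coefficient, and scores each word by how much of the label's Lorenz Zonoid its fitted values reproduce.

```python
import re
from collections import Counter


def bag_of_words(pages):
    """Build a sorted vocabulary and a per-page word-count matrix."""
    tokenized = [re.findall(r"[a-z]+", page.lower()) for page in pages]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


def lorenz_zonoid(values):
    """One-dimensional Lorenz Zonoid, i.e. the Gini coefficient.

    Uses the rank form 2 * sum(i * y_(i)) / (n * sum(y)) - (n + 1) / n,
    with y_(i) sorted in ascending order and i = 1..n.
    """
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * y for rank, y in enumerate(ordered, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def explained_share(feature, labels):
    """Share of the label's Lorenz Zonoid reproduced by a one-feature fit.

    Assumption: a simple least-squares line stands in for the model.
    """
    n = len(labels)
    x_mean = sum(feature) / n
    y_mean = sum(labels) / n
    s_xx = sum((x - x_mean) ** 2 for x in feature)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(feature, labels))
    slope = s_xy / s_xx if s_xx > 0 else 0.0
    fitted = [y_mean + slope * (x - x_mean) for x in feature]
    baseline = lorenz_zonoid(labels)
    return lorenz_zonoid(fitted) / baseline if baseline > 0 else 0.0


# Hypothetical toy data: 1 = phishing page, 0 = legitimate page.
pages = [
    "verify your account password now",
    "update password to verify account",
    "weekly recipes and garden tips",
    "garden club meeting notes",
]
labels = [1, 1, 0, 0]

vocabulary, matrix = bag_of_words(pages)
shares = {
    word: explained_share([row[j] for row in matrix], labels)
    for j, word in enumerate(vocabulary)
}

# Words with the largest share would be retained first.
for word, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{word:10s} {share:.2f}")
```

A word whose counts separate the two classes perfectly reproduces the full Lorenz Zonoid of the labels, while a word that appears on a single page reproduces only part of it. Keeping only the words with the largest shares is one simple way to obtain the kind of parsimonious, interpretable feature set the abstract describes.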
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Calzarossa, M.C., Giudici, P., Zieni, R. (2023). Explainable Machine Learning for Bag of Words-Based Phishing Detection. In: Longo, L. (eds) Explainable Artificial Intelligence. xAI 2023. Communications in Computer and Information Science, vol 1901. Springer, Cham. https://doi.org/10.1007/978-3-031-44064-9_28
DOI: https://doi.org/10.1007/978-3-031-44064-9_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44063-2
Online ISBN: 978-3-031-44064-9
eBook Packages: Computer Science, Computer Science (R0)