Explainable Machine Learning for Bag of Words-Based Phishing Detection

Calzarossa, Maria Carla; Giudici, Paolo; Zieni, Rasha

doi:10.1007/978-3-031-44064-9_28

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1901)

Included in the following conference series:

World Conference on Explainable Artificial Intelligence

Abstract

Phishing is a fraudulent practice that aims to convince individuals to reveal sensitive information, such as account credentials or credit card details, by getting them to click links to malicious websites. To reduce the impact of phishing, timely identification of these websites is essential, and machine learning models are often devised for this purpose. In this paper, we address website phishing detection by proposing an explainable machine learning model based on bag-of-words features extracted from the content of webpages. To select the most important features for the model, we propose employing the Lorenz Zonoid, the multidimensional generalization of the Gini coefficient. The resulting model achieves good accuracy and explains which words are most likely to be associated with phishing websites. In addition, the number of features retained is significantly reduced, which makes the model parsimonious and easier to interpret.
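
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the helper names `bag_of_words` and `gini` and the sample pages are hypothetical, and the code only computes the one-dimensional Gini coefficient, which the abstract says the Lorenz Zonoid generalizes. How the paper actually uses the Lorenz Zonoid to rank and retain features is not specified here.

```python
import re
from collections import Counter


def bag_of_words(pages):
    """Return (vocabulary, count rows) for a list of page texts.

    Hypothetical helper: lowercases each text, keeps alphabetic tokens,
    and counts how often each vocabulary word occurs in each page.
    """
    tokenized = [re.findall(r"[a-z]+", page.lower()) for page in pages]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocab])
    return vocab, rows


def gini(values):
    """Gini coefficient of a list of non-negative numbers.

    Returns 0.0 when all values are equal (or the list is empty or all
    zero) and approaches 1 as the total concentrates in a single entry.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Made-up example pages, for illustration only.
pages = [
    "Verify your account password now",
    "Welcome to our cooking blog",
    "Update account password to verify login",
]
vocab, rows = bag_of_words(pages)

# Concentration of each word's counts across pages (an illustrative
# statistic, not the paper's feature selection criterion).
word_gini = {
    word: gini([row[j] for row in rows]) for j, word in enumerate(vocab)
}
```

In practice a library vectorizer would normally replace the hand-written tokenizer; the standard library version is used here only to keep the sketch self-contained.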

Author information

Authors and Affiliations

Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
Maria Carla Calzarossa & Rasha Zieni
Department of Economics and Management, University of Pavia, Pavia, Italy
Paolo Giudici

Corresponding author

Correspondence to Maria Carla Calzarossa.

Editor information

Editors and Affiliations

Technological University Dublin, Dublin, Ireland
Luca Longo

About this paper

Cite this paper

Calzarossa, M.C., Giudici, P., Zieni, R. (2023). Explainable Machine Learning for Bag of Words-Based Phishing Detection. In: Longo, L. (eds) Explainable Artificial Intelligence. xAI 2023. Communications in Computer and Information Science, vol 1901. Springer, Cham. https://doi.org/10.1007/978-3-031-44064-9_28

DOI: https://doi.org/10.1007/978-3-031-44064-9_28
Published: 30 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44063-2
Online ISBN: 978-3-031-44064-9
eBook Packages: Computer Science, Computer Science (R0)
