Explainable Machine Learning for Bag of Words-Based Phishing Detection

  • Conference paper
Explainable Artificial Intelligence (xAI 2023)

Abstract

Phishing is a fraudulent practice that aims to convince individuals to reveal sensitive information, such as account credentials or credit card details, by luring them into clicking links to malicious websites. To reduce the impact of phishing, the timely identification of these websites is essential, and machine learning models are often devised for this purpose. In this paper, we address the problem of website phishing detection by proposing an explainable machine learning model based on bag of words features extracted from the content of the webpages. To select the most important features to be used in the model, we propose to employ the Lorenz Zonoid, the multidimensional generalization of the Gini coefficient. The resulting model achieves good accuracy and explains which words are most likely associated with phishing websites. In addition, the number of features retained is significantly reduced, which makes the model parsimonious and easier to interpret.
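
As a rough illustration of the two ingredients named in the abstract, the following minimal sketch builds bag of words counts from page text and computes the Gini coefficient, which the abstract says the Lorenz Zonoid generalizes. It is not the authors' feature selection procedure: the function names, the whitespace tokenizer, and the toy pages are all assumptions made for this example.

```python
from collections import Counter


def bag_of_words(pages):
    """Return (vocabulary, count rows) for a list of page texts.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    tokenized = [page.lower().split() for page in pages]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocab])
    return vocab, rows


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-based formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Toy pages, invented for illustration only.
pages = [
    "verify your account now",
    "account news today",
    "verify verify password",
]
vocab, rows = bag_of_words(pages)

# Concentration of each word's counts across the pages.
for j, word in enumerate(vocab):
    column = [row[j] for row in rows]
    print(word, round(gini(column), 3))
```

Here the Gini coefficient only measures how unevenly a single word's counts are spread across pages. How the paper applies the Lorenz Zonoid to rank and retain features is not described in the abstract, so that step is deliberately left out.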

Corresponding author

Correspondence to Maria Carla Calzarossa.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Calzarossa, M.C., Giudici, P., Zieni, R. (2023). Explainable Machine Learning for Bag of Words-Based Phishing Detection. In: Longo, L. (eds) Explainable Artificial Intelligence. xAI 2023. Communications in Computer and Information Science, vol 1901. Springer, Cham. https://doi.org/10.1007/978-3-031-44064-9_28

  • DOI: https://doi.org/10.1007/978-3-031-44064-9_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44063-2

  • Online ISBN: 978-3-031-44064-9

  • eBook Packages: Computer Science, Computer Science (R0)