Using Natural Language Processing for Phishing Detection

  • Conference paper
Part of the book series: Communications in Computer and Information Science (CCIS, volume 1488)

Abstract

We live in a world where computers are constantly changing the way we do things. People spend many hours on their phones or computers, whether for work or leisure. The danger is that these unsuspecting users can be targeted at any time and can fall victim to many types of scams or phishing attacks. Such attacks can harm the user by stealing valuable credentials or money, or by installing malicious software on their devices, all while the user is unaware of what has happened. In a business environment, they can lead to mass data breaches that could cost a company millions of euros. Many users are not trained to recognize phishing texts, so an alternative solution is needed to help prevent users from falling into these traps. In this paper we investigate Natural Language Processing (NLP), a subfield of Machine Learning (ML), to try to generate solutions to the problem of phishing. We investigate different NLP solutions (Word2Vec, Doc2Vec and BERT) and different ML solutions (RNN, LSTM, CNN and TF-IDF). All of these approaches provide good classification results, with F1-scores ranging from 90.03 to 98.94.
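To make two of the terms in the abstract concrete, the following is a minimal sketch of TF-IDF weighting and of the F1-score used to report results. It is not the authors' implementation: the function names `tf_idf` and `f1_score`, the whitespace tokenizer, and the particular TF-IDF variant (length-normalized term frequency times unsmoothed log inverse document frequency) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = (count / doc_length) * log(N / df).
    """
    # Naive lowercase whitespace tokenization (an illustrative assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # No true positives: precision and recall are zero or undefined.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under this variant, a term that occurs in every document receives weight zero, so the terms that distinguish one message from the others carry the signal passed to a classifier. The function returns F1 as a fraction between 0 and 1; the range quoted in the abstract appears to be expressed as percentages.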

Notes

  1. https://monkey.org/~jose/phishing/

  2. https://www.kaggle.com/rtatman/fraudulent-email-corpus

Acknowledgments

This work was partially supported by the Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), within project “CybersSeCIP” (NORTE-01-0145-FEDER-000044).

Author information

Corresponding author

Correspondence to Richard Adolph Aires Jonker.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Jonker, R.A.A., Poudel, R., Pedrosa, T., Lopes, R.P. (2021). Using Natural Language Processing for Phishing Detection. In: Pereira, A.I., et al. Optimization, Learning Algorithms and Applications. OL2A 2021. Communications in Computer and Information Science, vol 1488. Springer, Cham. https://doi.org/10.1007/978-3-030-91885-9_40

  • DOI: https://doi.org/10.1007/978-3-030-91885-9_40

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91884-2

  • Online ISBN: 978-3-030-91885-9

  • eBook Packages: Computer Science (R0)
