Using Natural Language Processing for Phishing Detection

Jonker, Richard Adolph Aires; Poudel, Roshan; Pedrosa, Tiago; Lopes, Rui Pedro

doi:10.1007/978-3-030-91885-9_40

Conference paper
First Online: 01 January 2022

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1488)

Abstract

We live in a world where computers are constantly changing the way we do things. People spend many hours on their phones or computers, whether for work or leisure. The danger is that these unsuspecting users can be targeted at any time and can fall victim to many types of scams or phishing attacks. These attacks can harm the user by stealing valuable credentials or money, or even by installing malicious software on their devices, all while the user is unaware of what has just happened. In a business environment, these attacks can lead to mass data breaches that could end up costing a company millions of euros. Many users are not trained to recognize phishing texts, so an alternative solution is needed to help prevent users from falling into these traps. In this paper we investigate Natural Language Processing (NLP), a subfield of Machine Learning (ML), to try to generate solutions to the problem of phishing. We investigate different NLP solutions (Word2Vec, Doc2Vec and BERT) and different ML solutions (RNN, LSTM, CNN and TF-IDF). All of these approaches provide good classification results, with F1-scores ranging from 90.03 to 98.94.
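To make two of the terms in the abstract concrete, the following is a minimal sketch of TF-IDF weighting and of the F1-score, written in plain Python. It is not the authors' pipeline: the four training messages, the two test messages, the nearest-centroid rule and every function name are assumptions chosen only for illustration, and the toy data says nothing about the results reported in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; real systems clean text further).
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency learned from the training texts.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(doc, idf):
    # Term frequency times IDF; tokens unseen during training are ignored.
    counts = Counter(t for t in tokenize(doc) if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    # Component-wise mean of a list of sparse vectors.
    acc = {}
    for vec in vectors:
        for t, w in vec.items():
            acc[t] = acc.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in acc.items()}


def f1_score(y_true, y_pred, positive=1):
    # F1 is the harmonic mean of precision and recall for the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, hand-written messages (hypothetical; label 1 = phishing, 0 = legitimate).
train = [
    ("verify your account password now", 1),
    ("urgent click link to confirm your bank account", 1),
    ("meeting agenda attached for monday", 0),
    ("lunch tomorrow with the project team", 0),
]
test = [
    ("confirm your password now", 1),
    ("project meeting on monday", 0),
]

idf = fit_idf([text for text, _ in train])
phish_centroid = centroid([tfidf(t, idf) for t, y in train if y == 1])
legit_centroid = centroid([tfidf(t, idf) for t, y in train if y == 0])

predictions = []
for text, _ in test:
    vec = tfidf(text, idf)
    is_phish = cosine(vec, phish_centroid) > cosine(vec, legit_centroid)
    predictions.append(1 if is_phish else 0)

score = f1_score([y for _, y in test], predictions)
print(predictions, score)
```

The nearest-centroid rule is used here only because it keeps the example dependency-free; the neural models named in the abstract (RNN, LSTM, CNN, BERT) would replace that step in a realistic setup.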

Acknowledgments

This work was partially supported by the Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), within project “CybersSeCIP” (NORTE-01-0145-FEDER-000044).

Author information

Authors and Affiliations

Research Centre in Digitalization and Intelligent Robotics (CeDRI), Instituto Politécnico de Bragança, Bragança, Portugal
Richard Adolph Aires Jonker, Roshan Poudel, Tiago Pedrosa & Rui Pedro Lopes

Corresponding author

Correspondence to Richard Adolph Aires Jonker.

Editor information

Editors and Affiliations

Instituto Politécnico de Bragança, Bragança, Portugal
Ana I. Pereira, Florbela P. Fernandes, João P. Coelho, João P. Teixeira, Maria F. Pacheco, Paulo Alves & Rui P. Lopes

About this paper

Cite this paper

Jonker, R.A.A., Poudel, R., Pedrosa, T., Lopes, R.P. (2021). Using Natural Language Processing for Phishing Detection. In: Pereira, A.I., et al. Optimization, Learning Algorithms and Applications. OL2A 2021. Communications in Computer and Information Science, vol 1488. Springer, Cham. https://doi.org/10.1007/978-3-030-91885-9_40

DOI: https://doi.org/10.1007/978-3-030-91885-9_40
Published: 01 January 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91884-2
Online ISBN: 978-3-030-91885-9
eBook Packages: Computer Science, Computer Science (R0)
