Abstract
Computers are constantly changing the way we do things, and people spend many hours on their phones or computers for work or leisure. The danger is that unsuspecting users can be targeted at any time and fall victim to scams or phishing attacks. These attacks can harm users by stealing valuable credentials or money, or by installing malicious software on their devices, all without the user being aware of what has happened. In a business environment, they can lead to mass data breaches that could cost a company millions of euros. Many users are not trained to recognize phishing texts, so an alternative solution is needed to help prevent them from falling into these traps. In this paper we investigate Natural Language Processing (NLP), a subfield of Machine Learning (ML), as a way to address the problem of phishing. We investigate different NLP solutions (Word2Vec, Doc2Vec and BERT) and different ML solutions (RNN, LSTM, CNN and TF-IDF). All of these approaches provide good classification results, with F1-scores ranging from 90.03 to 98.94.
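The reported metric can be illustrated with a minimal sketch, assuming the scores are binary F1-scores expressed as percentages with phishing as the positive class (the abstract does not state the averaging scheme). The function name `f1_score` and the toy labels below are hypothetical and are not taken from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1-score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]

score = f1_score(y_true, y_pred)
print(f"F1 = {score * 100:.2f}")
```

Because F1 combines precision and recall, it penalizes both missed phishing messages and legitimate messages flagged as phishing, which plain accuracy can hide when the classes are imbalanced.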
Acknowledgments
This work was partially supported by the Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), within project “CybersSeCIP” (NORTE-01-0145-FEDER-000044).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Jonker, R.A.A., Poudel, R., Pedrosa, T., Lopes, R.P. (2021). Using Natural Language Processing for Phishing Detection. In: Pereira, A.I., et al. Optimization, Learning Algorithms and Applications. OL2A 2021. Communications in Computer and Information Science, vol 1488. Springer, Cham. https://doi.org/10.1007/978-3-030-91885-9_40
DOI: https://doi.org/10.1007/978-3-030-91885-9_40
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91884-2
Online ISBN: 978-3-030-91885-9
eBook Packages: Computer Science, Computer Science (R0)