Machine learning based phishing detection from URLs

doi:10.1016/j.eswa.2018.09.029

Expert Systems with Applications

Volume 117, 1 March 2019, Pages 345-357

https://doi.org/10.1016/j.eswa.2018.09.029

Highlights

• Use of 7 different classification algorithms and NLP based features.
• A Big URL Data Set is produced and shared (36,400 legitimate and 37,175 phishing).
• Real-time and language-independent classification algorithms.
• Feature-rich classifiers with Word Vectors, NLP-based and Hybrid features.
• The proposed approach reaches 97.98% accuracy rate.

Abstract

With the rapid growth of the Internet, users have shifted their preference from traditional shopping to electronic commerce. Instead of robbing banks or shops, criminals now seek their victims in cyberspace using specific tricks. Exploiting the anonymous structure of the Internet, attackers devise new techniques, such as phishing, which deceive victims with false websites in order to collect sensitive information such as account IDs, usernames, and passwords. Determining whether a web page is legitimate or phishing is a very challenging problem because of the semantics-based structure of the attack, which mainly exploits computer users’ vulnerabilities. Although software companies launch new anti-phishing products that use blacklists, heuristics, visual and machine learning-based approaches, these products cannot prevent all phishing attacks. In this paper, a real-time anti-phishing system that uses seven different classification algorithms and natural language processing (NLP) based features is proposed. The system is distinguished from other studies in the literature by the following properties: language independence, use of a large volume of phishing and legitimate data, real-time execution, detection of new websites, independence from third-party services, and use of feature-rich classifiers. To measure the performance of the system, a new dataset is constructed and the experiments are run on it. According to the experimental and comparative results from the implemented classification algorithms, the Random Forest algorithm with only NLP based features gives the best performance, with a 97.98% accuracy rate for the detection of phishing URLs.

Introduction

Due to rapid developments in global networking and communication technologies, many of our daily activities, such as social networking, electronic banking, and e-commerce, have moved to cyberspace. The open, anonymous and uncontrolled infrastructure of the Internet provides an excellent platform for cyberattacks, which pose serious security risks not only for networks but also for ordinary computer users, even experienced ones. Although the carefulness and experience of the user are important, it is not possible to completely prevent users from falling for phishing scams (Greene, Steves, & Theofanos, 2018), because attackers also take the personality characteristics of the end user into account to increase the success of their attacks, especially when deceiving relatively experienced users (Curtis, Rajivan, Jones, & Gonzalez, 2018). End-user-targeted cyberattacks cause massive losses of sensitive/personal information and even money for individuals, and the total can reach billions of dollars a year (Shaikh, Shabut, & Hossain, 2016).

The term phishing is derived by analogy from “fishing” for victims, and this type of attack has attracted a great deal of attention from researchers in recent years. It is also a promising and attractive technique for attackers (also called phishers), who set up fraudulent websites whose design closely imitates that of popular, legitimate sites on the Internet. Although these pages have similar graphical user interfaces, they must have Uniform Resource Locators (URLs) that differ from those of the original pages. In principle, a careful and experienced user can easily detect these malicious web pages by looking at the URLs. However, in the rush of daily life, end users usually do not examine the whole address of the page they are visiting, which they generally reach through other web pages, social networking tools, or simply an email message, as depicted in Fig. 1. With this type of fraudulent URL, a phisher tries to capture sensitive and personal information of the victim, such as financial data, usernames, and passwords (Gupta, Arachchilage, & Psannis, 2018). Users who land on such a fraudulent site and believe it to be the original website can readily give away their sensitive information without any doubt, because the page looks exactly the same as the original.

According to a related study on user experiences of phishing attacks (Volkamer, Renaud, Reinheimer, & Kunz, 2017), computer users fall for phishing for five main reasons:

• Users don't have detailed knowledge about URLs,
• Users don't know which web pages can be trusted,
• Users don't see the whole address of the web page because of redirection or hidden URLs,
• Users don't have much time to check the URL, or accidentally enter some web pages,
• Users cannot distinguish phishing web pages from legitimate ones.

The Anti-Phishing Working Group published a report on the state of phishing attacks in the last quarter of 2016 (APWG, 2017). They emphasized that phishing attacks especially target end users in developing countries: China ranks first with a rate of 47.09% (infected computers), followed by Turkey and Taiwan with rates of 42.88% and 38.98%, respectively. Additionally, with the increased use of smartphones, end users are less careful when checking their social networks on the move. Therefore, attackers target mobile device users to increase the efficiency of their attacks (Goel & Jain, 2018).

Several studies in the literature focus on detecting phishing attacks. In recent surveys, the authors discuss the general characteristics of existing phishing techniques by categorizing the technical approaches used in these types of attacks, and they highlight some practical and effective countermeasures (Chiew, Yong, & Tan, 2018; Qabajeh, Thabtah, & Chiclana, 2018).

Phishing attacks exploit the vulnerabilities of human users; therefore, additional support systems are needed to protect systems and users. The protection mechanisms are classified into two main groups: increasing the awareness of users and using additional programs, as depicted in Fig. 2. Because of the vulnerability of the end user, an attacker can target even experienced users with new techniques, and the user may be convinced that the page is legitimate before giving away sensitive information. Therefore, software-based phishing detection systems are preferred as decision support systems for the user. The most commonly preferred techniques are Black/White Lists (Cao, Han, & Le, 2008), Image Processing of the web page (Fu, Wenyin, & Deng, 2006; Toolan & Carthy, 2009), Natural Language Processing (Stone, 2007), Rules (Cook, Gurbani, & Daniluk, 2008), Machine Learning (Abu-Nimeh, Nappa, Wang, & Nair, 2007), etc.

In a recent survey on phishing (Gupta et al., 2018), the authors emphasized that whenever new solutions are proposed to overcome various phishing attacks, attackers find the vulnerabilities of those solutions and produce new attack types. Therefore, network security managers are strongly recommended to use hybrid models instead of a single approach.

In this paper, we focus on the real-time detection of phishing web pages by examining the URL of the web page with different machine learning algorithms (seven of which are implemented and compared in the paper) and different feature sets. In the execution of a learning algorithm, not only the dataset but also the extraction of features from this dataset is crucial. Therefore, we first collect a large number of legitimate and fraudulent web page URLs and construct our own dataset. We then define three different types of feature sets, namely Word Vectors, NLP based and Hybrid features, to measure the efficiency of the proposed system.
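As a rough illustration of how the three feature families could be derived from nothing but the URL string, the following minimal Python sketch may help. It is not the authors' implementation: the keyword list, the individual feature names, and the function names (`tokenize_url`, `nlp_style_features`, `word_vector`, `hybrid_features`) are all assumptions made for this example.

```python
from __future__ import annotations

import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical keyword list, for illustration only (not the paper's list).
SUSPICIOUS_KEYWORDS = {"login", "verify", "secure", "account", "update"}


def tokenize_url(url: str) -> list[str]:
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]


def nlp_style_features(url: str) -> dict[str, float]:
    """Hand-crafted lexical features computed from the raw URL string."""
    host = urlsplit(url).hostname or ""
    tokens = tokenize_url(url)
    return {
        "url_length": len(url),
        "host_dot_count": host.count("."),
        "digit_count": sum(ch.isdigit() for ch in url),
        "token_count": len(tokens),
        "longest_token_length": max((len(t) for t in tokens), default=0),
        "keyword_hits": sum(t in SUSPICIOUS_KEYWORDS for t in tokens),
        "has_at_symbol": float("@" in url),
    }


def word_vector(url: str, vocabulary: list[str]) -> list[int]:
    """Bag-of-words counts of the URL tokens over a fixed vocabulary."""
    counts = Counter(tokenize_url(url))
    return [counts[word] for word in vocabulary]


def hybrid_features(url: str, vocabulary: list[str]) -> list[float]:
    """Concatenate the hand-crafted features with the word vector."""
    return list(nlp_style_features(url).values()) + word_vector(url, vocabulary)
```

Each function needs only the URL itself, with no page download or third-party lookup, which is the property that makes real-time classification plausible. The resulting vectors could then be passed to any of the classifiers compared in the paper.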

The rest of the paper is organized as follows: in the next section, related works on phishing detection are examined. Section 3 focuses on the factors that make the detection of phishing attacks from URLs difficult. The proposed system and the acquisition of the dataset are detailed in Section 4. Comparative experiments are conducted, and the results are presented in Section 5. Advantages of the proposed system are discussed in Section 6. Finally, conclusions and future works on this topic are presented.

Section snippets

Related works

The phishing detection systems are generally divided into two groups: List Based Detection Systems and Machine Learning Based Detection Systems.

URLs and Attackers’ techniques

Attackers use various techniques to avoid detection by security mechanisms or system administrators. In this section, some of these techniques are detailed. To understand the attackers' approach, the components of a URL must first be known. The basic structure of a URL is depicted in Fig. 3.

In the standard form, a URL starts with its protocol name used to access the web page. After that, the subdomain and the Second Level Domain (SLD) name, which commonly refers to the
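The URL components named in this section can be separated with the Python standard library, as in the minimal sketch below. The function name `split_url` is hypothetical, the TLD and path fields are added here for completeness, and the split assumes that the last host label is the TLD and the one before it is the SLD. That assumption is a simplification and is not taken from the paper.

```python
from __future__ import annotations

from urllib.parse import urlsplit


def split_url(url: str) -> dict[str, str]:
    """Break a URL into protocol, subdomain, SLD, TLD and path.

    Naive assumption: the last host label is the TLD and the label
    before it is the SLD. Multi-label suffixes are not handled.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    if len(labels) >= 2:
        tld = labels[-1]
        sld = labels[-2]
        subdomain = ".".join(labels[:-2])
    else:
        # Single-label hosts (e.g. "localhost") have no TLD or subdomain.
        tld, sld, subdomain = "", host, ""
    return {
        "protocol": parts.scheme,
        "subdomain": subdomain,
        "sld": sld,
        "tld": tld,
        "path": parts.path,
    }
```

For a host such as `brand.com.example.net`, this split reports `example` as the SLD and `brand.com` as the subdomain, which shows how a familiar name can appear in a URL without being the registered domain. Hosts with multi-label suffixes such as `co.uk` would be split incorrectly; handling them requires a public suffix list.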

The proposed system and data processing

The dataset and its processing are very important parts of the machine learning based systems. The performance and efficiency of the system are directly related to them. Therefore, in this section, these topics are detailed.

Experimental results

This section gives the experimental details of the proposed model's classification algorithms and the feature extraction types used (NLP based features, Word Vectors, and Hybrid). Then, the comparative test results between these algorithms with the related features are presented.

Advantages of the proposed approach

As can be seen from the design of the proposed system, the comparison table of machine learning based phishing detection systems in Table 1, and the experimental results, our model has six main advantages, as listed below.

Language independence: In most phishing detection systems, language is critical for the execution of the system. However, in the proposed system, we use only URLs, whose texts are constructed with random and long strings, which contain some specific keywords in

Conclusion and future works

In this paper, we have implemented a phishing detection system by using seven different machine learning algorithms, namely Decision Tree, Adaboost, K-star, kNN (n = 3), Random Forest, SMO and Naive Bayes, and different numbers/types of features, namely NLP based features, word vectors, and hybrid features. To increase the accuracy of the detection system, constructing an efficient feature list is a crucial task. Therefore, we have grouped our feature list in two different classes as NLP based

Acknowledgment

Thanks to Roksit for their support in the implementation of this work.

References (43)

K.L. Chiew et al.
A survey of phishing attacks: Their types, vectors and technical approaches
Expert Systems with Applications
(2018)
S.R. Curtis et al.
Phishing attempts among the dark triad: Patterns of attack and vulnerability
Computers in Human Behavior
(2018)
D. Goel et al.
Mobile phishing attacks and defence mechanisms: State of art and open research challenges
Computers & Security
(2018)
I. Qabajeh et al.
A recent review of conventional vs. automated cybersecurity anti-phishing techniques
Computer Science Review
(2018)
S. Smadi et al.
Detection of online phishing email using dynamic evolving neural network based on reinforcement learning
Decision Support Systems
(2018)
C.L. Tan et al.
Phishwho: Phishing webpage detection via identity keywords extraction and target domain name finder
Decision Support Systems
(2016)
M. Volkamer et al.
User experiences of torpedo: Tooltip-powered phishing email detection
Computers and Security
(2017)
S. Abu-Nimeh et al.
A comparison of machine learning techniques for phishing detection
M. Babagoli et al.
Heuristic nonlinear regression strategy for detecting phishing websites
Soft Computing
(2018)
E. Buber et al.
Detecting phishing attacks from URL by using NLP techniques
E. Buber et al.
NLP based phishing attack detection from URLs
Y. Cao et al.
Anti-phishing based on automated individual white-list
D.L. Cook et al.
Phishwish: A stateless phishing filter using minimal rules
Financial cryptography and data security, Berlin, Heidelberg: Springer
(2008)
F. Feng et al.
The application of a novel neural network in the detection of phishing websites
Journal of Ambient Intelligence and Humanized Computing
(2018)
A.Y. Fu et al.
Detecting phishing web pages with visual similarity assessment based on earth mover's distance
IEEE Transactions on Dependable and Secure Computing
(2006)
Gibberish detector
A small program to detect gibberish using a markov chain
Google Developers
K. Greene et al.
No phishing beyond this point
Computer
(2018)
B.B. Gupta et al.
Defending against phishing attacks: Taxonomy of methods, current issues and future directions
Telecommunication Systems
(2018)

