Machine learning based phishing detection from URLs
Introduction
With the rapid development of global networking and communication technologies, many daily activities, such as social networking, electronic banking, and e-commerce, have moved to cyberspace. The open, anonymous, and uncontrolled infrastructure of the Internet offers an excellent platform for cyberattacks, which pose serious security threats not only to networks but also to ordinary computer users, even experienced ones. Although the care and experience of the user are important, it is not possible to completely prevent users from falling for phishing scams (Greene, Steves, & Theofanos, 2018), because attackers, to increase their success rate, also take the personality characteristics of the end user into account, especially when deceiving relatively experienced users (Curtis, Rajivan, Jones, & Gonzalez, 2018). End-user-targeted cyberattacks cause massive losses of sensitive and personal information, and even of money, which can total billions of dollars a year (Shaikh, Shabut, & Hossain, 2016).
The term “phishing” is derived from “fishing” for victims, and this type of attack has attracted a great deal of attention from researchers in recent years. It is also a promising and attractive technique for attackers (also called phishers), who set up fraudulent websites whose design closely imitates that of popular, legitimate sites on the Internet. Although these pages have similar graphical user interfaces, they must have different Uniform Resource Locators (URLs) from the original page. In principle, a careful and experienced user can detect these malicious web pages by looking at the URLs. However, in the rush of daily life, end users often do not inspect the whole address of the page they are viewing, which they typically reach through other web pages, social networking tools, or simply an email message, as depicted in Fig. 1. With such fraudulent URLs, a phisher tries to capture sensitive and personal information of the victim, such as financial data, usernames, and passwords (Gupta, Arachchilage, & Psannis, 2018). A user who lands on such a fraudulent site and believes it to be the original may hand over sensitive information without any doubt, because the page looks exactly the same as the original.
A related study of user experiences with phishing attacks (Volkamer, Renaud, Reinheimer, & Kunz, 2017) found that computer users fall for phishing for five main reasons:
- Users do not have detailed knowledge about URLs.
- Users do not know which web pages can be trusted.
- Users do not see the whole address of the web page because of redirection or hidden URLs.
- Users do not have much time to check the URL, or they enter some web pages accidentally.
- Users cannot distinguish phishing web pages from legitimate ones.
The Anti-Phishing Working Group published a report on the state of phishing attacks in the last quarter of 2016 (APWG, 2017). It emphasized that phishing attacks especially target end users in developing countries: China ranks first with a rate of 47.09% (infected computers), followed by Turkey and Taiwan with rates of 42.88% and 38.98%, respectively. Additionally, with the increased use of smartphones, end users are less careful when checking their social networks on the move. Attackers therefore target mobile device users to increase the efficiency of their attacks (Goel & Jain, 2018).
Several studies in the literature focus on detecting phishing attacks. Recent surveys discuss the general characteristics of existing phishing techniques by categorizing the technical approaches used in these types of attacks, and they highlight some practical and effective countermeasures (Chiew, Yong, & Tan, 2018; Qabajeh, Thabtah, & Chiclana, 2018).
Phishing attacks exploit the vulnerabilities of human users; therefore, additional support systems are needed to protect systems and users. Protection mechanisms fall into two main groups: increasing user awareness and using additional software, as depicted in Fig. 2. Because of this human vulnerability, an attacker using new techniques can deceive even experienced users, who are convinced that the page is legitimate before they hand over sensitive information. Therefore, software-based phishing detection systems are preferred as decision support systems for the user. The most commonly used techniques are black/white lists (Cao, Han, & Le, 2008), image processing of the web page (Fu, Wenyin, & Deng, 2006; Toolan & Carthy, 2009), natural language processing (Stone, 2007), rules (Cook, Gurbani, & Daniluk, 2008), and machine learning (Abu-Nimeh, Nappa, Wang, & Nair, 2007).
A recent survey on phishing (Gupta et al., 2018) emphasized that whenever new solutions are proposed to counter phishing attacks, attackers find the weaknesses of those solutions and devise new attack types. It is therefore highly recommended that network security managers use hybrid models instead of a single approach.
In this paper, we focus on the real-time detection of phishing web pages by analyzing the URL of the web page with different machine learning algorithms (seven of which are implemented and compared in the paper) and different feature sets. In running a learning algorithm, not only the dataset but also the extraction of features from it is crucial. Therefore, we first collected a large number of legitimate and fraudulent web page URLs and constructed our own dataset. We then defined three types of feature sets, namely Word Vectors, NLP-based features, and Hybrid features, to measure the efficiency of the proposed system.
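To make the three feature-set types more concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the feature list used in the paper: the keyword tuple `SUSPICIOUS_WORDS`, the four hand-crafted features, and all function names are hypothetical. The hybrid set is assumed here to be a simple concatenation of the other two.

```python
import re

# Hypothetical keyword list, chosen only for illustration (not from the paper).
SUSPICIOUS_WORDS = ("login", "verify", "secure", "account")


def tokenize(url):
    """Split a URL into lowercase alphanumeric words."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def nlp_features(url):
    """Hand-crafted numeric features (illustrative, not the paper's list)."""
    words = tokenize(url)
    return [
        len(url),                                   # total URL length
        sum(ch.isdigit() for ch in url),            # number of digits
        url.count("."),                             # number of dots
        sum(w in SUSPICIOUS_WORDS for w in words),  # keyword hits
    ]


def word_vector(url, vocabulary):
    """Bag-of-words counts of the URL's words over a fixed vocabulary."""
    words = tokenize(url)
    return [words.count(term) for term in vocabulary]


def hybrid_features(url, vocabulary):
    """Assumed hybrid set: NLP-style features followed by the word vector."""
    return nlp_features(url) + word_vector(url, vocabulary)
```

Calling `hybrid_features(url, vocabulary)` returns one flat numeric list per URL, which is the shape most classifiers expect as input. In practice the vocabulary would be built from the words observed in the training URLs.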
The rest of the paper is organized as follows. The next section examines related work on phishing detection. Section 3 focuses on the factors that make detecting phishing attacks from URLs difficult. The proposed system and the acquisition of the dataset are detailed in Section 4. Comparative experiments and their results are presented in Section 5. The advantages of the proposed system are discussed in Section 6. Finally, conclusions and future work on this topic are presented.
Section snippets
Related works
Phishing detection systems are generally divided into two groups: list-based detection systems and machine learning based detection systems.
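As a rough illustration of the list-based group, the sketch below checks a URL's host name against a blacklist and a whitelist. The function name, the three verdict strings, and the choice to match on the exact host name are assumptions made for this example and do not describe any specific system from the literature.

```python
from urllib.parse import urlsplit


def list_based_verdict(url, whitelist, blacklist):
    """Classify a URL by exact host-name lookup in two sets (hypothetical helper)."""
    host = (urlsplit(url).hostname or "").lower()
    if host in blacklist:
        return "phishing"
    if host in whitelist:
        return "legitimate"
    # A host on neither list cannot be decided by lookup alone.
    return "unknown"
```

For example, `list_based_verdict("http://bad.example/login", {"example.com"}, {"bad.example"})` takes the blacklist branch. The "unknown" verdict marks the case that a pure lookup cannot decide, which is where a learned classifier can be applied.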
URLs and Attackers’ techniques
Attackers use various techniques to avoid detection by security mechanisms or system administrators. This section details some of these techniques. To understand the attackers' approach, the components of a URL must first be known. The basic structure of a URL is depicted in Fig. 3.
In the standard form, a URL starts with its protocol name used to access the web page. After that, the subdomain and the Second Level Domain (SLD) name, which commonly refers to the
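The URL components named above can be separated with Python's standard library, as in the following minimal sketch. The function name `split_url` is hypothetical. The `tld` and `path` fields come from general URL structure rather than from the truncated text above, and the split assumes that the last host label is the top-level domain, which does not hold for multi-label suffixes such as `co.uk`.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into protocol, subdomain, SLD, TLD, and path (naive sketch)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    # Naive assumption: last label is the TLD, the one before it is the SLD.
    # Multi-label suffixes need a public-suffix list instead.
    tld = labels[-1] if len(labels) >= 2 else ""
    sld = labels[-2] if len(labels) >= 2 else (labels[0] if labels else "")
    subdomain = ".".join(labels[:-2]) if len(labels) > 2 else ""
    return {
        "protocol": parts.scheme,
        "subdomain": subdomain,
        "sld": sld,
        "tld": tld,
        "path": parts.path,
    }
```

Calling `split_url("http://login.example.com/account/verify")` separates `login` as the subdomain from `example` as the SLD, which is the distinction a user has to make when inspecting an address by eye.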
The proposed system and data processing
The dataset and its processing are very important parts of machine learning based systems, and the performance and efficiency of the system depend directly on them. This section therefore details these topics.
Experimental results
This section details the classification algorithms of the proposed model and the feature extraction types used (NLP-based features, Word Vectors, and Hybrid). Then, the comparative test results of these algorithms with the related features are presented.
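As one example of the kind of classifier compared here, the sketch below implements a k-nearest-neighbour vote with k = 3, matching the kNN setting named in the conclusion. It is a plain-Python illustration, not the authors' implementation: the function name, the Euclidean distance metric, and the training-data format are assumptions.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_vector, label) pairs; Euclidean distance is
    assumed here, as the source does not state the metric.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Each feature vector would be the numeric representation of one URL, and the label would be its known class. With k = 3 and two classes, the vote cannot tie.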
Advantages of the proposed approach
As can be seen from the design of the proposed system, the comparison of machine learning based phishing detection systems in Table 1, and the experimental results, our model has six main advantages, as listed below.
Language independence: In most phishing detection systems, language is critical to the operation of the system. However, the proposed system uses only URLs, whose text consists of long, random strings, which contain some specific keywords in
Conclusion and future works
In this paper, we have implemented a phishing detection system using seven different machine learning algorithms, namely Decision Tree, Adaboost, K-star, kNN (n = 3), Random Forest, SMO, and Naive Bayes, and different numbers and types of features, namely NLP-based features, word vectors, and hybrid features. To increase the accuracy of the detection system, constructing an efficient feature list is a crucial task. Therefore, we have grouped our feature list in two different classes as NLP based
Acknowledgment
Thanks to Roksit for their support in the implementation of this work.
References (43)
- et al. A survey of phishing attacks: Their types, vectors and technical approaches. Expert Systems with Applications (2018)
- et al. Phishing attempts among the dark triad: Patterns of attack and vulnerability. Computers in Human Behavior (2018)
- et al. Mobile phishing attacks and defence mechanisms: State of art and open research challenges. Computers & Security (2018)
- et al. A recent review of conventional vs. automated cybersecurity anti-phishing techniques. Computer Science Review (2018)
- et al. Detection of online phishing email using dynamic evolving neural network based on reinforcement learning. Decision Support Systems (2018)
- et al. Phishwho: Phishing webpage detection via identity keywords extraction and target domain name finder. Decision Support Systems (2016)
- et al. User experiences of torpedo: Tooltip-powered phishing email detection. Computers and Security (2017)
- et al. A comparison of machine learning techniques for phishing detection
- et al. Heuristic nonlinear regression strategy for detecting phishing websites. Soft Computing (2018)
- Detecting phishing attacks from URL by using NLP techniques
- NLP based phishing attack detection from URLs
- Anti-phishing based on automated individual white-list
- Phishwish: A stateless phishing filter using minimal rules. Financial cryptography and data security, Berlin, Heidelberg: Springer
- The application of a novel neural network in the detection of phishing websites. Journal of Ambient Intelligence and Humanized Computing
- Detecting phishing web pages with visual similarity assessment based on earth mover's distance. IEEE Transactions on Dependable and Secure Computing
- A small program to detect gibberish using a markov chain
- Google Developers
- No phishing beyond this point. Computer
- Defending against phishing attacks: Taxonomy of methods, current issues and future directions. Telecommunication Systems