Phishing websites detection via CNN and multi-head self-attention on imbalanced datasets☆
Introduction
Phishing is an online social engineering attack in which a website pretends to be a legitimate entity, leading to privacy leakage, identity theft and property damage (Peng, Guangzhen, Peng, 2019; Verma, Das, 2017). Phishers aim to trick online users into revealing financial information such as credit card numbers and passwords (Rajab, 2018), which poses a great threat to Internet users, and the problem is becoming more and more serious. According to the Phishing Activity Trends Report (Report from APWG, 2018) published by the Anti-Phishing Working Group (APWG), the total numbers of phishing sites detected by APWG in Q1, Q2, Q3 and Q4 of 2018 were 263,538, 233,040, 151,014 and 138,328, respectively. Meanwhile, the RSA Online Fraud Report (Rsa online fraud report, 2019d) reveals that phishing attacks cost global organizations $4.6 billion in losses in 2015, and this number has been increasing in recent years.
To detect phishing websites, both industry and academia have made great efforts. For example, Google has set up a blacklist that gathers a large number of reported phishing websites and applies the list in its own browser, Chrome (Liang et al., 2016). Other companies resort to other means, such as toolbars or browser extensions, to identify and block phishing websites (Cui et al., 2017). In addition, corporations such as Panda (Panda security, 2019) and McAfee (Mcafee phishing protection, 2019) have integrated anti-phishing services into their anti-virus software.
In the meantime, researchers have also proposed various methods from different academic angles. For instance, the simplest practice, a blacklist or whitelist, works by maintaining a list of illegitimate or legitimate Uniform Resource Locators (URLs). The weakness of this approach is that the lists cannot cover all phishing websites, so it cannot defend against tricks such as zero-day attacks (Aravindhan, Shanmugalakshmi, Ramya, Chinnaiyan, 2016; Mohammad, Thabtah, Mccluskey, 2012). To overcome this drawback, many machine learning methods, e.g., Naive Bayes, J48 tree, Random Forest, Logistic Regression, Support Vector Machine and AdaBoostM1 (Zhang et al., 2011), have shown great advantages by extracting features from URLs or webpage contents and training a classifier to give a final verdict. In the extraction and training procedures, however, they basically rely on manually prepared expert knowledge, which may make the final verdict subjective. As an improvement, deep learning methods such as Long Short Term Memory (LSTM), Deep Belief Networks and Convolutional Neural Network (CNN) have been applied to phishing detection to avoid the subjectivity caused by manually extracted features (Correa Bahnsen, Contreras Bohorquez, Villegas, Vargas, González, 2017; Peng, Guangzhen, Peng, 2019; Zhang, Li). The underlying principle of this improvement is that the features are not designed by human engineers, but learned from the data by a general-purpose learning procedure (Lecun et al., 2015). However, these deep learning approaches still suffer from unsatisfactory accuracy.
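To make the notion of manually prepared features concrete, the following minimal sketch extracts a few lexical features from a URL. The feature set, the `SUSPICIOUS_TOKENS` list and the function name `lexical_features` are illustrative assumptions, not the features used by any of the cited methods.

```python
from urllib.parse import urlparse

# Hypothetical keyword list chosen for illustration only (not from the paper).
SUSPICIOUS_TOKENS = ("login", "verify", "secure", "account", "update")


def lexical_features(url: str) -> dict:
    """Return a few hand-crafted lexical features of a URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        # Crude IPv4 check: four dot-separated groups made of digits only.
        "host_is_ip": host.count(".") == 3 and host.replace(".", "").isdigit(),
        "uses_https": parsed.scheme == "https",
        "num_suspicious_tokens": sum(tok in url.lower() for tok in SUSPICIOUS_TOKENS),
    }


if __name__ == "__main__":
    for example in (
        "https://www.example.com/index.html",
        "http://192.0.2.10/secure-login/verify.php?id=123",
    ):
        print(example, lexical_features(example))
```

Every entry in the returned dictionary encodes a human judgment about what looks suspicious, which is the source of subjectivity that learned representations try to avoid.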
Moreover, most of the machine learning and deep learning methods mentioned above do not consider the problem of imbalanced training datasets. The problem arises because legitimate URLs greatly outnumber phishing ones. In this situation, the classifier learns more features from the majority class, which may bias the results (Verma and Das, 2017).
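The effect of imbalance can be seen in a toy calculation. The sketch below assumes an illustrative split of 95 legitimate and 5 phishing URLs (these numbers are not from the paper) and a degenerate classifier that always predicts the majority class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary task with the given positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy, assumed data: label 0 = legitimate, label 1 = phishing.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always answers "legitimate".
y_pred_majority = [0] * 100

if __name__ == "__main__":
    print("accuracy:", accuracy(y_true, y_pred_majority))
    print("phishing recall:", recall(y_true, y_pred_majority))
```

In this toy case the classifier is right on 95 of 100 URLs yet catches none of the phishing ones, which is why accuracy alone can hide a bias toward the majority class.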
To balance the training datasets and improve the accuracy of phishing website identification, in this paper we propose self-attention CNN, a high-accuracy phishing website detection approach based on CNN (Ketkar, 2017) and Multi-Head Self-Attention (Vaswani et al., 2017). Self-attention CNN first uses a Generative Adversarial Network (GAN) to produce phishing URLs so as to balance the datasets between phishing and legitimate URLs (Chawla, Bowyer, Lawrence, Philip Kegelmeyer, 2002; Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio, 2014). Next, we combine CNN and multi-head self-attention to build our classifier, which has four important blocks: the input block, the attention block, the feature block and the output block. On the balanced datasets, the input block transforms URLs into matrices, which are duplicated. The two copies are then fed into the attention block and the feature block, respectively, to obtain attention weights and learn features. Finally, the output block gives the detection result. Specifically, we adopt the multi-head self-attention mechanism in the attention block, which can find the inner dependency relationships between different characters of URLs. This helps our method learn comprehensive URL representations. Extensive experiments show that our generated data can balance the dataset and that our method detects phishing websites more accurately. The accuracy of self-attention CNN reaches 95.6%, which is higher than that of CNN-LSTM, single CNN and single LSTM by 1.4%, 4.6% and 2.1%, respectively.
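As a rough illustration of the input block and the attention block, the sketch below encodes a URL as a character-level matrix and applies multi-head scaled dot-product self-attention to it. It is a simplified sketch under stated assumptions, not the authors' implementation: the alphabet, the fixed sinusoidal embedding, `max_len`, `dim` and all function names are hypothetical, the learned query/key/value and output projections of Vaswani et al. (2017) are replaced by plain slices of the embedding, and the CNN feature block and the output block are omitted.

```python
import math

# Assumed character vocabulary; id 0 is reserved for padding and unknown characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=%"
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def encode_url(url, max_len=16, dim=8):
    """Map a URL to a max_len x dim matrix (truncated or zero-padded)."""
    ids = [CHAR_TO_ID.get(ch, 0) for ch in url.lower()[:max_len]]
    ids += [0] * (max_len - len(ids))
    # Fixed toy embedding; a trained model would learn this table instead.
    return [[math.sin(cid * (k + 1)) for k in range(dim)] for cid in ids]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)  # how strongly this character attends to every other one
        weights.append(w)
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return outputs, weights


def multi_head_self_attention(x, num_heads=2):
    """Run attention independently on slices of the embedding and concatenate."""
    d = len(x[0])
    if d % num_heads != 0:
        raise ValueError("embedding dimension must be divisible by num_heads")
    head_dim = d // num_heads
    heads = []
    for h in range(num_heads):
        sub = [row[h * head_dim:(h + 1) * head_dim] for row in x]
        out, _ = self_attention(sub)
        heads.append(out)
    return [sum((head[i] for head in heads), []) for i in range(len(x))]


if __name__ == "__main__":
    matrix = encode_url("http://example.com/login")
    attended = multi_head_self_attention(matrix, num_heads=2)
    print(len(attended), "positions x", len(attended[0]), "dimensions")
```

Each row of the attention weights sums to one and describes how much one character position draws on every other position, which is the sense in which self-attention captures dependencies between the characters of a URL.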
The main contributions of our work are as follows:
- We use GAN to generate synthetic URLs to balance the training dataset. These URLs are so similar to real-world phishing URLs that they greatly facilitate training an unbiased classifier.
- CNN and multi-head self-attention are combined to construct a new classifier, which improves the detection results. To the best of our knowledge, this is the first such attempt in phishing website detection.
- We conduct a series of experiments, which show that our approach obtains high accuracy in phishing URL identification. The results also show that our method is superior to the classic schemes.
To evaluate our proposed self-attention CNN model, we pose five questions to discuss the performance of our method:
Q1: Do different ratios between real-world legitimate URLs and real-world phishing URLs impact the classification results?
Q2: How do results on a dataset combining GAN-generated phishing URLs with real-world URLs compare with results on a dataset containing only real-world URLs?
Q3: How do parameters influence the performance of our classifier?
Q4: How does our classifier compare with other existing schemes?
Q5: Are synthetic URLs generated by GAN more useful for classification than those produced by other methods?
The rest of the paper is organized as follows. Section 2 summarizes different methods for detecting phishing websites. Section 3 introduces some background, e.g., imbalanced data classification, Convolutional Neural Network and multi-head self-attention. Section 4 describes our method in detail, including using GAN to balance the dataset and constructing our new network. Section 5 presents our experiments and results on different datasets and makes some comparisons. Finally, Section 6 concludes the paper and points out our future work.
Related work
Existing phishing detection efforts approach the problem from different angles and can be broadly divided into four categories.
Blacklist-based methods are widely used by many companies and browsers. They record a large number of phishing websites via different techniques, such as searching the web for known phishing characteristics (Correa Bahnsen et al., 2017; Zhang et al., 2008; Ma et al., 2009). For example, Google Safe Browsing holds its own blacklist to block recorded phishing websites when users
Background
In this section, we introduce the imbalanced data classification problem and list some approaches to solve the problem. Next, we illustrate the construction of CNN. Finally, multi-head self-attention is explained.
Methodology
To reach a high-accuracy result for phishing URL detection, we improve both the training dataset and the construction of the classifier. Briefly, we first use GAN to generate realistic synthetic phishing URLs and form the balanced training dataset along with real-world normal URLs. Meanwhile, based on the balanced dataset, we design a classifier combining CNN and multi-head self-attention, which takes URLs as input and gives a verdict (positive/negative) as output. The detailed
Experiments
In this section, we first introduce the datasets and metrics used in our experiments. Then, we run experiments to test whether GAN can be used to generate URLs that balance the datasets and train the classifier more accurately. Finally, we compare several deep learning networks with our new network on different datasets, which shows that our classifier is more accurate.
Conclusion and future works
To identify phishing websites with high accuracy, in this paper we propose self-attention CNN, a deep learning approach combining CNN and multi-head self-attention. We first introduce GAN to generate phishing URLs, making the training dataset balanced. Then a new deep network involving CNN and multi-head self-attention is built to detect phishing websites. Experiments show that the classifier on the dataset of generative URLs and real-world URLs can achieve better
CRediT authorship contribution statement
Xi Xiao: Conceptualization, Methodology. Wentao Xiao: Writing - original draft. Dianyan Zhang: Investigation, Writing - original draft. Bin Zhang: Project administration. Guangwu Hu: Writing - review & editing. Qing Li: Visualization, Validation. Shutao Xia: Project administration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- et al. Using case-based reasoning for phishing detection. Procedia Comput. Sci. (2017)
- et al. Accurate and fast URL phishing detector: a convolutional neural network approach. Comput. Netw. (2020)
- et al. CANTINA: a content-based approach to detecting phishing web sites. (2007)
- 5000 BEST WEBSITES homepage, http://5000best.com/websites/;...
- et al. Certain investigation on web application security: phishing detection and phishing target discovery. (2016)
- et al. Lexical feature based phishing URL detection using online learning. (2010)
- Chan P., Stolfo S.. Toward scalable learning with non-uniform class and cost distributions: a case study in credit card...
- et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res (JAIR) (2002)
- Chen W., Xu A.W., Wei Z., Xu C.. Phishing detection research based on PSO-BP neural...
- et al. Pruning support vectors for imbalanced data classification. (2005)
- DeltaPhish: detecting phishing webpages in compromised websites.
- Classifying phishing URLs using recurrent neural networks.
- Tracking phishing attacks over time.
- Mobile malware attacks and defense.
- Web phishing detection based on graph mining.
- Generative adversarial nets. ArXiv
- Identification of phishing webpages and its target domains by analyzing the feign relationship. J. Inf. Secur. Appl.
- A comprehensive and efficacious architecture for detecting phishing webpages. Comput. Secur.
- Deep residual learning for image recognition.
- Towards detection of phishing websites on client-side using machine learning based approach. Telecommun. Syst.
- Knowledge distillation using output errors for self-attention end-to-end models. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- PhishDef: URL names say it all. Proceedings - IEEE INFOCOM
- Deep learning. Nature
- Cracking classifiers for evasion: a case study on the google’s phishing pages filter. International Conference on World Wide Web
Xi Xiao is an associate professor in Graduate School at Shenzhen, Tsinghua University. He received his Ph.D. degree in 2011 from the State Key Laboratory of Information Security, Graduate University of Chinese Academy of Sciences. His research interests focus on information security and computer networks.
Wentao Xiao is pursuing his Master's degree in computer technology at Tsinghua University. His research interests focus on machine learning, deep learning, and cyberspace security.
Dianyan Zhang is pursuing his Master's degree in computer technology at Tsinghua University. His research interests focus on network security and deep learning.
Bin Zhang received his Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, China in 2012. He worked as a postdoctoral researcher at Nanjing Telecommunication Technology Institute from 2014 to 2017. He is now a researcher in the Cyberspace Security Research Center of Peng Cheng Laboratory. His current research interests focus on network anomaly detection, Internet architecture and its protocols, network traffic measurement, information privacy security, etc.
Guangwu Hu is an associate professor at Shenzhen Institute of Information Technology, China. He received his Ph.D. degree in computer science and technology from Tsinghua University in 2014. His research interests include software defined networking, Next-Generation Internet and Internet security.
Shutao Xia is a professor and a Ph.D. supervisor in Graduate School at Shenzhen, Tsinghua University. He received his Ph.D. degree from Nankai University in 1997. He is mainly engaged in information theory and coding, the Internet and big data. He has published more than 60 papers in top international journals and international conferences.
☆ This work is supported in part by the National Key Research and Development Program of China (2018YFB1800204, 2018YFB1800601), the National Natural Science Foundation of China (61972219, 61771273), Natural Science Foundation of Guangdong Province (2021A1515012640), and the R&D Program of Shenzhen (JCYJ20190813174403598, SGDX20190918101201696, JCYJ20190813165003837).