Phishing websites detection via CNN and multi-head self-attention on imbalanced datasets

doi:10.1016/j.cose.2021.102372

Computers & Security

Volume 108, September 2021, 102372

https://doi.org/10.1016/j.cose.2021.102372

Abstract

Phishing websites are a form of social engineering attack in which perpetrators imitate legitimate websites to lure visitors and illegally obtain their identities, passwords, private data and even property. The attack poses a serious and growing threat. Many proposals for identifying phishing websites have shown their merits. For example, the classical CNN-LSTM proposal achieves very high precision by combining a Convolutional Neural Network (CNN) with Long Short-Term Memory (LSTM). However, although CNN has achieved great success in AI, LSTM still suffers from a bias problem because it tends to treat later features as much more important than earlier ones. Meanwhile, the self-attention mechanism can discover the dependency relationships within a text and has therefore been widely applied to deep learning-based Natural Language Processing (NLP) tasks. If we treat a URL as a text string, this mechanism can learn comprehensive URL representations. To further improve the accuracy of phishing website detection, in this paper we propose a novel CNN with self-attention, named self-attention CNN, for identifying phishing Uniform Resource Locators (URLs). Specifically, self-attention CNN first leverages a Generative Adversarial Network (GAN) to generate phishing URLs so as to balance the datasets of legitimate and phishing URLs. It then uses CNN and multi-head self-attention to construct a new classifier comprising four blocks: the input block, the attention block, the feature block and the output block. Finally, the trained classifier gives a high-accuracy result for an unknown website URL. Thorough experiments indicate that self-attention CNN achieves 95.6% accuracy, outperforming CNN-LSTM, single CNN and single LSTM by 1.4%, 4.6% and 2.1% respectively.

Introduction

Phishing is an online social engineering attack in which a website pretends to be a legitimate entity, leading to privacy leakage, identity theft and property damage (Peng, Guangzhen, Peng, 2019, Verma, Das, 2017). Phishers aim to trick online users into revealing financial information such as credit card numbers and passwords (Rajab, 2018), which poses a great threat to Internet users, and the problem is becoming increasingly serious. According to the Phishing Activity Trends Report (Report from APWG, 2018) published by the Anti-Phishing Working Group (APWG), the total numbers of phishing sites detected by APWG in Q1, Q2, Q3 and Q4 of 2018 were 263538, 233040, 151014 and 138328, respectively. Meanwhile, the RSA Online Fraud Report (Rsa online fraud report, 2019d) reveals that phishing attacks cost global organizations $4.6 billion in losses in 2015, and this number has been increasing in recent years.

Industry and academia have made great efforts to detect phishing websites. For example, Google maintains a blacklist of a large number of reported phishing websites and applies it in its own browser, Chrome (Liang et al., 2016). Other companies resort to toolbars or browser extensions to identify and block phishing websites (Cui et al., 2017). In addition, corporations such as Panda (Panda security, 2019) and McAfee (Mcafee phishing protection, 2019) have integrated anti-phishing services into their anti-virus software.

In the meantime, researchers have proposed various methods from different academic angles. The simplest practice, a blacklist or whitelist, works by maintaining a list of illegal or legitimate Uniform Resource Locators (URLs). The weakness of this approach is that the lists cannot cover all phishing websites, so it cannot defend against tricks such as zero-day attacks (Aravindhan, Shanmugalakshmi, Ramya, Chinnaiyan, 2016, Mohammad, Thabtah, Mccluskey, 2012). To overcome this drawback, many machine learning methods, e.g., Naive Bayes, J48 tree, Random Forest, Logistic Regression, Support Vector Machine and AdaBoostM1 (Zhang et al., 2011), have shown great advantages by extracting features from URLs or webpage contents and training a classifier to give a final verdict. In the extraction and training procedures, they rely largely on manually prepared expert knowledge, which may make the final verdict very subjective. As an improvement, deep learning methods such as Long Short-Term Memory (LSTM), Deep Belief Networks and Convolutional Neural Network (CNN) have been applied to phishing detection to avoid the subjectivity caused by manually extracted features (Correa Bahnsen, Contreras Bohorquez, Villegas, Vargas, González, 2017, Peng, Guangzhen, Peng, 2019, Zhang, Li). The underlying principle of deep learning behind this improvement is that the features are not designed by human engineers but learned from the data by a general-purpose learning procedure (Lecun et al., 2015). However, the accuracy of these deep learning approaches is still unsatisfactory.
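To make the blacklist weakness concrete, here is a minimal Python sketch of an exact host lookup. The host names, the one-entry list and the helper names `host_of` and `is_blacklisted` are hypothetical and are not taken from the paper or from any real blacklist.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host part of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


def is_blacklisted(url, blacklist):
    """Exact-match lookup: only hosts already on the list are flagged."""
    return host_of(url) in blacklist


# Hypothetical one-entry blacklist (the .example TLD is reserved for documentation).
blacklist = {"paypa1-login.example"}

known = "http://paypa1-login.example/signin"    # already reported
fresh = "http://paypa1-login2.example/signin"   # newly registered variant

print(is_blacklisted(known, blacklist))
print(is_blacklisted(fresh, blacklist))
```

The second URL differs from the listed one by a single character, yet the lookup misses it because the list only knows hosts that were reported earlier.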

Moreover, most of the machine learning and deep learning methods mentioned above do not consider the problem of imbalanced training datasets. The problem arises because legitimate URLs greatly outnumber phishing ones. In this situation, the classifier learns more features from the majority class, which may bias its results (Verma and Das, 2017).
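A small arithmetic sketch shows why imbalance matters. The class counts below (9500 legitimate, 500 phishing) are an illustrative assumption, not figures from the paper, and `majority_baseline` is a hypothetical helper.

```python
def majority_baseline(n_legitimate, n_phishing):
    """Score a classifier that always answers 'legitimate'.

    Returns (accuracy, phishing_recall). Assumes n_phishing > 0.
    """
    total = n_legitimate + n_phishing
    accuracy = n_legitimate / total   # every legitimate URL is "correct"
    phishing_recall = 0 / n_phishing  # no phishing URL is ever caught
    return accuracy, phishing_recall


# Assumed 95:5 class ratio, chosen only for illustration.
accuracy, phishing_recall = majority_baseline(9500, 500)
print(f"accuracy={accuracy:.2f}, phishing recall={phishing_recall:.2f}")
```

Under this assumed ratio the trivial classifier reaches 95% accuracy while detecting no phishing URL at all, which is the kind of biased outcome an imbalanced training set can encourage.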

To balance the training datasets and improve the accuracy of phishing website identification, in this paper we propose self-attention CNN, a high-accuracy phishing website detection approach based on CNN (Ketkar, 2017) and Multi-Head Self-Attention (Vaswani et al., 2017). Self-attention CNN first uses a Generative Adversarial Network (GAN) to produce phishing URLs so as to balance the datasets of phishing and legitimate URLs (Chawla, Bowyer, Lawrence, Philip Kegelmeyer, 2002, Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio, 2014). Next, we combine CNN and multi-head self-attention to build our classifier. The classifier has four important blocks: the input block, the attention block, the feature block and the output block. On the balanced datasets, the input block transforms URLs into matrices, which are duplicated. The two copies are then fed into the attention block and the feature block, respectively, to obtain attention weights and learn features. Finally, the output block gives the detection result. Specifically, we adopt the multi-head self-attention mechanism in the attention block, which can find the dependency relationships between different characters of a URL. This helps our method learn comprehensive URL representations. Extensive experiments show that our generated data can balance the dataset and that our method detects phishing websites more precisely. Self-attention CNN achieves an accuracy of 95.6%, which is higher than that of CNN-LSTM, single CNN and single LSTM by 1.4%, 4.6% and 2.1% respectively.
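The input block's job of turning a URL into a matrix can be sketched as follows. This is only one plausible encoding: the character alphabet, the padding index 0, the fixed length `max_len` and the one-hot representation are all assumptions made for illustration, since the text here does not specify how the matrices are built.

```python
import string

# Assumed alphabet: ASCII letters, digits and punctuation.
# Index 0 is reserved for padding and for characters outside the alphabet.
ALPHABET = string.ascii_letters + string.digits + string.punctuation
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
VOCAB_SIZE = len(ALPHABET) + 1


def encode_url(url, max_len=200):
    """Map a URL to a fixed-length list of character indices (truncate or pad)."""
    ids = [CHAR_TO_INDEX.get(ch, 0) for ch in url[:max_len]]
    return ids + [0] * (max_len - len(ids))


def one_hot(ids, vocab_size=VOCAB_SIZE):
    """Turn a list of indices into a len(ids) x vocab_size 0/1 matrix."""
    return [[1 if col == i else 0 for col in range(vocab_size)] for i in ids]


ids = encode_url("http://a.example", max_len=20)  # hypothetical URL
matrix = one_hot(ids)
print(len(matrix), "rows x", len(matrix[0]), "columns")
```

A trained model would more likely map each index to a learned embedding vector than to a one-hot row; the sketch only shows how a string becomes a fixed-shape matrix that can be passed to both the attention block and the feature block.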

The main contributions of our work are as follows:

• We use GAN to generate synthetic URLs to balance the training dataset. These URLs are so similar to real-world phishing URLs that they greatly facilitate training an unbiased classifier.
• We combine CNN and multi-head self-attention to construct a new classifier, which improves the results. To the best of our knowledge, this is the first such attempt in phishing website detection.
• We conduct a series of experiments, which show that our approach obtains high accuracy in phishing URL identification. The results also show that our method is superior to the classic schemes.

To evaluate our proposed self-attention CNN model, we pose five questions to discuss the performance of our method:

Q1: Do different ratios between real-world legitimate URLs and real-world phishing URLs impact the classification results?

Q2: How do results on a dataset that combines GAN-generated phishing URLs with real-world URLs compare with results on a dataset containing only real-world URLs?

Q3: How do parameters influence the performance of our classifier?

Q4: How does our classifier compare with other existing schemes?

Q5: Are the URLs generated by GAN more useful for classification than those produced by other methods?

The rest of the paper is organized as follows. Section 2 summarizes different methods for detecting phishing websites. Section 3 introduces background on imbalanced data classification, Convolutional Neural Network and multi-head self-attention. Section 4 describes our method in detail, including using GAN to balance the dataset and constructing our new network. Section 5 presents our experiments and results on different datasets and makes some comparisons. Finally, Section 6 concludes the paper and points out our future work.

Section snippets

Related work

Many efforts have shown their merits from different perspectives; they can mainly be divided into four categories.

Blacklist-based methods are widely used by many companies and browsers. They record large numbers of phishing websites via different techniques, such as searching the web for known phishing characteristics (Correa Bahnsen et al., 2017, Zhang et al., 2008, Ma et al., 2009). For example, Google Safe Browsing holds its own blacklist to block recorded phishing websites when users

Background

In this section, we introduce the imbalanced data classification problem and list some approaches to solve the problem. Next, we illustrate the construction of CNN. Finally, multi-head self-attention is explained.
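For readers unfamiliar with multi-head self-attention, the following dependency-free Python sketch follows the standard scaled dot-product formulation of Vaswani et al. (2017): each head computes softmax(QK^T / sqrt(d_k))V, and the heads are concatenated and projected. It is not the authors' implementation. The random projection matrices stand in for learned parameters, and the sequence length, model width and number of heads are arbitrary assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix used as a placeholder for trained weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_self_attention(x, num_heads, rng):
    """x is a seq_len x d_model matrix; returns (output, per-head attention weights)."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    heads, weights = [], []
    for _ in range(num_heads):
        w_q, w_k, w_v = (random_matrix(d_model, d_k, rng) for _ in range(3))
        q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
        k_t = [list(col) for col in zip(*k)]   # transpose of K
        scores = matmul(q, k_t)                # scores[i][j] = q_i . k_j
        attn = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
        heads.append(matmul(attn, v))          # weighted sum of value vectors
        weights.append(attn)
    # Concatenate the heads position by position, then apply the output projection.
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o), weights


rng = random.Random(0)
x = random_matrix(6, 8, rng)  # 6 URL characters, 8-dimensional embeddings (assumed sizes)
out, attn = multi_head_self_attention(x, num_heads=2, rng=rng)
print(len(out), len(out[0]), len(attn))
```

Row i of each head's attention matrix says how strongly character i attends to every other character, which is the sense in which the mechanism captures dependencies between characters of a URL.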

Methodology

To reach a high-accuracy result for phishing URL detection, we improve both the training dataset and the construction of the classifier. Briefly, we first use GAN to generate realistic phishing URLs and form a balanced training dataset together with real-world normal URLs. Then, based on the balanced dataset, we design a classifier that combines CNN and multi-head self-attention, takes URLs as input and gives a verdict (positive/negative) as output. The detailed
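The balancing step can be sketched independently of how the synthetic URLs are produced. In the Python sketch below, `toy_generator` is a hypothetical stand-in for a trained GAN generator (it just emits random host names), and the label convention 1 = phishing, 0 = legitimate is an assumption.

```python
import random


def balance_with_synthetic(legitimate, phishing, generate, seed=0):
    """Add synthetic phishing URLs until both classes are the same size.

    `generate` is any callable taking a random.Random and returning one URL.
    Returns a shuffled list of (url, label) pairs.
    """
    rng = random.Random(seed)
    deficit = max(0, len(legitimate) - len(phishing))
    synthetic = [generate(rng) for _ in range(deficit)]
    data = [(url, 0) for url in legitimate]
    data += [(url, 1) for url in list(phishing) + synthetic]
    rng.shuffle(data)
    return data


def toy_generator(rng):
    """Hypothetical placeholder for a GAN generator; NOT a generative model."""
    chars = "abcdefghijklmnopqrstuvwxyz0123456789-"
    name = "".join(rng.choice(chars) for _ in range(12))
    return f"http://{name}.example/login"


legitimate = [f"http://site{i}.example/" for i in range(10)]  # made-up URLs
phishing = [f"http://phish{i}.example/" for i in range(3)]    # made-up URLs
balanced = balance_with_synthetic(legitimate, phishing, toy_generator)
print(sum(label for _, label in balanced), "phishing of", len(balanced))
```

Keeping the generator behind a plain callable means the same balancing code could be reused with a different source of minority samples.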

Experiments

In this section, we first introduce the datasets and metrics used in our experiments. Then, we run experiments to test whether GAN can be used to generate URLs that balance the datasets and train the classifier more precisely. Finally, we compare several deep learning networks with our new network on different datasets, which shows that our classifier is more accurate.
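The snippet above does not list the metrics, so the following Python sketch assumes the usual binary-classification set (accuracy, precision, recall and F1) computed from confusion-matrix counts. The labels in the example are made up, with 1 meaning phishing.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1; undefined ratios fall back to 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 3 phishing and 5 legitimate URLs.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(metrics(y_true, y_pred))
```

On imbalanced data, reporting precision and recall alongside accuracy helps expose a classifier that simply favours the majority class.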

Conclusion and future works

To identify phishing websites with high accuracy, in this paper we propose self-attention CNN, a deep learning approach that combines CNN and multi-head self-attention. We first introduce GAN to generate phishing URLs, which balances the training dataset. Then we build the new deep network involving CNN and multi-head self-attention to detect phishing websites. Experiments show that the classifier on the dataset of generative URLs and real-world URLs can achieve better

CRediT authorship contribution statement

Xi Xiao: Conceptualization, Methodology. Wentao Xiao: Writing - original draft. Dianyan Zhang: Investigation, Writing - original draft. Bin Zhang: Project administration. Guangwu Hu: Writing - review & editing. Qing Li: Visualization, Validation. Shutao Xia: Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Xi Xiao is an associate professor in Graduate School at Shenzhen, Tsinghua University. He received his Ph.D. degree in 2011 from the State Key Laboratory of Information Security, Graduate University of Chinese Academy of Sciences. His research interests focus on information security and computer networks.

References (52)

H. Abutair et al. Using case-based reasoning for phishing detection. Procedia Comput. Sci. (2017)
W. Wei et al. Accurate and fast URL phishing detector: a convolutional neural network approach. Comput. Netw. (2020)
Y. Zhang et al. CANTINA: a content-based approach to detecting phishing web sites (2007)
5000 BEST WEBSITES homepage, http://5000best.com/websites/;...
R. Aravindhan et al. Certain investigation on web application security: phishing detection and phishing target discovery (2016)
A. Blum et al. Lexical feature based phishing URL detection using online learning (2010)
Chan P., Stolfo S.. Toward scalable learning with non-uniform class and cost distributions: a case study in credit card...
N. Chawla et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res (JAIR) (2002)
Chen W., Xu A.W., Wei Z., Xu C.. Phishing detection research based on PSO-BP neural...
X.w. Chen et al. Pruning support vectors for imbalanced data classification (2005)
I. Corona et al. DeltaPhish: detecting phishing webpages in compromised websites (2017)
A. Correa Bahnsen et al. Classifying phishing URLs using recurrent neural networks (2017)
Q. Cui et al. Tracking phishing attacks over time (2017)
K. Dunham. Mobile malware attacks and defense (2009)
Z. Futai et al. Web phishing detection based on graph mining (2016)
I. Goodfellow et al. Generative adversarial nets. ArXiv (2014)
R. Gowtham et al. Identification of phishing webpages and its target domains by analyzing the feign relationship. J. Inf. Secur. Appl. (2017)
R. Gowtham et al. A comprehensive and efficacious architecture for detecting phishing webpages. Comput. Secur. (2014)
K. He et al. Deep residual learning for image recognition (2016)
A. Jain et al. Towards detection of phishing websites on client-side using machine learning based approach. Telecommun. Syst. (2017)
Ketkar N.. Convolutional neural...
H.G. Kim et al. Knowledge distillation using output errors for self-attention end-to-end models. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019)
Kim Y.. Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical...
A. Le et al. PhishDef: URL names say it all. Proceedings - IEEE INFOCOM (2010)
Y. Lecun et al. Deep learning. Nature (2015)
B. Liang et al. Cracking classifiers for evasion: a case study on the google’s phishing pages filter. International Conference on World Wide Web (2016)


Wentao Xiao is pursuing his Master's degree in computer technology at Tsinghua University. His research interests focus on machine learning, deep learning, and cyberspace security.

Dianyan Zhang is pursuing his Master's degree in computer technology at Tsinghua University. His research interests focus on network security and deep learning.

Bin Zhang received his Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, China in 2012. He worked as a postdoctoral researcher at Nanjing Telecommunication Technology Institute from 2014 to 2017. He is now a researcher in the Cyberspace Security Research Center of Peng Cheng Laboratory. His current research interests focus on network anomaly detection, Internet architecture and its protocols, network traffic measurement, information privacy security, etc.

Guangwu Hu is an associate professor at Shenzhen Institute of Information Technology, China. He received his Ph.D. degree in computer science and technology from Tsinghua University in 2014. His research interests include software defined networking, Next-Generation Internet and Internet security.

Shutao Xia is a professor and a Ph.D. supervisor in Graduate School at Shenzhen, Tsinghua University. He received his Ph.D. degree from Nankai University in 1997. He is mainly engaged in information theory and coding, internet and big data. He has published more than 60 papers in top international journals and international conferences.

^☆: This work is supported in part by the National Key Research and Development Program of China (2018YFB1800204, 2018YFB1800601), the National Natural Science Foundation of China (61972219, 61771273), Natural Science Foundation of Guangdong Province (2021A1515012640), and the R&D Program of Shenzhen (JCYJ20190813174403598, SGDX20190918101201696, JCYJ20190813165003837).
