Elsevier

Computers & Security

Volume 108, September 2021, 102372

Phishing websites detection via CNN and multi-head self-attention on imbalanced datasets

https://doi.org/10.1016/j.cose.2021.102372

Abstract

Phishing is a social engineering attack in which perpetrators imitate legitimate websites to lure visitors and illegally acquire their identities, passwords, private data and even property. The attack poses a serious threat and is growing more severe. Many proposals for identifying phishing websites have shown their merits. For example, the classical CNN-LSTM approach achieves very high precision by combining a Convolutional Neural Network (CNN) with Long Short-Term Memory (LSTM). However, although CNN has achieved great success in AI, LSTM still suffers from a bias problem: it treats later features as much more important than earlier ones. Meanwhile, because the self-attention mechanism can discover the inner dependency relationships of a text, it has been widely applied to deep learning-based Natural Language Processing (NLP) tasks. If a URL is treated as a text string, this mechanism can learn comprehensive URL representations. To further improve the accuracy of phishing website detection, we propose self-attention CNN, a novel CNN with self-attention for identifying phishing Uniform Resource Locators (URLs). Specifically, self-attention CNN first uses a Generative Adversarial Network (GAN) to generate phishing URLs so as to balance the datasets of legitimate and phishing URLs. It then uses CNN and multi-head self-attention to construct a new classifier comprising four blocks: the input block, the attention block, the feature block and the output block. Finally, the trained classifier gives a high-accuracy result for an unknown website URL. Thorough experiments indicate that self-attention CNN achieves 95.6% accuracy, outperforming CNN-LSTM, single CNN and single LSTM by 1.4%, 4.6% and 2.1%, respectively.

Introduction

A phishing website is an online social engineering attack that pretends to be a legitimate entity and leads to privacy leakage, identity theft and property damage (Peng, Guangzhen, Peng, 2019, Verma, Das, 2017). Phishers aim to trick online users into revealing financial information such as credit card numbers and passwords (Rajab, 2018), which poses a great threat to Internet users, and the problem is becoming increasingly serious. According to the Phishing Activity Trends Report (Report from APWG, 2018) published by the Anti-Phishing Working Group (APWG), the total numbers of phishing sites detected by APWG in Q1, Q2, Q3 and Q4 of 2018 were 263538, 233040, 151014 and 138328, respectively. Meanwhile, the RSA Online Fraud Report (Rsa online fraud report, 2019d) reveals that phishing attacks cost global organizations $4.6 billion in losses in 2015, and this number has been increasing in recent years.

To detect phishing websites, both industry and academia have made great efforts. For example, Google maintains a blacklist that gathers a large number of reported phishing websites and applies it in its own browser, Chrome (Liang et al., 2016). Other companies resort to toolbars or browser extensions to identify and block phishing websites (Cui et al., 2017). In addition, companies such as Panda (Panda security, 2019) and McAfee (Mcafee phishing protection, 2019) have integrated anti-phishing services into their anti-virus software.

Meanwhile, researchers have proposed various methods from different angles. The simplest practice, blacklisting or whitelisting, works by maintaining a list of illegal or legitimate Uniform Resource Locators (URLs). The weakness of this approach is that the lists cannot cover all phishing websites, so it cannot withstand tricks such as zero-day attacks (Aravindhan, Shanmugalakshmi, Ramya, Chinnaiyan, 2016, Mohammad, Thabtah, Mccluskey, 2012). To overcome this drawback, many machine learning methods, e.g., Naive Bayes, J48 tree, Random Forest, Logistic Regression, Support Vector Machine and AdaBoostM1 (Zhang et al., 2011), have shown great advantages by extracting features from URLs or webpage contents and training a classifier to give a final verdict. In the extraction and training procedures, however, they rely largely on manually prepared expert knowledge, which can make the final verdict very subjective. As an improvement, deep learning methods such as Long Short-Term Memory (LSTM), Deep Belief Networks and Convolutional Neural Network (CNN) have been applied to phishing detection to avoid the subjectivity caused by manually extracted features (Correa Bahnsen, Contreras Bohorquez, Villegas, Vargas, González, 2017, Peng, Guangzhen, Peng, 2019, Zhang, Li). The underlying principle is that the features are not designed by human engineers but learned from the data by a general-purpose learning procedure (Lecun et al., 2015). However, these deep learning approaches still suffer from unsatisfactory accuracy.

Moreover, most of the machine learning and deep learning methods mentioned above do not consider the problem of imbalanced training datasets. The problem arises because legitimate URLs greatly outnumber phishing ones. In this situation, the classifier learns more features from the majority class, which may bias its results (Verma and Das, 2017).
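The effect of class imbalance on evaluation can be illustrated with a small sketch. The 95:5 class ratio and the always-legitimate "classifier" below are assumptions chosen only for illustration; they are not taken from the paper's datasets or experiments.

```python
# Hypothetical imbalanced test set: 95 legitimate URLs (label 0), 5 phishing URLs (label 1).
labels = [0] * 95 + [1] * 5

# A degenerate "classifier" that always predicts the majority (legitimate) class.
predictions = [0] * len(labels)

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0

majority_accuracy = accuracy(labels, predictions)
phishing_recall = recall(labels, predictions)
print(f"accuracy={majority_accuracy:.2f}, phishing recall={phishing_recall:.2f}")
```

Under these assumed numbers the majority-class predictor reaches 0.95 accuracy while detecting no phishing URL at all, which is the kind of bias that balancing the training data is meant to reduce.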

To balance the training datasets and improve the accuracy of phishing website identification, we propose self-attention CNN, a high-accuracy phishing website detection approach based on CNN (Ketkar, 2017) and Multi-Head Self-Attention (Vaswani et al., 2017). Self-attention CNN first uses a Generative Adversarial Network (GAN) to produce phishing URLs so as to balance the datasets between phishing and legitimate URLs (Chawla, Bowyer, Lawrence, Philip Kegelmeyer, 2002, Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio, 2014). Next, we combine CNN and multi-head self-attention to build our classifier, which contains four blocks: the input block, the attention block, the feature block and the output block. On the balanced datasets, the input block transforms URLs into matrices, which are duplicated. The two copies are then fed into the attention block and the feature block, respectively, to obtain attention weights and learn features. Finally, the output block gives the detection result. Specifically, we adopt the multi-head self-attention mechanism in the attention block, which can find the inner dependency relationships between different characters of URLs. This helps our method learn comprehensive URL representations. Extensive experiments show that our generated data can balance the dataset and that our method detects phishing websites more precisely. Self-attention CNN achieves an accuracy of 95.6%, which is higher than those of CNN-LSTM, single CNN and single LSTM by 1.4%, 4.6% and 2.1%, respectively.
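One plausible way for an input block to turn a URL into a matrix is character-level one-hot encoding with padding to a fixed length. The sketch below is written under that assumption; the alphabet, the maximum length of 32 and the names `encode_url` and `one_hot` are all hypothetical, and the paper's input block may instead use a learned embedding or a different vocabulary.

```python
import string

# Assumed alphabet: lowercase letters, digits and punctuation commonly seen in URLs.
ALPHABET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
PAD_INDEX = 0  # index 0 is shared by padding and characters outside the alphabet
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
MAX_LEN = 32   # assumed fixed URL length; longer URLs are truncated

def encode_url(url, max_len=MAX_LEN):
    """Map a URL to a fixed-length list of character indices."""
    indices = [CHAR_TO_INDEX.get(ch, PAD_INDEX) for ch in url.lower()[:max_len]]
    return indices + [PAD_INDEX] * (max_len - len(indices))

def one_hot(indices, vocab_size=len(ALPHABET) + 1):
    """Expand each index into a one-hot row, giving a (length x vocab_size) matrix."""
    return [[1 if j == idx else 0 for j in range(vocab_size)] for idx in indices]

url_matrix = one_hot(encode_url("http://example.com/login"))
print(len(url_matrix), len(url_matrix[0]))  # rows = MAX_LEN, columns = vocabulary size
```

A matrix of this shape could then be copied and passed to two downstream branches, as the text describes for the attention block and the feature block.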

The main contributions of our work are as follows:

  • We use GAN to generate synthetic URLs that balance the training dataset. These URLs are so similar to real-world phishing URLs that they greatly facilitate training an unbiased classifier.

  • CNN and multi-head self-attention are combined to construct a new classifier, which improves the results. To the best of our knowledge, this is the first attempt to do so in phishing website detection.

  • We conduct a series of experiments, which illustrate that our approach obtains high accuracy in phishing URL identification. The results also show that our method is superior to the classic schemes.

To evaluate our proposed self-attention CNN model, we pose five questions to discuss the performance of our method:

Q1: Do different ratios between real-world legitimate URLs and real-world phishing URLs impact the classification results?

Q2: How do results on a dataset that combines GAN-generated phishing URLs with real-world URLs compare with results on a dataset of real-world URLs only?

Q3: How do parameters influence the performance of our classifier?

Q4: How does our classifier compare with other existing schemes?

Q5: Are URLs generated by GAN more useful for classification than those produced by other methods?

The rest of the paper is organized as follows. Section 2 summarizes different methods for detecting phishing websites. Section 3 introduces background, e.g., imbalanced data classification, Convolutional Neural Network and multi-head self-attention. Section 4 describes our method in detail, including using GAN to balance the dataset and constructing our new network. Section 5 presents our experiments and results on different datasets and makes some comparisons. Finally, Section 6 concludes the paper and points out our future work.

Section snippets

Related work

Many efforts have addressed phishing detection from different perspectives; they can mainly be divided into four categories.

Blacklist-based methods are widely used by many companies and browsers. They record large numbers of phishing websites via different techniques, such as searching the web for known phishing characteristics (Correa Bahnsen et al., 2017, Zhang et al., 2008, Ma et al., 2009). For example, Google Safe Browsing holds its own blacklist to block recorded phishing websites when users

Background

In this section, we introduce the imbalanced data classification problem and list some approaches to solving it. Next, we describe the structure of CNN. Finally, we explain multi-head self-attention.
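As a rough illustration of multi-head self-attention in the sense of Vaswani et al. (2017), the following sketch computes scaled dot-product attention, softmax(QKᵀ/√d_k)V, in several heads and concatenates the results. It is a minimal pure-Python sketch, not the paper's attention block: the projection matrices are random stand-ins for learned weights, and the sequence length, embedding size and number of heads are assumed values.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention; returns the output and the attention weights."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_self_attention(x, num_heads, rng):
    d_model = len(x[0])
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Random stand-ins for the learned projections W_Q, W_K, W_V of one head.
        w_q, w_k, w_v = (random_matrix(d_model, d_head, rng) for _ in range(3))
        out, _ = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        head_outputs.append(out)
    # Concatenate the heads position by position, then apply the output projection W_O.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    w_o = random_matrix(num_heads * d_head, d_model, rng)
    return matmul(concat, w_o)

rng = random.Random(0)
x = random_matrix(6, 8, rng)  # assumed: 6 URL characters, 8-dimensional vectors
y = multi_head_self_attention(x, num_heads=2, rng=rng)
print(len(y), len(y[0]))
```

Each row of the attention weights sums to 1 and expresses how strongly one position attends to every other position, which is how the mechanism can relate different characters of the same URL.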

Methodology

To reach a high-accuracy result for phishing URL detection, we improve both the training dataset and the construction of the classifier. Briefly, we first use GAN to generate realistic phishing URLs and combine them with real-world normal URLs to form a balanced training dataset. Then, based on the balanced dataset, we design a classifier combining CNN and multi-head self-attention that takes URLs as input and gives a verdict (positive/negative) as output. The detailed
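The bookkeeping behind forming a balanced training set can be sketched as follows. This is only an illustration of topping up the minority class with synthetic samples: `placeholder_generator` is a hypothetical stand-in and is not a GAN, the function names are invented for this sketch, and all URLs are made up under the reserved `.example` domain.

```python
import random

def balance_with_synthetic(legit_urls, phishing_urls, generate, seed=0):
    """Top up the phishing class with synthetic URLs until both classes are equal in size."""
    deficit = max(0, len(legit_urls) - len(phishing_urls))
    synthetic = generate(deficit, phishing_urls, random.Random(seed))
    # Label 0 = legitimate, label 1 = phishing (real or synthetic).
    return [(u, 0) for u in legit_urls] + [(u, 1) for u in phishing_urls + synthetic]

def placeholder_generator(n, real_phishing, rng):
    """NOT a GAN: a hypothetical stand-in that recombines hosts and paths of real samples."""
    hosts = [u.split("/", 1)[0] for u in real_phishing]
    paths = [u.split("/", 1)[1] for u in real_phishing if "/" in u]
    return [f"{rng.choice(hosts)}/{rng.choice(paths)}" for _ in range(n)]

# Made-up data: 6 legitimate URLs and 2 phishing URLs.
legit = [f"site{i}.example/home" for i in range(6)]
phishing = ["secure-login.example/verify", "account-update.example/signin"]

balanced = balance_with_synthetic(legit, phishing, placeholder_generator)
n_phishing = sum(label for _, label in balanced)
print(len(balanced), n_phishing)
```

In a setup like the one the text describes, the `generate` argument would be replaced by sampling from a trained GAN generator; passing it in as a parameter keeps the balancing logic independent of how the synthetic URLs are produced.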

Experiments

In this section, we first introduce the datasets and metrics used in our experiments. Then, we run experiments to test whether GAN can be used to generate URLs that balance the datasets and train the classifier more precisely. Finally, we compare several deep learning networks with our new network on different datasets, which shows that our classifier is more accurate.

Conclusion and future works

To identify phishing websites with high accuracy, in this paper we propose self-attention CNN, a deep learning approach that combines CNN and multi-head self-attention. We first introduce GAN to generate phishing URLs, making the training dataset balanced. Then a new deep network involving CNN and multi-head self-attention is built to detect phishing websites. Experiments show that the classifier on the dataset of generative URLs and real-world URLs can achieve better

CRediT authorship contribution statement

Xi Xiao: Conceptualization, Methodology. Wentao Xiao: Writing - original draft. Dianyan Zhang: Investigation, Writing - original draft. Bin Zhang: Project administration. Guangwu Hu: Writing - review & editing. Qing Li: Visualization, Validation. Shutao Xia: Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (52)

  • I. Corona et al. DeltaPhish: detecting phishing webpages in compromised websites (2017)
  • A. Correa Bahnsen et al. Classifying phishing URLs using recurrent neural networks (2017)
  • Q. Cui et al. Tracking phishing attacks over time (2017)
  • K. Dunham. Mobile malware attacks and defense (2009)
  • Z. Futai et al. Web phishing detection based on graph mining (2016)
  • I. Goodfellow et al. Generative adversarial nets. ArXiv (2014)
  • R. Gowtham et al. Identification of phishing webpages and its target domains by analyzing the feign relationship. J. Inf. Secur. Appl. (2017)
  • R. Gowtham et al. A comprehensive and efficacious architecture for detecting phishing webpages. Comput. Secur. (2014)
  • K. He et al. Deep residual learning for image recognition (2016)
  • A. Jain et al. Towards detection of phishing websites on client-side using machine learning based approach. Telecommun. Syst. (2017)
  • N. Ketkar. Convolutional neural...
  • H.G. Kim et al. Knowledge distillation using output errors for self-attention end-to-end models. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019)
  • Y. Kim. Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical...
  • A. Le et al. PhishDef: URL names say it all. Proceedings - IEEE INFOCOM (2010)
  • Y. Lecun et al. Deep learning. Nature (2015)
  • B. Liang et al. Cracking classifiers for evasion: a case study on the google’s phishing pages filter. International Conference on World Wide Web (2016)

    Xi Xiao is an associate professor in the Graduate School at Shenzhen, Tsinghua University. He received his Ph.D. degree in 2011 from the State Key Laboratory of Information Security, Graduate University of Chinese Academy of Sciences. His research interests focus on information security and computer networks.

    Wentao Xiao is pursuing his Master's degree in computer technology at Tsinghua University. His research interests focus on machine learning, deep learning, and cyberspace security.

    Dianyan Zhang is pursuing his Master's degree in computer technology at Tsinghua University. His research interests focus on network security and deep learning.

    Bin Zhang received his Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, China in 2012. He worked as a postdoctoral researcher at Nanjing Telecommunication Technology Institute from 2014 to 2017. He is now a researcher in the Cyberspace Security Research Center of Peng Cheng Laboratory. His current research interests focus on network anomaly detection, Internet architecture and its protocols, network traffic measurement, information privacy security, etc.

    Guangwu Hu is an associate professor at Shenzhen Institute of Information Technology, China. He received his Ph.D. degree in computer science and technology from Tsinghua University in 2014. His research interests include software defined networking, Next-Generation Internet and Internet security.

    Shutao Xia is a professor and a Ph.D. supervisor in the Graduate School at Shenzhen, Tsinghua University. He received his Ph.D. degree from Nankai University in 1997. His work mainly concerns information theory and coding, the Internet and big data. He has published more than 60 papers in top international journals and international conferences.

    This work is supported in part by the National Key Research and Development Program of China (2018YFB1800204, 2018YFB1800601), the National Natural Science Foundation of China (61972219, 61771273), Natural Science Foundation of Guangdong Province (2021A1515012640), and the R&D Program of Shenzhen (JCYJ20190813174403598, SGDX20190918101201696, JCYJ20190813165003837).