Elsevier

Computers & Security

Volume 114, March 2022, 102584
HDP-CNN: Highway deep pyramid convolution neural network combining word-level and character-level representations for phishing website detection

https://doi.org/10.1016/j.cose.2021.102584

Abstract

Phishing has become a prevalent method for attackers to steal users’ private data and commit fraud, posing a serious threat to Internet users. How to detect phishing websites has attracted great interest from both academia and industry. A popular approach is to use a support vector machine (SVM) to detect phishing websites. However, this approach relies on features designated by experts, so the prediction effectiveness of the model depends heavily on the quality of feature extraction. In addition, it cannot handle features that are not identifiable. Deep learning methods have therefore become popular, as they do not require manual feature engineering. However, many deep learning methods learn feature information of uniform resource locators (URLs) only at the character level and ignore the intrinsic connections between words. To address these limitations, we propose a novel highway deep pyramid convolution neural network (HDP-CNN), a deep convolutional network that combines character-level and word-level representation information. HDP-CNN first receives URL string sequences as input, then performs character-level embedding and word-level embedding separately. Afterward, it uses the Highway network to connect the character-level and word-level embedding representations of the URL and extracts local features of different sizes from the region embedding layer. Finally, it passes them into the designed deep pyramid structure network to capture the global representation of the URL. Our experiments illustrate that the information expressed by embedding vectors of different granularities differs subtly. By combining embedding feature information of different granularities, HDP-CNN performs better than methods based on a single kind of embedding feature information. In our experiments, we construct an imbalanced dataset in which the ratio of benign websites to phishing websites is close to 5:1. The experimental results demonstrate that our method outperforms other methods, with accuracy at 98.30%, true positive rate (TPR) at 99.18%, and true negative rate (TNR) at 94.34%.

Introduction

The development of the Internet has brought great social and economic progress, and the Internet has become an indispensable infrastructure. Unfortunately, technological advances have been accompanied by many complex security issues. In the hands of criminals, technology has been used to attack and defraud users. Such attacks include phishing, financial fraud, malware, and privacy breaches, which pose serious threats to Internet users (Wang et al., 2020).

As defined by the Anti-Phishing Working Group (APWG), phishing is a criminal act that uses social engineering and technology to steal users’ personal identity data and financial accounts: spoofed uniform resource locator (URL) addresses and emails lure users to fake websites, where their account information and passwords are stolen (APWG, 2020). The fake URLs used for such cyber attacks and scams are called phishing URLs. Phishing URLs also use secure sockets layer (SSL) or transport layer security (TLS) certificates to lead users into thinking the websites are legitimate. The losses caused by phishing attacks every year are enormous. Therefore, many researchers and practitioners have been working to design more effective methods to detect phishing URLs.

Methods based on machine learning (Sahoo et al., 2019) are widely used for phishing website detection. These approaches first extract from URLs the feature representations that help distinguish phishing URLs from benign ones, and then train a predictive model on the data represented by these features. This requires researchers to have the relevant domain knowledge to extract useful features, and different feature extraction methods lead to different training results.

However, deep learning methods such as convolutional neural networks (Kim, 2014) and long short-term memory networks (Bahnsen et al., 2017) can automatically discover hidden features from the original URL for training. These methods do not require manual extraction of features from URLs and depend less on data pre-processing, since neural networks are capable of extracting higher-level information from raw data. In general, deep learning methods may perform better than traditional machine learning methods. Yet precisely extracting semantic information from the URL’s character sequence is also a tough task for deep learning methods.

In this paper, we propose a novel highway deep pyramid convolution neural network (HDP-CNN) that combines character-level and word-level representations to detect phishing websites. Whereas most methods use only character-level information, our method obtains richer information from URL strings and thus improves the detection performance of the model. Specifically, HDP-CNN takes URL strings as input and performs character-level and word-level embedding. The character embedding matrix and word embedding matrix are then concatenated as the semantic representation of the URL, and a highway network is used to balance the weight of the two. Next, we feed the result into a CNN with convolution kernels of different sizes to extract local information of different lengths. We then concatenate the features extracted by the different convolution kernels and input them into DPCNN to capture the global representation of the URL. Finally, a fully connected layer produces the prediction of whether the URL is a phishing website. Because our model combines word-level embedding with character-level embedding, it overcomes both the inability of word-level representation to process out-of-vocabulary words and the weakness of character-level representation when dealing with long sentences. In addition, the highway network prevents character-level information from being overwhelmed by word-level information, which could otherwise happen because the word list is much larger than the character list. Moreover, the DPCNN network structure can make the network deeper without adding much computational cost, which makes the model training converge quickly.
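The first two stages of the pipeline described above, building two views of the same URL and balancing them with a highway layer, can be illustrated with a minimal sketch. This is not the authors' implementation: the word-splitting rule (split on non-alphanumeric characters), the function names, and the choice of the standard highway formulation y = T(x)·H(x) + (1 − T(x))·x with a ReLU transform are assumptions, since this excerpt does not specify them.

```python
import math
import re


def char_tokens(url):
    """Character-level view: every character of the URL is one token."""
    return list(url)


def word_tokens(url):
    """Word-level view: split on non-alphanumeric delimiters.

    The delimiter rule is an assumption made for illustration only.
    """
    return [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway(x, w_h, b_h, w_t, b_t):
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x, element-wise.

    H is a ReLU affine transform and T is a sigmoid gate. Both weight
    matrices are square (lists of rows) so that y has the same size as x.
    """
    def affine(w, b):
        return [sum(wij * xj for wij, xj in zip(row, x)) + bi
                for row, bi in zip(w, b)]

    h = [max(0.0, v) for v in affine(w_h, b_h)]   # transformed features
    t = [sigmoid(v) for v in affine(w_t, b_t)]    # gate values in (0, 1)
    # Gate near 1 passes the transform; gate near 0 carries x unchanged.
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]
```

For the hypothetical URL `http://login.example.com/verify?id=42`, `word_tokens` yields `['http', 'login', 'example', 'com', 'verify', 'id', '42']`, while `char_tokens` yields one token per character. In a real model the vector `x` passed to `highway` would be the concatenated character-level and word-level embeddings, and the gate weights would be learned.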

The main contributions of our work are as follows:

  1. We propose a novel highway deep pyramid convolution neural network (HDP-CNN), a deep convolutional network that combines both character-level and word-level representation information to predict whether a given URL is a phishing or a benign website.

  2. We construct an imbalanced dataset to verify the performance of HDP-CNN. The dataset contains nearly 420,000 samples, and the ratio of benign to phishing samples is approximately 5:1.

  3. Our model was trained on a real dataset, and the experimental results show that HDP-CNN outperforms other methods, with accuracy at 98.30%, true positive rate (TPR) at 99.18%, and true negative rate (TNR) at 94.34%.
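Accuracy, TPR, and TNR are standard quantities derived from the confusion matrix, and reporting all three matters on imbalanced data: with a 5:1 class ratio, a classifier that always predicts the majority class already reaches about 83.3% accuracy (5/6). The sketch below computes the three metrics from label lists. The function names are hypothetical, and because this excerpt does not state which class is treated as positive, the positive label is left as a parameter.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy_tpr_tnr(y_true, y_pred, positive=1):
    """Return (accuracy, true positive rate, true negative rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0   # recall on the positive class
    tnr = tn / (tn + fp) if (tn + fp) else 0.0   # recall on the negative class
    return accuracy, tpr, tnr
```

For example, with true labels `[1, 1, 1, 0, 0]` and predictions `[1, 1, 0, 0, 1]`, the function returns an accuracy of 0.6, a TPR of 2/3, and a TNR of 0.5.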

The rest of this paper is organized as follows. Section 2 reviews the related work on phishing website detection. Section 3 describes the framework and details of our method. Section 4 presents our experiments and makes comparisons with other methods. Section 5 concludes the paper and discusses future work.

Related work

There has been much excellent research on the problem of phishing website detection. We describe deep learning-based phishing website detection methods, which are a hot spot in current phishing website detection research.

Deep learning methods have achieved good results in many text-related tasks, such as text classification and machine translation. In the detection of phishing websites, we can treat URL strings as text sequences, use deep learning methods to learn feature

Problem formulation

Our goal is to predict whether a given URL is a phishing or a benign website. Therefore, we define the task as a binary classification problem. Specifically, we are given a URL dataset $U$ in which each entry contains a URL string and a corresponding label (phishing or benign): $U = \{(x_i, y_i) \mid i = 1, \dots, N\}$, where $x_i$ represents a URL string in the dataset and $y_i \in \{0, 1\}$ represents the label corresponding to that URL, with $y_i = 0$ representing a benign website and $y_i = 1$ representing a phishing website, and $N$ means there are $N$ data

Dataset and metrics

We crawled a large amount of real data from the PhishTank and Alexa websites to build a dataset containing nearly 420,000 samples. We divided the dataset into a training set, a validation set, and a test set in the ratio 7:1.5:1.5. That is, we use 70% of the data to train the HDP-CNN model, so that the model can learn enough URL features to generalize and predict whether an unknown URL in the test data is a phishing website.
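A 7:1.5:1.5 split of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the excerpt does not say whether the split was shuffled, seeded, or stratified by class, so the sketch uses a plain seeded random shuffle, and the function name and parameters are hypothetical. Percentages are handled with integer arithmetic to avoid floating-point rounding at the boundaries.

```python
import random


def split_dataset(samples, train_pct=70, val_pct=15, seed=0):
    """Shuffle samples and split them into train, validation, and test lists.

    The test set receives whatever remains after the train and validation
    portions, so the three parts are disjoint and cover every sample.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)   # seeded for reproducibility
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Applied to exactly 420,000 labeled URLs, this would give 294,000 training, 63,000 validation, and 63,000 test samples. On a dataset this imbalanced, a stratified split that preserves the class ratio in each part may be preferable.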

In addition, the

Conclusion and future work

In this paper, we propose the HDP-CNN method for detecting phishing websites, a deep convolutional neural network that combines the word-level and character-level representations of a URL. It consists of four modules: an embedding module, a highway network module, a region embedding module, and a deep pyramid module. In addition, we constructed an imbalanced dataset that is close to the ratio of benign websites to phishing websites in the real Internet environment to verify the performance

CRediT authorship contribution statement

Faan Zheng: Conceptualization, Methodology, Validation. Qiao Yan: Data curation, Writing – original draft. Victor C.M. Leung: Writing – review & editing. F. Richard Yu: Writing – review & editing. Zhong Ming: Supervision.

Declaration of Competing Interest

We declare that we have no financial and personal relationships with other people or organizations that can inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61976142, 61836005, 61672358) and Shenzhen Science and Technology Plan Project (JCYJ20210324093609025).

Faan Zheng is a graduate student in the College of Computer Science and Software Engineering at Shenzhen University, Shenzhen, China. His research interests are in network security and machine learning.

References (21)

  • H. Wang et al. The impact of propagation delay to different selfish miners in proof-of-work blockchains. Peer-to-Peer Networking and Applications (2020)
  • A. Aljofey et al. An effective phishing detection model based on character level convolutional neural network from URL. Electronics (2020)
  • APWG, 2020. Phishing activity trends report, 2nd quarter 2020....
  • A.C. Bahnsen et al. Classifying phishing URLs using recurrent neural networks. 2017 APWG Symposium on Electronic Crime Research (eCrime) (2017)
  • S.-J. Bu et al. Deep character-level anomaly detection based on a convolutional autoencoder for zero-day phishing URL detection. Electronics (2021)
  • S.-J. Bu et al. Integrating deep learning with first-order logic programmed constraints for zero-day phishing attack detection. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021)
  • Y. Huang et al. Phishing URL detection via capsule-based neural network. 2019 IEEE 13th International Conference on Anti-counterfeiting, Security, and Identification (ASID) (2019)
  • R. Johnson et al. Deep pyramid convolutional neural networks for text categorization. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (2017)
  • Y. Kim. Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
  • Le, H., Pham, Q., Sahoo, D., Hoi, S. C. H., 2018. URLNet: Learning a URL representation with deep learning for...
There are more references available in the full text version of this article.

Qiao Yan is a Professor in the College of Computer Science and Software Engineering at Shenzhen University, Shenzhen, China. Her research interests are in network security, software-defined networking and machine learning.

Victor C. M. Leung (Fellow, IEEE) is currently a Distinguished Professor of computer science and software engineering with Shenzhen University, Shenzhen, China. He is also an Emeritus Professor of electrical and computer engineering and the Director of the Laboratory for Wireless Networks and Mobile Systems, The University of British Columbia (UBC), Vancouver, Canada. He is a fellow of the Royal Society of Canada, Canadian Academy of Engineering, and Engineering Institute of Canada. His research is in the broad areas of wireless networks and mobile systems.
